- Raman Walwyn-Venugopal - rsw2@illinois.edu
- Srikanth Bharadwaz Samudrala - sbs7@illinois.edu
- Satish Reddy Asi - sasi2@illinois.edu
The goal of this project is to perform topic mining and classification on telehealth encounter nursing notes, identifying notes that documented a positive outcome for the patient from the telehealth services. To accomplish this, we divided the project into four steps:
- curating the dataset
- building a topic miner and mining topics from the dataset
- performing analysis on the topics
- building a binary classifier that attempts to predict whether a document represents a positive outcome for the patient
Requirements:
Note: This code was run on Ubuntu 18.04 and Ubuntu 20.04
The data comes from TimeDocHealth, whose care team focuses on providing telehealth services to patients with multiple chronic diseases. Two CSV files, each containing 10,000 records, were exported from the TimeDoc system. One file, named positive_encounters.csv, contained only notes that were labelled as a positive outcome due to the telehealth services, while the other file, named no_positive_encounters.csv, contained only notes that weren't labelled as a positive outcome for the patient.
The format of the exported CSV files is as follows:
<note_id>,<patient_id>,<purpose>,<duration>,<note>.
<purpose> is an array of attributes of the telehealth encounter; it is selected from a pre-defined list and can provide insights into the actions taken during the telehealth encounter. <duration> is the total amount of time the telehealth encounter took. <note> is the free-text nursing note summarizing the encounter, and is the data that the topic mining and classification are performed on.
To ensure we're adhering to HIPAA Privacy Guidelines, Protected Health Information (PHI) was redacted using the De-Identification (DEID) software package. For the DEID to be effective, it had to be configured with the following lists:
- pid_patientname.txt - patient names and identifiers. Created by referencing all the patients from the two exported CSV files and curating a file formatted with each line as <PATIENT_ID>||||<PATIENT_FIRST_NAME>||||<PATIENT_LAST_NAME>
- doctor_first_names.txt - doctor first names. Created by exporting each care team member for the patient, such as their Primary Care Provider, Radiologist, etc.
- doctor_last_names.txt - doctor last names. Created using the same strategy as doctor first names.
- unambig_local_places.txt - locations near the patient. Created using the cities and towns of addresses for patients and businesses near them.
- company_names.txt - company names. Created by listing out local healthcare organizations surrounding the patient.
For the DEID to perform the redaction of PHI, the notes had to be fed to it in a particular format, so the exported CSV files had to be transformed into the following format:
START_OF_RECORD=<PATIENT_ID>||||<DOCUMENT_ID>||||
<DOCUMENT_CONTENT>
||||END_OF_RECORD
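To make the target format concrete, here is a minimal Python sketch of the transformation (illustrative only; the project performs this step with the Ruby script described next):

import csv

# Illustrative sketch of the CSV -> DEID text conversion; it assumes the column
# order <note_id>,<patient_id>,<purpose>,<duration>,<note> shown above and no header row.
def convert_csv_to_deid_text(csv_path, text_path):
    with open(csv_path, newline='') as src, open(text_path, 'w') as dst:
        for note_id, patient_id, purpose, duration, note in csv.reader(src):
            dst.write(f"START_OF_RECORD={patient_id}||||{note_id}||||\n")
            dst.write(f"{note}\n")
            dst.write("||||END_OF_RECORD\n")

convert_csv_to_deid_text('demo_data/positive_encounters.csv', 'demo_data/positive_encounters.text')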
We accomplished this transformation for both of the exported CSV files using a Ruby script located at deid/convert_csv_to_text.rb and ran the following commands:
# convert csv files to deid text format
ruby deid/convert_csv_to_text.rb demo_data/positive_encounters.csv
ruby deid/convert_csv_to_text.rb demo_data/no_positive_encounters.csv
The output produced two files named positive_encounters.text and
no_positive_encounters.text respectively. Afterwards we ran the DEID perl
script to remove the PHI using the following commands:
# enter deid directory
cd deid
# redact PHI from text files
perl deid.pl ../demo_data/positive_encounters deid-output.config
perl deid.pl ../demo_data/no_positive_encounters deid-output.config
The output produced two PHI-redacted files named positive_encounters.res and no_positive_encounters.res. To convert the files back into CSV format, we used the script located at deid/convert_res_to_csv.rb and ran the following commands:
# convert redacted res files to csv
ruby deid/convert_res_to_csv.rb \
demo_data/positive_encounters.res \
demo_data/positive_encounters.csv
ruby deid/convert_res_to_csv.rb \
demo_data/no_positive_encounters.res \
demo_data/no_positive_encounters.csv
The output produced two files named positive_encounters.res.csv and
no_positive_encounters.res.csv.
Note: Since the DEID is an automated tool, we have to account for the possibility that not all PHI data was redacted. To minimize the amount of actual PHI distributed, 50 samples were taken from both the positive_encounters.res and no_positive_encounters.res files and manually verified to not contain PHI.
This sample may be provided upon request by emailing
rsw2@illinois.edu
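For reference, a comparable sampling step can be done with pandas (a hedged sketch; sampling from the converted .res.csv files and the output file names are assumptions):

import pandas as pd

# Sketch: draw 50 random records from each redacted file for manual PHI review.
for name in ['positive_encounters.res.csv', 'no_positive_encounters.res.csv']:
    df = pd.read_csv(f'demo_data/{name}')
    df.sample(n=50, random_state=42).to_csv(f'demo_data/sample_{name}', index=False)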
Requirements:
Python Libraries Used:
- nltk
- pandas
- numpy
- matplotlib/pylab
- regex
Extracting Documents
The documents are extracted from the notes. The telemedicine responses are saved as CSV files with multiple fields; the "notes" field from the response file is fed as the document input to our PLSA implementation. The input response file is a CSV file delimited by the "," character.
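For illustration, extracting the notes with pandas might look like this (a sketch; it assumes the CSV has a header row with a 'note' column matching the export format above):

import pandas as pd

# Sketch: read the exported encounters CSV and pull the free-text notes that
# are fed as documents to the PLSA implementation.
df = pd.read_csv('demo_data/all_encounters.res.csv')
documents = df['note'].astype(str).tolist()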
Generating stop words
Stop words are generated using the standard Python nltk library. The stop words are saved to a file that is used as input to the topic_miner program. The stop word list can be manually edited to add any telemedicine-specific words such as patient, call, treatment, phone, etc., since these are repeated frequently in every note. The stop words program is run separately, and the file is saved under the "patient_data" folder where the input files are placed.
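A minimal sketch of generating such a file with nltk (the output path and the extra telemedicine terms are illustrative):

import nltk
from nltk.corpus import stopwords

# Sketch: build a stop word list from nltk's English stop words plus
# telemedicine-specific terms, and save it one word per line.
nltk.download('stopwords')
words = set(stopwords.words('english'))
words.update(['patient', 'call', 'treatment', 'phone'])
with open('patient_data/stopwords.txt', 'w') as f:
    f.write('\n'.join(sorted(words)))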
Mining Topics from Documents
The topic_miner is run with the data file (in CSV format) and the stop words file as input. Additional arguments to the program include the number of topics, max iterations, threshold, and number of topic words. The arguments also include the paths to the output files:
- Document Topic Coverage
- Topic Word Coverage
- Vocabulary
- Topic words
More details about the module are available at: topic miner
Note: Due to the slow performance of our manually written PLSA topic miner, we created topic miner v2, which uses an open-source Python PLSA package and produces the same output files as our home-crafted PLSA topic miner.
# change to directory of topic miner
cd topic_miner_v2
# create new virtual environment
python -m venv venv
# activate virtual environment
source venv/bin/activate
# install required packages
pip install -r requirements.txt
# python topic_miner.py <path/to/encounters.res.csv> <number_of_topics>
python topic_miner.py ../demo_data/all_encounters.res.csv 10
Output would be:
# topic coverage of topic probability per document in corpus
all_encounters.res.csv.10-doc-topic-cov.txt
# grouping of words and probabilities of topic per line
all_encounters.res.csv.10-topic-word-probs-grouped.txt
# all the probabilities for each topic per line
all_encounters.res.csv.10-topic-word-probs.txt
# all the words for each topic per line
all_encounters.res.csv.10-topics.txt
# vocabulary of corpus
all_encounters.res.csv.vocab
Requirements:
This topic analysis script performs analysis on the results of the topic miner when both the positive and non-positive encounters are included in the whole corpus. It attempts to:
- Identify which topics are related to positive outcomes and which topics are related to non-positive outcomes
- Pull the top words from the positive outcome topics and non-positive outcome topics
- Highlight which top words from positive and non-positive overlap with each other versus which words are unique to their own topics
- Generate 3 files: pos-non-pos-topics.txt, top-pos-words.txt, and top-non-pos-words.txt
python topic_analysis/topic_analysis.py \
demo_data/all_encounters.res.csv.10-doc-topic-cov.txt \
demo_data/all_encounters.res.csv.10-topics.txt
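Conceptually, the first step can be sketched as comparing average topic coverage between the two groups (a simplified illustration, not the actual script; the coverage file layout and the label ordering below are assumptions):

import numpy as np

# Simplified sketch: load the per-document topic coverage matrix (assumed to be a
# whitespace-delimited file with one row of topic probabilities per document) and
# compare mean coverage between positive and non-positive documents. Topics with
# higher mean coverage in positive documents are treated as "positive" topics.
coverage = np.loadtxt('demo_data/all_encounters.res.csv.10-doc-topic-cov.txt')
is_positive = np.array([True] * 10000 + [False] * 10000)  # assumed document ordering
pos_mean = coverage[is_positive].mean(axis=0)
non_pos_mean = coverage[~is_positive].mean(axis=0)
positive_topics = np.where(pos_mean > non_pos_mean)[0]
non_positive_topics = np.where(pos_mean <= non_pos_mean)[0]
print('positive topics:', positive_topics, 'non-positive topics:', non_positive_topics)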
Requirements:
- Python 3.X
- Python Virtual Environment Package (Included in Python standard library)
The text classifier is responsible for reviewing the notes of the telehealth
encounters and classifying the note as positive outcome versus non-positive
outcome. The classifier module has the following features:
- Load positive and non-positive CSV files generated from the PHI De-identification process
- Clean data by removing PHI redaction sections, non-alphanumeric characters, extra white space, and stop words, and by applying lemmatization
- Generate a classifier using the RandomForestClassifier from sklearn
- Evaluate classifier by collecting Recall, Precision, F1 Score, micro averages per category, and the overall classification accuracy
- Store classifier to a file
- Load classifier from a file
- Score optimizer that steps through a combination of number of features and estimators for the classifier model and returns the optimal inputs and score
The process of generating the classifier requires the docs to be cleaned and
vectorized into TF-IDF weights. The vectorized version of the corpus was then
split into two sets: 80% for training and 20% for testing. The model used for
training is the RandomForestClassifier from sklearn, which builds a
'random forest' of numerous decision trees.
The core of the algorithm follows the steps below:
- Pick N random records from the dataset
- Build a decision tree on the randomly selected N records
- Choose the number of trees used in the algorithm and repeat steps 1 and 2
The algorithm is well-suited for classification because it is known to reduce bias through the use of multiple randomly formed decision trees, and it performs well when unknown data points are introduced. A disadvantage of the algorithm is that its complexity causes it to take longer to train and predict due to the number of decision trees.
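A condensed, hedged sketch of this pipeline using sklearn is shown below. The file paths, the 'note' column name, and the simplified cleaning are assumptions; the feature and estimator values follow the optimized defaults reported later.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Sketch: load the redacted positive and non-positive notes and label them.
pos = pd.read_csv('demo_data/positive_encounters.res.csv')
non_pos = pd.read_csv('demo_data/no_positive_encounters.res.csv')
docs = pos['note'].astype(str).tolist() + non_pos['note'].astype(str).tolist()
labels = ['positive'] * len(pos) + ['non-positive'] * len(non_pos)

# Vectorize the notes into TF-IDF weights, hold out 20% for testing,
# and fit a random forest of 750 decision trees.
vectorizer = TfidfVectorizer(max_features=1500, min_df=10, max_df=0.8, stop_words='english')
X = vectorizer.fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
model = RandomForestClassifier(n_estimators=750)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))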
# change directory to classifier
cd classifier
# initialize python virtual environment
python -m venv venv
source venv/bin/activate
# install dependencies
pip install -r requirements.txt
Be sure to update the following constants POSITIVE_CSV_FILE and
NO_POSITIVE_CSV_FILE to the true file paths of the redacted data produced
from the De-Identification process. Also update the CLASSIFIER_FILE for where
you want to store the classifier.
The classifier module can be run as a script to quickly generate a classifier
with the pre-optimized defaults determined from testing.
python classifier.py
This will load the data, clean the data, generate a classifier, print out the
evaluation metrics, and store the classifier to the path defined in the
CLASSIFIER_FILE constant. An example of the classifier evaluation is shown below.
precision recall f1-score support
non-positive 0.88 0.95 0.91 2014
positive 0.94 0.85 0.90 1842
accuracy 0.91 3856
macro avg 0.91 0.90 0.90 3856
weighted avg 0.91 0.91 0.90 3856
Accuracy: 0.9050829875518672
The classifier can be loaded and used on new documents. Enter the Python console and run the following commands:
import classifier
text_classifier = classifier.load(classifier.CLASSIFIER_FILE)
docs = [
'Scheduled transportation for patient appointment on Thursday',
'discussed personal goals with patient for patient to work on quitting smoking'
]
predictions = classifier.predict(text_classifier, docs)
print(predictions)
The classification accuracy score was optimized by varying the number of features and estimators (decision trees) used in the algorithm. A simple iterative algorithm calculated the accuracy for each feature/estimator combination and then returned the optimal score along with the combination used to achieve it.
The classifier module has an optimize_score function that accepts the
following arguments:
docs (default: cleaned version of the dataset) - complete corpus of documents
labels (default: labels from the dataset) - label for each document
min_features (default: 1000)- start number of features to use
max_features (default: 5000)- max number of features to use
feature_step (default: 250) - amount to increase number of features by
min_df (default: 10) - minimum document frequency for a feature to be selected
max_df (default: 0.8) - maximum document frequency for a feature to be selected
min_estimators (default: 750) - start number of estimators to use
max_estimators (default: 2500) - max number of estimators to use
estimator_step (default: 250) - amount to increase number of estimators by
It outputs a dictionary that contains the following keys:
feature_steps - varying features used
estimator_steps - varying estimators used
scores - 2-dimensional numpy array containing all scores generated. Shape is feature steps length x estimator steps length
optimal_score - The highest accuracy result from the iterations
optimal_num_features - The number of features used to generate optimal score
optimal_num_estimators - The number of estimators used to generate optimal score
The optimal number of features used was determined to be 1500 while the optimal number of estimators was determined to be 750.
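For example, invoking the optimizer over a narrower grid could look like this (the argument values are illustrative):

import classifier

# Sketch: search a small grid of feature/estimator counts and report the best combination.
result = classifier.optimize_score(min_features=1000, max_features=2000, feature_step=500,
                                   min_estimators=750, max_estimators=1250, estimator_step=250)
print(result['optimal_score'], result['optimal_num_features'], result['optimal_num_estimators'])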
As a bonus test, we tested the classifier predictions on the top positive and non-positive words generated from the topic analysis step.
# enter python console
python
# import classifier module
import classifier
# load stored classifier
text_classifier = classifier.load(classifier.CLASSIFIER_FILE)
# classify top positive words
f = open('../demo_data/all_encounters.res.csv.2-topics.txt.top-pos-words.txt', 'r')
pos_docs = [f.read()]
f.close()
print('top pos words: ', pos_docs[0])
print('top pos words classifier predictions: ', classifier.predict(text_classifier, pos_docs)[0])
# classify top non-positive words
f = open('../demo_data/all_encounters.res.csv.2-topics.txt.top-non-pos-words.txt', 'r')
non_pos_docs = [f.read()]
f.close()
print('top non pos words: ', non_pos_docs[0])
print('top non pos words classifier predictions: ', classifier.predict(text_classifier, non_pos_docs)[0])
Output is
top pos words: pharmacy appointment medication service information poa cuff call care sugar pressure concern blood morning meal state report insulin transportation time
top pos words classifier predictions: positive
top non pos words: pharmacy medication education exercise today appt goal inhaler level weight plan pressure knee minute state phone transportation day time ncm
top non pos words classifier predictions: non-positive
Automating the redaction of PHI data is practical and should be used by data scientists trying to perform analysis on free-text health data, both to respect patient privacy and to adhere to HIPAA rules. One thing to note is that the redaction process is slow on large datasets: redacting PHI on the 20,000-document dataset took nearly an hour running on an Intel i7 10th-gen processor. To avoid this issue in a production workflow with much larger datasets, an automated redaction pipeline should be considered, where a redaction process is triggered as soon as a note is created and the redacted note is stored in a separate bucket.
When performing topic mining with our home-crafted PLSA topic miner, we noted that its performance was poor on large datasets compared to an open-source PLSA Python package. This was likely due to an unoptimized implementation of the EM algorithm when handling large matrices. While performing topic analysis, we noticed that generating fewer topics made it easier to relate some topics to positive outcomes and other topics to non-positive outcomes. As we increased the topic count when performing PLSA, this was no longer the case, and the distributions of topics across the positive corpus and non-positive corpus were similar. From this behaviour, we can infer that there is a difference in the themes discussed in positive outcomes, but that there are also themes that clearly overlap.
The classifier we created performs well only on large datasets. On the sampled and demo datasets of only 100 records, the maximum classification accuracy achieved was 85%. When training the classifier on the complete corpus of 20,000 documents, the classification accuracy jumped to 90%. These maximum scores were calculated using an iterative algorithm that varied the number of features and the number of estimators used by the classifier. The classifier consistently had better precision (94%) when labelling positive documents versus non-positive, but worse recall (85%) over all positive documents. This means that a user can trust a positive classification but cannot expect every positive document to be retrieved, a trade-off that would be acceptable for a recommendation engine.
BONUS: The classifier was also tested on the documents containing the top words from positive and non-positive topics generated from the topic analysis step. The classifier correctly classified the doc containing words from positive topics as 'positive' and the doc containing words from non-positive topics as 'non-positive'.



