AI Mole
AI, alcohol and surgery

Danger lurks everywhere. Any human activity, be it work or pleasure, can be a source of serious problems or even lethal consequences. That sounds frightening and extremely pessimistic, and if one fears everything in the world, life loses its meaning; but no one forbids exercising caution where it is warranted. When it comes to alcohol consumption, caution is simply essential. Researchers from the University of Michigan (Ann Arbor, Michigan, USA) conducted a study suggesting that using artificial intelligence to scan surgical patients' medical records for signs of risky alcohol use could help identify those whose drinking increases their risk of problems during and after surgery. How was the AI configured, how did it perform, and what new things did it tell us? We find the answers to these questions in the researchers' report.

The basis of the study

Alcohol is the most commonly used addictive substance, and risky preoperative alcohol consumption (i.e., more than 2 standard drinks per day before surgery) is one of the most common surgical risk factors. Patients who drink beyond healthy limits, including those with alcohol use disorders, have increased rates of infections, wound and pulmonary complications, and prolonged hospital stays after elective surgical procedures. Individuals with risky alcohol use and alcohol use disorders consume the greatest share of postoperative health care resources due to complications, longer hospital stays, and readmissions.

In an era of limited resources, identifying patients with risky alcohol use can save lives and reduce costs. However, patients with risky alcohol use are often overlooked during preoperative evaluation. When identified in a timely manner, at-risk individuals can receive preoperative interventions and, if necessary, alcohol withdrawal prophylaxis before or after surgery. In practice, alcohol screening is often not performed at all, is conducted too close to the date of surgery, or relies on single-item questions of limited accuracy, resulting in incomplete or biased identification of risky alcohol use. Simply put, people with drinking problems rarely identify themselves as such.

In addition to improving proven screening methods, one way to close this gap in accurately identifying risky alcohol use is to make better use of existing data and technology. Electronic health records (EHRs) contain highly relevant real-world health data, updated in real time, for large numbers of patients over many years, especially in large integrated health systems. Clinical records may contain alcohol use history, alcohol screening data, details of alcohol-related events, diagnoses, and billing data. However, alcohol-related conditions often go undiagnosed, so the most readily available information in EHRs, such as diagnosis codes and International Classification of Diseases (ICD) problem lists, is of limited utility. Thus, EHRs do not reflect the true prevalence of alcohol use disorders or risky alcohol use in the population. Moreover, much of the alcohol-related information in clinical text is accessible only through detailed manual chart review, which is time-consuming and infeasible for many studies and clinical scenarios. Making full use of the alcohol-related data in EHRs therefore requires innovative methods.

An ideal tool for extracting meaningful information from unstructured clinical text is natural language processing (NLP). NLP is a field of artificial intelligence and machine learning that encompasses computational methods and techniques for extracting meaning from human language and text. It covers many areas of natural language understanding, including machine translation, syntactic and semantic analysis, coreference resolution, entity detection, and discourse analysis.

The paper presents the construction of an NLP-based algorithm to identify risky alcohol use among surgical patients using existing preoperative EHR data. This included clinical notes and other clinical text from the 3 years prior to surgery, drawn from inpatient, outpatient, and emergency department encounters.

Study Preparation

Researchers conducted an observational cohort study among preoperative patients enrolled in the Michigan Genomics Initiative (MGI).

Image #1: Selection of the study cohort from the MGI.

MGI recruited adult participants scheduled for surgery at Michigan Medical Center, creating a corpus of patient data, including longitudinal electronic health information, in a single cohort available to researchers. At the time of the study, 61502 patients were enrolled in MGI. As shown in Image #1, patients were included in the cohort if they enrolled in MGI between May 29, 2012 and April 17, 2019 and underwent surgery within 90 days of enrollment (N = 53949). Those whose clinical records could not be linked to the structured MGI data were excluded. The researchers extracted textual clinical records covering the 3 years preceding surgery from each patient's medical record.

The 3-year period was chosen to create a corpus of records covering a long but relatively recent period. The clinical text records included outpatient and inpatient records of patient admissions, treatment progress, procedures, and discharges, as well as letters, reports, laboratory tests, and comments on them. These records were combined chronologically to create a single document for each patient (referred to as "patient charts" in the following text). The final data corpus included N = 53811 patients.

The primary goal of the study was to categorize preoperative patients, using text-based clinical records from the past three years, into one of two classes: those with "risky alcohol use" (a category that also includes alcohol use disorders) and those without. Risky alcohol use was defined as meeting any of several criteria, summarized in the code sketch after this list:

Quantity/frequency limits. For women, this was defined as four or more standard drinks on any single day or eight or more drinks per week. For men, it was five or more drinks on any single day or 15 or more drinks per week.

AUDIT-C limits. Standard threshold scores from the Alcohol Use Disorders Identification Test-Consumption (AUDIT-C), a validated alcohol screening instrument commonly used before surgery, were also applied: a score of 3 or more for women and 4 or more for men indicated risky alcohol use.

Binge drinking, alcohol use disorders, and other textual indicators. Patients were also categorized as having risky alcohol use if their records indicated a past or present alcohol use disorder or another diagnosis implying one (e.g., alcohol-related cirrhosis), or if their records mentioned episodes of binge drinking or other comments pointing to problem drinking.
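As a rough illustration of how these gender-specific criteria combine, here is a minimal sketch in Python; the function name, argument structure, and defaults are my assumptions rather than the authors' code, and only the numeric thresholds themselves come from the study.

```python
# Hypothetical sketch of the gender-specific risky-drinking criteria described above.
# Only the numeric thresholds come from the study; names and structure are illustrative.

def is_risky_alcohol_use(sex, max_drinks_per_day=None, drinks_per_week=None,
                         audit_c_score=None, has_aud_evidence=False):
    """Return True if any of the study's risky-drinking criteria is met."""
    female = sex.lower().startswith("f")

    # Quantity/frequency limits: >=4/day or >=8/week (women), >=5/day or >=15/week (men).
    day_limit, week_limit = (4, 8) if female else (5, 15)
    if max_drinks_per_day is not None and max_drinks_per_day >= day_limit:
        return True
    if drinks_per_week is not None and drinks_per_week >= week_limit:
        return True

    # AUDIT-C thresholds: >=3 for women, >=4 for men.
    audit_limit = 3 if female else 4
    if audit_c_score is not None and audit_c_score >= audit_limit:
        return True

    # Textual evidence of binge drinking, an alcohol use disorder, or a related diagnosis.
    return has_aud_evidence


print(is_risky_alcohol_use("female", drinks_per_week=9))  # True (>= 8 drinks/week)
print(is_risky_alcohol_use("male", audit_c_score=3))      # False (below the male cutoff of 4)
```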

The next phase of the study involved building a prototype system. First, researchers used a keyword-based approach to capture contextual cues associated with risky alcohol use in clinical records. These words were borrowed from the Unified Medical Language System (UMLS) and encompassed both terms describing patterns or problems of alcohol use (e.g., abuse, addiction, binge, concern, dependence, excessive, and heavy) and health conditions and behaviors associated with alcohol use (e.g., drunkenness, inebriety, insobriety, intemperance, intoxication). The keyword list also included concepts related to alcohol use causing anxiety and depression, as well as additional rare words such as dissoluteness, distraught, debauchery, dipsomania, and bibulousness.

The list was shown to three alcohol use disorder experts (ACF, JM, and GSW), who suggested additional frequently used alcohol-related keywords and phrases found in EHRs, based on their clinical experience and review of case histories. The Electronic Medical Record Search Engine (EMERSE) was used to search patient records for alcohol-related keywords. The keyword list was further refined by evaluating how effectively each term identified additional patients. Morphological variants of the keywords were generated by manipulating suffixes and added to the list.

This resulted in a final list of 36 keywords, categorized as alcohol-related terms, alcohol-related disorders, and alcohol-related diagnoses or events. Negative and irrelevant phrases (e.g., alcohol wipes, isopropyl alcohol, water, and juice) were also collected.

The NLP model pipeline consists of preprocessing, text analysis, and segment classification steps, described below.

Image #2: The NLP pipeline: preprocessing, text analysis, and segment classification.

First, patient input records were segmented into small, cohesive text snippets using XML tags, annotation formatting tags, and blank lines. Sections included problem lists, medication lists, patient instructions, and structured sections of clinical notes (e.g., patient history, assessment, plan, etc.). Text segments were further broken down into sentences using the Natural Language ToolKit (NLTK). Punctuation marks and other formatting information were retained or replaced with special tokens to facilitate subsequent tasks.
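A minimal sketch of this segmentation step, assuming a simple blank-line heuristic in place of the study's full XML and formatting-tag logic, could look like this:

```python
# Simplified sketch: split a concatenated patient chart into segments, then sentences.
# The blank-line split is an assumption standing in for the richer tag-based logic.
import re
import nltk

nltk.download("punkt", quiet=True)      # sentence tokenizer models (older NLTK releases)
nltk.download("punkt_tab", quiet=True)  # resource name used by newer NLTK releases

def segment_chart(chart_text):
    """Split a patient chart into segments (blank-line separated) and sentences."""
    segments = [s.strip() for s in re.split(r"\n\s*\n", chart_text) if s.strip()]
    return [nltk.sent_tokenize(segment) for segment in segments]

chart = """Social history: Patient reports drinking 2-3 beers daily. Denies tobacco.

Medications: lisinopril 10 mg daily."""
for sentences in segment_chart(chart):
    print(sentences)
```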

The set of keywords identified during prototype development was used to identify "hotspots": record segments containing any mention of an alcohol-related keyword.

A contextual window of words surrounding each hotspot was then searched for negations and contextual modifiers, such as hedging words ("maybe," "probably," "likely," "exclude," etc.) or words indicating that the mention is not about alcohol consumption at all (e.g., "drink 8 ounces of water with this medicine").
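A rough sketch of hotspot detection with a context-window check follows; the keyword and modifier lists are abbreviated examples (the study used a list of 36 keywords), and the window size and tokenization are assumptions.

```python
# Illustrative hotspot detection: find alcohol-related keywords and inspect a small
# window of surrounding words for negations and hedging modifiers.
ALCOHOL_KEYWORDS = {"alcohol", "etoh", "drinks", "binge", "intoxication"}
MODIFIERS = {"no", "denies", "not", "maybe", "probably", "likely", "exclude"}

def find_hotspots(sentence, window=5):
    """Yield (keyword, modifiers found in the surrounding window) pairs."""
    tokens = sentence.lower().replace(",", " ").split()
    for i, token in enumerate(tokens):
        if token in ALCOHOL_KEYWORDS:
            context = tokens[max(0, i - window): i + window + 1]
            yield token, MODIFIERS.intersection(context)

for keyword, modifiers in find_hotspots("Patient denies alcohol use and tobacco use."):
    print(keyword, "- negated/hedged by:", modifiers or "nothing")
```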

Clinical notes often include boilerplate text instructing patients not to consume alcohol before surgery or while taking medications, even when the patient is not at high risk. These instruction forms, patient-provider agreements, and other standardized template segments were removed and replaced with a filler indicating that text had been removed.
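One simple way to suppress such template text is a pattern-based substitution; the template pattern and filler token below are illustrative assumptions, not the study's actual template catalogue.

```python
# Sketch of boilerplate removal: replace known instruction templates with a filler
# so they cannot trigger false alcohol-related hotspots.
import re

TEMPLATE_PATTERNS = [
    re.compile(r"do not (drink|consume) alcohol.{0,80}?(before surgery|with this medication)",
               re.IGNORECASE),
]

def strip_boilerplate(segment):
    for pattern in TEMPLATE_PATTERNS:
        segment = pattern.sub("[TEMPLATE TEXT REMOVED]", segment)
    return segment

print(strip_boilerplate("Please do not drink alcohol for 48 hours before surgery."))
```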

Patient records were also searched for explicit mentions of the patient's gender. This information was then cross-checked against structured metadata to ensure that the gender-specific alcohol consumption thresholds were applied correctly.

Drink mentions, including volumes (e.g., 1.6 ounces), counts (e.g., seven cans of beer), and frequencies (e.g., 3 per day, 10 per week), were identified and normalized. Units of alcohol such as pints, cans, and shots were converted to US standard drinks (1 drink = 14 grams of pure alcohol), and daily counts were converted to weekly counts. These quantities were searched for within a context window of words following the relevant alcohol-related keyword; the window length varied from 3 to 8 words depending on the measure and was selected empirically.
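A minimal sketch of this normalization is shown below; the conversion factors are common approximations based on the 14-gram standard drink and are my assumption, not necessarily the exact lookup table used in the study.

```python
# Sketch of normalizing drink mentions to US standard drinks (1 drink = 14 g pure alcohol)
# and to weekly counts. Conversion factors are approximate and illustrative.
STANDARD_DRINKS_PER_UNIT = {
    "beer": 1.0,   # 12 oz can/bottle of ~5% beer
    "can": 1.0,
    "pint": 1.3,   # 16 oz of ~5% beer
    "shot": 1.0,   # 1.5 oz of 40% spirits
    "wine": 1.0,   # 5 oz glass of ~12% wine
}

def weekly_standard_drinks(count, unit, per="day"):
    """Convert e.g. '2 pints per day' into standard drinks per week."""
    drinks = count * STANDARD_DRINKS_PER_UNIT[unit]
    return drinks * 7 if per == "day" else drinks

print(weekly_standard_drinks(2, "pint", per="day"))    # ~18.2 drinks/week
print(weekly_standard_drinks(10, "beer", per="week"))  # 10.0 drinks/week
```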

Mentions of past alcohol use or alcoholism in the context of the patient's medical history or in a section of the medical history were identified and labeled as positive for risky alcohol use.

A segment was considered positive if the evidence met certain qualitative or quantitative criteria related to alcohol use based on gender-dependent risky drinking thresholds for quantity, frequency, AUDIT-C, and binge drinking. Segments indicating past risky alcohol use were labeled as positive. Segments pertaining to family history and segments with explicit denial were labeled as negative.

If any segment was labeled positive for risky alcohol use, the patient was considered positive. On the other hand, if none of the segments were labeled as positive, the patient was labeled as negative.
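The patient-level decision therefore reduces to an "any positive segment" rule, which in code is a one-liner (the segment labels here are hypothetical):

```python
# Patient-level aggregation: a patient is positive if any of their segments is positive.
segment_labels = [False, False, True, False]  # hypothetical per-segment decisions
patient_positive = any(segment_labels)
print(patient_positive)  # True
```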

From the final dataset of N = 53811 patients, 1200 patient records were randomly selected for data exploration, prototype development, and model training. From these 1200 records, a subset of 500 patient records was selected using alcohol-related keywords to create a gold-labeled training subset. The 500 records were manually labeled by four expert annotators so that each record was annotated twice. Two annotators were experts with clinical training in alcohol use disorders, and the other two were trained in alcohol research. All annotators had access to the algorithm's decisions and, when they disagreed with them, provided explanations for their own decisions whenever possible. When two annotators disagreed with each other, a third alcohol use disorder expert acted as an arbiter.

To create the labeled test set, 100 patient records were randomly selected from the remaining 52611 and labeled by the algorithm; they were chosen so as to yield a roughly balanced set of positive and negative cases. These records were likewise annotated by four annotators working in two pairs, as with the training set, but this time without seeing the algorithm's labels (blind annotation). All disagreements were resolved by a third alcohol use disorder expert.

For comparison, the researchers also categorized patients in the dataset using ICD-9 and ICD-10 diagnosis codes from the electronic medical records. Patients with any diagnosis code in the past 3 years from the list of alcohol-related codes published as part of the Agency for Healthcare Research and Quality (AHRQ) clinical classification software were considered positive under this alternative approach. This mimics the current practice of identifying patients with alcohol problems using diagnosis codes alone and is referred to below as ICD code-based labeling.
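Conceptually, this comparator is just a set lookup against the AHRQ code list; in the sketch below, the two codes shown are real alcohol-related ICD-10 codes used only as examples, and the full AHRQ list is far longer.

```python
# Sketch of ICD code-based labeling: flag a patient if any diagnosis code from the past
# 3 years appears in an alcohol-related code list (here, two example ICD-10 codes:
# F10.20 = alcohol dependence, K70.30 = alcoholic cirrhosis without ascites).
ALCOHOL_RELATED_CODES = {"F10.20", "K70.30"}

def icd_positive(patient_codes):
    """Return True if the patient has at least one alcohol-related diagnosis code."""
    return bool(ALCOHOL_RELATED_CODES.intersection(patient_codes))

print(icd_positive({"I10", "E11.9"}))    # False
print(icd_positive({"I10", "F10.20"}))   # True
```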

NLP-based labeling and ICD code-based labeling were compared against the expert assessment of the 100 cases in the test set. This provided a direct measure of how current diagnosis-code practice performs relative to the proposed NLP approach. Using the human-coded data as the gold standard also made it possible to investigate which approach identifies risky drinking more accurately, and by how much, and highlighted the limitations of each method and the implications of using it to label the rest of the dataset. The main evaluation measures were sensitivity, specificity, positive predictive value (PPV), and the F1 score, the harmonic mean of sensitivity and PPV. All measures range from 0 to 1, with higher values indicating better performance.
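These measures follow directly from a confusion matrix; as a small sketch, the snippet below reproduces the NLP-versus-human numbers reported in the results section (27 true positives, 11 false positives, 4 false negatives, 58 true negatives).

```python
# Evaluation measures used in the study, computed from confusion-matrix counts.
def evaluate(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)                       # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                               # positive predictive value
    f1 = 2 * sensitivity * ppv / (sensitivity + ppv)   # harmonic mean of the two
    return sensitivity, specificity, ppv, f1

# NLP vs. human labels on the 100-case test set (see the results section below).
print([round(m, 2) for m in evaluate(tp=27, fp=11, fn=4, tn=58)])
# [0.87, 0.84, 0.71, 0.78]
```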

Results of the study

Table #1: Demographic and clinical characteristics of the sample.

Patient characteristics are summarized in Table #1. Participants were 52.6% female (49% in the annotated test sample), predominantly white (90.1%, 95% in the annotated test sample), and non-Hispanic (98.1%, 97% in the annotated test sample). The mean age was 53.6 years, and 4.8% of the sample had a diagnosis code indicating risky alcohol use or alcohol use disorder in the past 3 years.

Table #2: Comparison of the performance of the ICD code-based approach and the NLP-based approach on the human-labeled test set.

Using the test evaluation dataset, the researchers compared the effectiveness of their NLP approach and the ICD code-based approach against the human labels (Table #2). The human annotators categorized 31 of the 100 cases as meeting criteria for risky drinking ("positive") and 69 as not meeting them ("negative"). On the same test set, the NLP algorithm labeled 38 of the 100 cases as positive and 62 as negative, while the ICD codes identified 16 cases as positive and 84 as negative. Of the 31 positive cases flagged by the human annotators, NLP correctly identified 27, whereas the ICD codes correctly detected only 9. The NLP-based approach had a sensitivity of 0.87, specificity of 0.84, PPV of 0.71, and an F1 score of 0.78. The ICD code-based approach had a sensitivity of 0.29, specificity of 0.90, and PPV of 0.56, giving an F1 score of 0.38.

Table #3: Detailed breakdown of the NLP and ICD code-based classification of the human-labeled test set.

Table #3 shows a more detailed comparison of how the NLP and ICD code-based approaches classified patients relative to the human labels. Of the 31 positive cases, NLP identified 18 patients that were missed by the ICD codes, whereas the ICD codes did not identify any cases missed by NLP.

Table #4: Comparison of the effectiveness of the ICD code and NLP approaches on the full sample.

The researchers also compared the effectiveness of the NLP and ICD code approaches on the full dataset of 53629 patients (with training and test cases removed), as well as the proportion of the full dataset that each method classified as positive. Table #4 shows the agreement between NLP-based and ICD code-based labeling on the full filtered dataset. NLP classified a total of 7794 patients as positive for risky alcohol use, whereas the ICD codes classified 2595 as positive. The two methods agreed on 1670 positive and 44910 negative patients. The ICD codes contributed 925 positives not flagged by NLP, while NLP contributed 6124 positives of its own. Overall, NLP classified about 14.5% of patients as positive for risky alcohol use, versus 4.8% for the ICD codes.

For a more detailed look at the nuances of the study, I recommend checking out the researchers' report.

Epilogue

In the paper we reviewed today, scientists created a system capable of assessing the risk that a patient's alcohol consumption poses during surgical intervention.

The protagonist of this system is not a human at all, but an AI. During the preparatory and training steps, the AI was trained on data drawn from medical records, paying special attention to words that directly or indirectly indicate alcohol abuse, at least in the period before an operation.

As the scientists themselves point out, many people who drink regularly don't have a drinking problem, and when they do, they may never be formally diagnosed with an alcohol use disorder or addiction, which would be easy for the surgical team to detect in their medical records.

The artificial intelligence model found evidence of risky alcohol use in the records of 87% of patients identified by experts (humans) as alcohol abusers. However, only 29% of these patients had an alcohol-related diagnosis code in their list of diagnoses. Thus, many patients with a higher risk of complications would have slipped under the radar of their surgical team.

Next, the AI was run on about 53,000 anonymized patient medical records collected as part of the Michigan Genomics Initiative. Overall, 15% of patients met the AI model's criteria, compared to 5% flagged by diagnosis codes.

The new findings suggest that surgical clinics that simply check the diagnostic codes listed in incoming patient charts and flag such things as alcohol use disorder, alcohol dependence, or alcohol-related liver disease will be overlooking many high-risk patients. And after all, combining surgery and alcohol is a very bad idea.

Consequently, the developed technique can identify high-risk individuals in the preoperative period due to their alcohol consumption. In the future, this AI system can be used to assess other risks based on the patient's medical records.

Translated from here