Background:
Electronic Health Records (EHRs) contain a wealth of patient data, but these data are often unstructured and difficult to analyze. Artificial Intelligence (AI), and in particular Natural Language Processing (NLP), which can interpret and generate human language, can help extract longitudinal information on the disease course, especially in complex chronic diseases such as Systemic Lupus Erythematosus (SLE).
Objectives:
Our aim was to develop an integrated approach that combines clinical knowledge and advanced data science techniques (specifically, an automated rule-based system and NLP) to characterize SLE patients in terms of involved disease domains, current symptoms, therapies and disease activity.
Methods:
A standardized, replicable methodology was created, using data from a training set (development cohort) to extract relevant SLE features. The framework combined AI-based steps with human intelligence (HI). A stepwise sequence was followed (steps 1 and 4 HI-based; steps 2, 3 and 5 AI-based):
1) ontology definition, specifying the relevant SLE attributes that characterize patient status at the time of the visit. Namely, we decided to extract: a) disease domains (hematological, cutaneous, articular, kidney, serositic, systemic, neurological, vascular involvement); b) current symptoms; c) therapies; d) disease activity expressed as SLEDAI-2K;
2) creation of a structured body of knowledge, where EHRs are selected and preprocessed using segmentation and tagging techniques;
3) extraction of the information specified in step 1 by an automated NLP algorithm, able to identify from the EHRs, for each patient contact, the previously defined SLE attributes;
4) development of a rule-based framework determining how SLE attributes, biomarkers and patient history are combined to characterize the disease domains (Figure 1) and disease activity;
5) implementation of the rule-based framework to classify each patient contact in terms of SLE attributes.
Finally, the clinical records of 56 patients (validation cohort, excluded from the EHRs used to develop the algorithm) were examined by a group of physicians who manually extracted SLE attributes. This information was then compared with that extracted by the NLP algorithm: the accuracy of the algorithm was tested against the gold standard (manual extraction further revised by a second team of expert clinicians). Furthermore, the distribution of SLEDAI-2K extracted by the algorithm (proxy SLEDAI) was compared with the SLEDAI-2K manually annotated by physicians (manual SLEDAI).
Results:
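The two validation measures used in this abstract, accuracy (true positives plus true negatives over all observations) and Levene's test for equality of variances, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `accuracy` and `levene_statistic`, the median centering of Levene's test, and all toy values are assumptions made for illustration only, and none of the numbers come from the study.

```python
from statistics import mean, median


def accuracy(predicted, gold):
    """Fraction of observations where the algorithm matches the gold standard.

    For presence/absence labels, the matches are exactly the true positives
    plus the true negatives.
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    agreements = sum(p == g for p, g in zip(predicted, gold))
    return agreements / len(gold)


def levene_statistic(*groups, center=median):
    """Levene's W statistic for equality of variances across groups.

    The centering choice (median here) is an assumption; the abstract does
    not state which variant was used. Raises ZeroDivisionError if the
    within-group spread of the absolute deviations is zero.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each value from its own group's center
    deviations = [[abs(x - center(g)) for x in g] for g in groups]
    group_means = [mean(zs) for zs in deviations]
    grand_mean = mean([z for zs in deviations for z in zs])
    between = sum(
        len(zs) * (m - grand_mean) ** 2 for zs, m in zip(deviations, group_means)
    )
    within = sum(
        (z - m) ** 2 for zs, m in zip(deviations, group_means) for z in zs
    )
    return (n_total - k) / (k - 1) * between / within


# Invented toy labels for one attribute (e.g. "kidney involvement" per contact)
nlp_labels = [True, True, False, False, True]
gold_labels = [True, False, False, False, True]
toy_accuracy = accuracy(nlp_labels, gold_labels)

# Invented toy scores standing in for manual and proxy SLEDAI values
manual_sledai = [0, 1, 5]
proxy_sledai = [2, 3, 4]
toy_w = levene_statistic(manual_sledai, proxy_sledai)
```

The sketch returns only the W statistic. Obtaining a p-value requires the F distribution with (k-1, N-k) degrees of freedom, for which a statistics library such as SciPy (`scipy.stats.levene`) would normally be used.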
The framework was applied to a cohort of 262 SLE patients, with a median of 18 (11-28) contacts over a time window of 7 (4-10) years, for a total of 4567 EHRs. In the 56 patients of the validation cohort (number of contacts 12.5, 10-17), the reported frequencies of involved disease domains were articular (59%), cutaneous (62%), hematological (60%), neurological (20%), kidney (34%), serositic (20%), systemic (16%) and vascular (30%). Among symptoms, the most frequent were arthromyalgia (78%) and erythema (64%). Antimalarials, traditional immunosuppressants and biologics were used by 79%, 75% and 27% of the patients, respectively. These percentages are plausible values for an SLE population, which was considered proof of face validity. The accuracy of the NLP algorithm in extracting data [number of true positives and true negatives/all observations] was in the range of 99-100% for disease domains, 97-99% for symptoms, and 93-98% for therapies. The variances of manual SLEDAI and proxy SLEDAI were not significantly different (Levene's test 1.58, p=0.21) (Figure 2). Regarding the effort required to extract data from EHRs, the mean time to extract the SLE features through the framework was about 10 minutes for the whole cohort of 262 patients, compared with 2 hours per patient through HI.
Conclusion:
The proposed framework integrates domain expertise and AI-based techniques to deliver a validated longitudinal phenotype characterization for each SLE patient. Applying this technique to real-life SLE data appears promising and feasible, with a substantial saving of human effort.
REFERENCES:
NIL.
Acknowledgements:
This work was funded by AstraZeneca.
Disclosure of Interests:
Augusta Ortolan: Janssen, Novartis, AbbVie, UCB Pharma; Livia Lilli: None declared; Silvia Laura Bosello: None declared; Laura Antenucci: None declared; Carlotta Masciocchi: None declared; Jacopo Lenkowicz: None declared; Piergiacomo Cerasuolo: None declared; Lucia Lanzo: None declared; Silvia Piunno: None declared; Gabriella Castellino: AstraZeneca; Marco Gorini: AstraZeneca; Stefano Patarnello: None declared; Maria Antonietta D'Agostino: Novartis, BMS, Janssen, Pfizer, Amgen, Galapagos, AbbVie, UCB, and Eli Lilly.