Integrating structured and unstructured sources for temporal representation of patients’ medication histories
Abstract
Large-scale healthcare data are becoming increasingly available for research through the widespread adoption of electronic health records (EHRs). EHRs typically contain a multitude of diverse data, including prescriptions, discharge summaries and lab results. Extracting information from these disparate sources poses unique challenges due to the high degree of variability in content and structure. This thesis presents a set of methods for automatically integrating structured and free-text EHR data in order to create comprehensive, structured medication histories for individual patients. Interpreting a patient's medication record requires extracting and reasoning over clinical events (i.e., prescriptions) and their temporality. For example, it requires understanding when drugs were prescribed and for how long, and whether these periods overlapped with the prescribing of other drugs. In most EHRs, this information can be found in the form of time-stamped prescription records (i.e., structured data) and/or in unstructured clinical narratives (e.g., notes and letters). Since the structured and unstructured parts of the EHR are often complementary, we here investigate automatically integrating the medication information held in both. We created a pipeline with five steps: (1) a named entity recognition method to identify free-text time expressions (TIMEXs), drug mentions and drug-associated attributes in clinical narratives; (2) a temporal relation extraction method to classify temporal relationships between prescriptions and TIMEXs; (3) a drug status detection method to label prescriptions with a status; (4) a drug name normalisation method to map drug names to standardised terminologies; and (5) an entity resolution method to identify whether two or more drug mentions refer to the same prescription. We applied the pipeline to data from MIMIC-III, and the resulting output was empirically evaluated by a team of clinicians.
We report on the overlap between the medication information present in structured and unstructured EHR data, and on the impact that adding information extracted from unstructured clinical text has on the accuracy of medication histories obtained from structured data. On average, more than half of the medication prescriptions (54%) were found in both sources. However, about 25% of prescriptions were found only in the structured data, and about 21% were found only in the unstructured data. Such integrated, structured medication histories enable enhanced applications for direct care (e.g., computerised decision support) and secondary uses of EHR data (e.g., pharmacoepidemiological and drug safety studies).
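To illustrate how an overlap breakdown of this kind could be computed for a single patient, the following is a minimal sketch, not the method used in the thesis. It assumes that each prescription has already been reduced to a comparable key (here a hypothetical tuple of normalised drug name and start date) after normalisation and entity resolution; the function name `source_overlap` and the example data are invented for illustration.

```python
from typing import Dict, Hashable, Set


def source_overlap(
    structured: Set[Hashable], unstructured: Set[Hashable]
) -> Dict[str, float]:
    """Return the share of a patient's prescriptions found in both sources,
    only in the structured data, and only in the unstructured data.

    Keys are assumed to be directly comparable across sources
    (an assumption of this sketch, not a detail taken from the thesis).
    """
    all_prescriptions = structured | unstructured
    if not all_prescriptions:
        # No prescriptions in either source: report zero shares.
        return {"both": 0.0, "structured_only": 0.0, "unstructured_only": 0.0}

    total = len(all_prescriptions)
    return {
        "both": len(structured & unstructured) / total,
        "structured_only": len(structured - unstructured) / total,
        "unstructured_only": len(unstructured - structured) / total,
    }


# Hypothetical example data for one patient.
structured_rx = {
    ("aspirin", "2101-03-01"),
    ("metformin", "2101-03-01"),
    ("warfarin", "2101-03-02"),
}
unstructured_rx = {
    ("aspirin", "2101-03-01"),
    ("metformin", "2101-03-01"),
    ("furosemide", "2101-03-03"),
}

shares = source_overlap(structured_rx, unstructured_rx)
print(shares)
```

Averaging such per-patient shares across a cohort would give figures comparable in form to those reported above, under the stated assumption about how prescriptions are matched.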