Routinely acquired medical data can be classified into structured and unstructured data. Structured data are stored in a consistent, organized manner and are typically reported using standard units and ranges (e.g., laboratory results, vital signs, ICD-based categorical diagnoses), whereas unstructured data lack a clear organization and a precise format (e.g., imaging results, clinical notes) [1]. Crucially, the majority of available clinical data in EHRs are unstructured [11] (Fig. 1).
The free-text narratives written by health professionals (including physicians, nurses, and hospital pharmacists [3]) in EHRs reflect current clinical practices and provide a window into real-time, real-world clinical data. However, the complexity of free text poses a significant methodological bottleneck to accessing, organizing, and analyzing written language with big data analytics.
Accessing the free-text information in EHRs: the role of AI and NLP
The extraction of written text from EHRs is achieved through a combination of NLP and machine learning techniques. NLP is a field that borrows concepts and techniques from linguistics, computer science, and engineering to process naturally occurring language (i.e., speech or text), whereas machine learning models enable computers to extract patterns from datasets and draw conclusions on their own. Deep learning classification methods, which are trained on large amounts of EHR data, teach the system to characterize mentions of medical entities as negative, speculative, or affirmative clinical statements. The extracted and processed information is then structured with artificial neural networks. Finally, analytical tools such as random forests, decision trees, and logistic regression enable the construction and visualization of predictive models derived from EHR data.
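To make the negative/speculative/affirmative distinction concrete, the following minimal sketch labels an entity mention by looking for cue words in the few tokens that precede it. It is a deliberately simple rule-based stand-in, not the deep learning approach described above; the cue lists, the window size, and the function name `classify_assertion` are illustrative assumptions rather than part of any published system.

```python
import re

# Hypothetical cue lists chosen for illustration only; a real system would
# learn such patterns from annotated EHR data or use a curated lexicon.
NEGATION_CUES = ("no", "denies", "without", "negative for")
SPECULATION_CUES = ("possible", "suspected", "probable", "cannot rule out")
WINDOW = 5  # number of tokens before the entity that are inspected


def classify_assertion(sentence: str, entity: str) -> str:
    """Label an entity mention as negative, speculative, or affirmative."""
    text = sentence.lower()
    start = text.find(entity.lower())
    if start == -1:
        raise ValueError(f"entity {entity!r} not found in sentence")

    # Keep only the last WINDOW alphabetic tokens before the entity mention.
    preceding = re.findall(r"[a-z]+", text[:start])[-WINDOW:]
    # Pad with spaces so that cues only match whole words or whole phrases.
    context = " " + " ".join(preceding) + " "

    if any(f" {cue} " in context for cue in NEGATION_CUES):
        return "negative"
    if any(f" {cue} " in context for cue in SPECULATION_CUES):
        return "speculative"
    return "affirmative"


examples = [
    ("Patient denies chest pain.", "chest pain"),
    ("Possible pneumonia on chest X-ray.", "pneumonia"),
    ("History of asthma since childhood.", "asthma"),
]
labels = [classify_assertion(sentence, entity) for sentence, entity in examples]
```

Such hand-written rules break down quickly on real clinical text (for example, cues that follow the entity, or negations whose scope ends mid-sentence), which is one reason learned classifiers trained on large EHR corpora are preferred in practice.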
Extracting clinical information from free text is challenging [7]. The main difficulties revolve around capturing essential features of language, including temporal relationships, context, homonyms, and acronyms. A recent systematic review on the use of NLP to extract clinical information also pointed out other important technological gaps regarding concept understanding, causal inference, and external validation of NLP-extracted data against annotated clinical corpora [12]. Despite these limitations, NLP is a cost-effective clinical tool; it has been estimated that 1 h of NLP system development saves at least 20 h of manual review of medical records, with optimal sensitivity and specificity [13].
EHRs and big data advance healthcare delivery
The effective exploitation of big data is thought to advance healthcare delivery by promoting the following actions [7, 11].
Generation and dissemination of data-driven medical knowledge in a timely fashion
The costs and time associated with manual data collection far exceed those associated with the use of automated tools. The combination of machine learning and NLP to explore EHRs has offered novel descriptive and predictive insights into clinical populations [9], patient management [14], and pharmacovigilance [15], and shows great promise for the generation of computerized clinical decision support (CDS) [16].
Personalized care
By integrating patients’ ‘-omics’ data (i.e., genomics, proteomics, microbiomics) with the information captured in EHRs, the Electronic Medical Records and Genomics (eMERGE) Network [17] has already identified unknown associations between patients’ genetic information and the clinical information in their EHRs in diverse therapeutic areas including ophthalmological and cardiovascular diseases.
Healthcare management and optimization of resource use
Clinical information in EHRs can be exploited to perform real-time predictive analyses to optimize resource use and management in terms of cost–benefit analysis. Relevant predictive outcomes achieved via analysis of EHR data include identification of risk factors associated with high-cost patients, readmissions, triage, and decompensation [7].
Improving the state of the art in EHR studies
To move the field forward, we believe that the following three aspects should be considered in NLP research using EHRs. First, these studies always benefit from a multicentric, multilanguage methodology; unlike single-center studies, this approach enables access to even larger datasets (in turn generating more accurate predictive models), inclusion of more diverse study populations, and the possibility of comparing results across centers and regions. Second, the output of a clinical NLP system should always be validated against a corpus of expert-reviewed clinical notes in terms of precision and recall of extracted medical concepts [14]. Finally, researchers must always guarantee the confidentiality and security of the data, in compliance with hospital ethics committees, national and international regulations, and pharmaceutical industry policies. Following these recommendations, the use of available research tools such as the EHRead® technology now allows researchers to rapidly answer clinical questions in real time using patient-centered data [14, 18, 19]. A summary of this methodological approach is depicted in Fig. 2.
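The validation step recommended above can be sketched as a comparison between the concepts an NLP system extracts and those annotated by experts in the same notes. The example below is a minimal illustration under stated assumptions: the note identifiers, concepts, and the function name `evaluate` are hypothetical, and a mention counts as correct only if the same (note, concept) pair appears in the expert annotations. Recall is the same quantity as sensitivity.

```python
from dataclasses import dataclass

# A mention is identified by (note_id, concept); both values are hypothetical.
Mention = tuple[str, str]


@dataclass(frozen=True)
class Scores:
    precision: float  # fraction of extracted mentions that experts confirmed
    recall: float     # fraction of expert mentions the system found (sensitivity)
    f1: float         # harmonic mean of precision and recall


def evaluate(extracted: set[Mention], gold: set[Mention]) -> Scores:
    """Score NLP-extracted mentions against an expert-annotated gold standard."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


# Toy data invented for illustration; not taken from any real corpus.
gold = {
    ("note1", "asthma"),
    ("note1", "salbutamol"),
    ("note2", "heart failure"),
    ("note2", "dyspnea"),
}
extracted = {
    ("note1", "asthma"),
    ("note2", "heart failure"),
    ("note2", "dyspnea"),
    ("note2", "pneumonia"),  # false positive: not in the expert annotations
}
scores = evaluate(extracted, gold)
```

Exact matching of (note, concept) pairs is the strictest choice; real evaluations often also account for partial span overlaps and report agreement between annotators alongside these scores.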