Natural Language Processing for Automated Annotation of Clinical Notes: Enhancing Phenotyping and Cohort Identification

Authors

Emre Kaya, Olga Petrov, Arman Grigoryan Hasan

Boğaziçi University, Türkiye; Moscow State University, Russia; Yerevan State University, Armenia

Keywords: Natural Language Processing (NLP), Electronic Health Records (EHR), Clinical Notes, Phenotyping, Cohort Identification, Deep Learning

The vast majority of critical patient information is stored within unstructured clinical notes in Electronic Health Records (EHRs), making it inaccessible to traditional data analysis methods. This paper explores the application of Natural Language Processing (NLP) to automatically extract and structure this information for enhanced patient phenotyping and cohort identification. Manual chart review is time-consuming, expensive, and prone to human error, creating a significant bottleneck for clinical research and quality improvement initiatives. This research details the development of an NLP pipeline utilizing both rule-based and deep learning models to identify patients with specific conditions, such as heart failure with preserved ejection fraction (HFpEF), from radiology and cardiology reports. The results demonstrate that NLP systems can achieve high accuracy in classifying clinical concepts, significantly accelerating the process of cohort building and enabling large-scale retrospective studies that were previously infeasible. The discussion addresses challenges related to model portability, linguistic complexity, and the imperative of integrating domain expertise into the NLP development process.
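To make the rule-based side of such a pipeline concrete, the following is a minimal illustrative sketch in Python, not the pipeline developed in this work. It assumes a report is available as a plain string. The function names (`extract_lvef`, `has_affirmed_heart_failure`, `flag_hfpef_candidate`), the regular expressions, the short negation cue list, and the 50% ejection fraction cutoff are all hypothetical choices made for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern: an ejection fraction mention followed by a percentage,
# optionally written as a range such as "55-60%".
EF_PATTERN = re.compile(
    r"\b(?:LVEF|EF|ejection\s+fraction)\b\D{0,30}?(\d{1,3})"
    r"(?:\s*(?:-|to)\s*(\d{1,3}))?\s*%",
    re.IGNORECASE,
)

# Hypothetical lexicon of heart failure mentions; a real lexicon would be
# far larger and curated with clinical experts.
HF_PATTERN = re.compile(r"\b(?:heart failure|CHF|HFpEF)\b", re.IGNORECASE)

# Simplified negation cues, checked only in the text that precedes a mention
# within the same sentence.
NEGATION_CUE = re.compile(r"\b(?:no|not|denies|without|negative for)\b", re.IGNORECASE)


def extract_lvef(text: str) -> Optional[int]:
    """Return the first reported ejection fraction (lower bound of a range)."""
    match = EF_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1))


def has_affirmed_heart_failure(text: str) -> bool:
    """Return True if any heart failure mention lacks a preceding negation cue."""
    for match in HF_PATTERN.finditer(text):
        # Approximate the sentence start as the character after the last period.
        sentence_start = text.rfind(".", 0, match.start()) + 1
        preceding = text[sentence_start:match.start()]
        if not NEGATION_CUE.search(preceding):
            return True
    return False


def flag_hfpef_candidate(text: str, ef_threshold: int = 50) -> bool:
    """Flag a report that affirms heart failure and reports EF >= ef_threshold.

    The default threshold of 50 is an assumption made for this sketch.
    """
    ef = extract_lvef(text)
    return has_affirmed_heart_failure(text) and ef is not None and ef >= ef_threshold


if __name__ == "__main__":
    report = "History of heart failure. Echo today: LVEF 55-60%."
    print(extract_lvef(report), flag_hfpef_candidate(report))
```

Rules of this kind are transparent and easy for domain experts to audit, but they are brittle across institutions and writing styles, which is one reason to pair them with learned models. A flag produced this way should be treated as a candidate for review rather than a confirmed phenotype.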