This research presents an automated disease diagnosis framework that leverages natural language processing and deep learning to predict diagnosis codes from unstructured electronic health record (EHR) clinical notes. Using the MIMIC-III critical care dataset, clinical narratives such as discharge summaries and physician notes are extracted, validated, and preprocessed to construct a labeled corpus aligned with ICD-9 diagnoses. The study implements a pipeline comprising text cleaning, feature extraction, and supervised learning, and compares baseline models such as Logistic Regression and Bi-LSTM with transformer-based architectures built on BERT. Models are trained with categorical cross-entropy loss and evaluated with standard multi-class metrics, including accuracy, F1-score, and ROC-AUC, while hyperparameters such as learning rate, optimizer configuration, and training epochs are systematically tuned. Experimental results show that BERT-based classifiers substantially outperform these baselines, achieving higher accuracy and F1-scores and handling complex clinical terminology and context more robustly. These findings highlight the potential of transformer-based NLP models to enhance clinical decision support and large-scale phenotyping, while also underscoring limitations related to label noise, dataset-specific bias, and truncated input length, as well as the need for more comprehensive interpretability and multi-label modeling in future work.
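The training loss and two of the evaluation metrics named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the thesis's implementation: the function names, the toy labels (ICD-9-style code strings chosen only for illustration), and the choice of macro averaging for the F1-score are all assumptions, since the abstract does not specify them.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_index):
    """Negative log-probability the model assigns to the true class."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(probs[true_index], eps))


def accuracy(y_true, y_pred):
    """Fraction of notes whose predicted code matches the reference code."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: one code string per clinical note.
y_true = ["428.0", "401.9", "038.9", "428.0"]
y_pred = ["428.0", "401.9", "428.0", "428.0"]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
print("loss for one note:", categorical_cross_entropy(softmax([2.0, 1.0, 0.1]), 0))
```

Macro averaging weights every diagnosis code equally, so rare codes affect the score as much as frequent ones. A weighted or micro average would be an equally plausible reading of "F1-score" here.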
| Field | Value |
|---|---|
| Date of Award | 2025 |
| Original language | American English |
| Awarding Institution | Eastern Illinois University |
| Supervisor | Toqeer A Israr (Supervisor) |
Disease Diagnosis Using Natural Language Processing and Deep Learning
Sherla, M. K. (Author). 2025
Student thesis: Master's Thesis › Master of Science (MS)