About
Clinical notes and other free-text documents capture a breadth of clinical information that is often not available in structured data. Transformer-based natural language processing (NLP) models, such as BERT, have shown great promise for clinical text processing. However, these models are commonly trained on generic corpora that do not reflect many of the intricacies of the clinical domain.
Developers and researchers have adapted such models to new domains using two approaches: pre-training and fine-tuning. Pre-training "in-domain" entails training a model on domain-specific text so that it learns contextualized word embeddings tuned to that domain. Domain-specific models have demonstrated improved performance, but pre-training comes at a significant computational cost. In fine-tuning, one typically adds small modules (e.g., a single linear layer) on top of an existing pre-trained architecture, introducing domain-specific context at far lower computational cost and with a much smaller corpus than pre-training requires.
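As a concrete illustration of the fine-tuning approach, the sketch below adds a single linear classification layer on top of a pre-trained BERT encoder using the Hugging Face `transformers` library. The checkpoint name, label count, and example sentence are placeholders, not the configuration used in this project.

```python
# Minimal fine-tuning sketch: a pre-trained BERT encoder with one linear layer on top.
# Checkpoint, number of labels, and example text are illustrative assumptions.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class BertWithLinearHead(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token representation as the sequence summary.
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_embedding)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertWithLinearHead()
batch = tokenizer(["Patient denies chest pain."], return_tensors="pt",
                  padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```

Only the added head (and optionally the encoder weights) is updated during training, which is why this route needs far less data and compute than pre-training from scratch.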
This project evaluates the extent to which varying degrees of pre-training, fine-tuning, and transfer learning with transformer-based models can improve clinical NLP performance. Using clinical text from Vanderbilt University Medical Center, we pre-train and fine-tune BERT models and compare their performance with publicly available models across three areas: tokenization, training, and language modeling. Finally, we apply these models to clinical use cases and evaluate their effect on downstream tasks.
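For reference, in-domain pre-training of a BERT model typically means continuing masked language modeling on domain text. The sketch below shows one common way to do this with `transformers` and `datasets`; the corpus path, checkpoint, and hyperparameters are hypothetical placeholders and do not describe this project's actual pipeline.

```python
# Sketch of continued ("in-domain") pre-training via masked language modeling.
# "clinical_notes.txt" is a hypothetical file of de-identified notes, one per line.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = load_dataset("text", data_files={"train": "clinical_notes.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard BERT masking rate.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clinical-bert",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```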