How do you train an AI to understand clinical language with less clinical data? Train another AI to synthesize training data.
Artificial intelligence is changing how medicine is practiced, and it is increasingly being applied across clinical tasks.
This is fueled by generative AI and models like GatorTronGPT, a generative language model trained on the University of Florida’s HiPerGator AI supercomputer and detailed in a paper published Thursday in npj Digital Medicine.
GatorTronGPT joins a growing number of large language models (LLMs) trained on clinical data. Researchers trained the model using the GPT-3 framework, also used by ChatGPT.
The training corpus totaled 277 billion words: 82 billion from de-identified clinical notes and 195 billion from diverse English texts.
But there’s a twist: Using carefully prepared prompts, the research team also had GatorTronGPT generate a synthetic corpus of more than 20 billion words of clinical text. The synthetic text focuses on clinical factors and reads just like real clinical notes written by doctors.
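The paper’s actual prompts aren’t reproduced here; the following is a minimal sketch of what prompt-driven synthetic note generation looks like in practice, with a hypothetical `model_generate` stub standing in for a call to a model like GatorTronGPT and illustrative prompt wording that is not from the study:

```python
# Hypothetical sketch of prompt-driven synthetic clinical text generation.
# `model_generate` is a stand-in for a real LLM call; the prompt template
# is illustrative, not the one used in the GatorTronGPT paper.

def build_prompt(condition: str, note_type: str = "progress note") -> str:
    """Assemble a carefully prepared prompt requesting a synthetic note."""
    return (
        f"Write a de-identified {note_type} for a patient with {condition}. "
        "Use realistic clinical language, structure, and abbreviations."
    )

def synthesize_corpus(conditions, model_generate):
    """Generate one synthetic note per condition using the supplied model."""
    return [model_generate(build_prompt(c)) for c in conditions]

# Example with a trivial stub in place of the real model:
stub = lambda prompt: f"[synthetic note for prompt: {prompt[:40]}...]"
notes = synthesize_corpus(["type 2 diabetes", "atrial fibrillation"], stub)
```

Generating the corpus this way lets the prompts steer the synthetic notes toward specific clinical factors while keeping any real patient data out of the output.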
This synthetic data was then used to train a BERT-based model called GatorTron-S.
In a comparative evaluation, GatorTron-S exhibited remarkable performance on clinical natural language understanding tasks like clinical concept extraction and medical relation extraction, beating the records set by the original BERT-based model, GatorTron-OG, which was trained on the 82-billion-word clinical dataset.
More impressively, it did so using far less data: 20 billion synthetic words versus 82 billion words of real clinical text.
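Clinical concept extraction, one of the benchmark tasks, means tagging spans such as problems and drugs in a note. GatorTron-S does this with a fine-tuned BERT encoder; the toy lexicon-lookup sketch below only illustrates the task’s input and output shape, not the model itself:

```python
# Toy illustration of the clinical concept extraction task (not the model).
# A real system like GatorTron-S uses a fine-tuned BERT token classifier;
# this simple lexicon lookup just shows the expected input/output shape.

LEXICON = {
    "metformin": "DRUG",
    "type 2 diabetes": "PROBLEM",
    "insulin": "DRUG",
}

def extract_concepts(note: str):
    """Return (concept, label, start_offset) triples found in the note."""
    note_lower = note.lower()
    hits = []
    for phrase, label in LEXICON.items():
        start = note_lower.find(phrase)
        if start != -1:
            hits.append((phrase, label, start))
    return sorted(hits, key=lambda h: h[2])

note = "Patient with type 2 diabetes, started on metformin 500 mg."
print(extract_concepts(note))
```

A learned model replaces the fixed lexicon with contextual predictions, which is what lets it handle abbreviations, misspellings, and concepts it has never seen verbatim.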
Both the GatorTron-OG and GatorTron-S models were trained on 560 NVIDIA A100 Tensor Core GPUs running NVIDIA’s Megatron-LM package on the University of Florida’s HiPerGator supercomputer. Technology from the Megatron-LM framework used in the project has since been incorporated into the NVIDIA NeMo framework, which has been central to more recent work on GatorTronGPT.
Using synthetic data created by LLMs addresses several challenges: LLMs require vast amounts of training data, yet quality medical data is in limited supply.
In addition, synthetic data allows for model training that complies with medical privacy regulations, such as HIPAA.
The work with GatorTronGPT is just the latest example of how LLMs — which exploded onto the scene last year with the rapid adoption of ChatGPT — can be tailored to assist in a growing number of fields.
It’s also an example of the advances made possible by new AI techniques powered by accelerated computing.
The GatorTronGPT effort is the latest result of an ambitious collaboration announced in 2020, when the University of Florida and NVIDIA unveiled plans to build the world’s fastest AI supercomputer in academia.
This initiative was driven by a $50 million gift combining contributions from NVIDIA co-founder Chris Malachowsky and NVIDIA itself.
Using AI to train more AI is just one example of HiPerGator’s impact, with the supercomputer promising to power more innovations in medical sciences and across disciplines throughout the University of Florida system.