The University of Florida’s academic health center, UF Health, has partnered with NVIDIA to develop a neural network that generates synthetic clinical data, a powerful resource that researchers can use to train other AI models in healthcare.
Trained on a decade of data representing more than 2 million patients, SynGatorTron is a language model that can create synthetic patient profiles that mimic the health records it learned from. The 5-billion-parameter model is the largest language generator in healthcare.
“Synthetic data is not tied to an actual human being, but it exhibits similar characteristics to real patients,” said Dr. Duane Mitchell, assistant vice president for research and director of the UF Clinical and Translational Science Institute. “SynGatorTron can, for example, create digital diabetes patient medical records that have similar characteristics to a real population.”
Using this synthetic data, researchers can build tools, models, and tasks without risk or privacy concerns. These can then be used on real data to ask clinical questions, look for associations, and even explore patient outcomes.
Working with synthetic data also facilitates collaboration and sharing of models between different research institutions. And because the amount of data that can be synthesized is virtually unlimited, researchers can use SynGatorTron-generated data to augment small datasets of rare disease patients or minority populations to reduce model bias.
SynGatorTron was developed using the open source NVIDIA Megatron-LM and NeMo frameworks. It’s based on UF Health’s GatorTron model, announced last year at NVIDIA GTC. The models were trained on HiPerGator-AI, the university’s in-house NVIDIA DGX SuperPOD system, which ranks among the top 30 supercomputers in the world.
GatorTron-S, a BERT-style transformer model trained on synthetic data generated by SynGatorTron, will be available to developers next month on the NGC Software Hub.
SynGatorTron opens the door to solid training data
To a doctor, an AI-generated clinical note might seem useless at first glance: it doesn’t represent a real patient, and it won’t always read logically to an expert eye. A clinician therefore cannot analyze it directly or base a diagnosis on it. But for an untrained neural network, both real and synthetic clinical data are very valuable.
“SynGatorTron’s generative capability is a great enabler of natural language processing for medicine,” said Dr. Mona Flores, Global Head of Medical AI at NVIDIA. “Synthesizing different types of clinical records will democratize the ability to build all kinds of applications dependent on this data by addressing data scarcity and privacy.”
Once the model is available, research institutes outside of UF Health could fine-tune the pretrained SynGatorTron model with their own localized data and apply it to their AI projects. For example, if a given condition or patient population is underrepresented in a healthcare system’s clinical data, SynGatorTron could be prompted to generate additional data with characteristics of that disease or population.
These AI-generated records could then be used to complement and balance the real-life healthcare datasets used to train other neural networks, so they better represent the population.
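The balancing idea can be illustrated with a minimal Python sketch. It is not UF Health’s or NVIDIA’s method: the function names are hypothetical, the record format of (note text, group label) pairs is an assumption, and `placeholder_generator` only stands in for a generative model by returning dummy strings. The sketch counts how many records each group has and requests enough synthetic ones to bring every group up to the size of the largest.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Assumed record format for illustration: (note text, group label).
Record = Tuple[str, str]


def synthetic_needed(records: List[Record]) -> Dict[str, int]:
    """Return how many synthetic records each group needs to match the largest group."""
    counts = Counter(label for _, label in records)
    target = max(counts.values(), default=0)
    return {label: target - n for label, n in counts.items()}


def balance(records: List[Record],
            generate: Callable[[str, int], List[str]]) -> List[Record]:
    """Return the real records plus enough generated ones to even out group sizes."""
    augmented = list(records)
    for label, n in synthetic_needed(records).items():
        if n > 0:
            augmented.extend((text, label) for text in generate(label, n))
    return augmented


def placeholder_generator(label: str, n: int) -> List[str]:
    """Hypothetical stand-in for a generative model; produces dummy strings only."""
    return [f"[synthetic note {i} for {label}]" for i in range(n)]


# Toy dataset: three records from a common group, one from a rare group.
real = [("note a", "common"), ("note b", "common"),
        ("note c", "common"), ("note d", "rare")]
balanced = balance(real, placeholder_generator)
```

In this toy run the rare group receives two generated records, so both groups end up with three. Matching the largest group is only one possible target; a real project might instead aim for the group proportions of the wider population.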
Since synthetic training datasets mimic real medical notes without being associated with specific patients, they can also be more easily shared between research institutes without raising privacy concerns.
“When you have the ability to mimic population characteristics without being attached to real patients, it opens up the imagination to see if we can generate realistic datasets that allow us to answer questions we couldn’t otherwise, due to data access constraints or limited information on patients of interest,” Mitchell said.
One potential application is in clinical trials, which often divide patients into treatment and control groups to measure the effectiveness of a new drug. An application derived from data generated by SynGatorTron could analyze real records and create digital twins of patient records. These records could then be used as a control group in a clinical trial, instead of a control group formed by giving real patients a placebo treatment.
Researchers developing a deep learning model to study a rare disease, or the effects of a treatment on a specific population, could also use SynGatorTron for data augmentation, generating more training data to supplement the limited amount of actual medical records available.
Healthcare at GTC
Register for free for GTC, running online March 21-24, to learn about the latest in AI and healthcare. Hear from SynGatorTron collaborators during the session “A Next-Generation Clinical Language Model,” taking place March 23 at 7 a.m. Pacific.