In biomedical research, “precision medicine” is the buzz-word or phrase du jour permeating the latest conference abstracts, manuscripts, and grant proposals. Precision medicine is broadly defined as using a data-driven approach to offer tailored treatments or prevention strategies to patients. The recent popularity of the precision medicine term is likely due to the 2015 White House launch of the Precision Medicine Initiative (PMI). The PMI allocated federal funds to several agencies including the National Institutes of Health (NIH), National Cancer Institute (NCI), Food and Drug Administration (FDA), and Office of the National Coordinator for Health Information Technology (ONC) to establish infrastructure and research programs to accelerate the availability and delivery of tailored medical treatment to patients.
Precision medicine research, of course, did not develop overnight. Several groups, in fact, have been laying the foundation for local precision medicine implementation efforts. At our first Institute for Computational Biology Symposium in 2015, we heard about these efforts at Geisinger Health System from Dr. Marylyn Ritchie, then Paul Berg Professor of Biochemistry & Molecular Biology at the Pennsylvania State University and Director of Biomedical and Translational Informatics at Geisinger Health System. Dr. Ritchie described Geisinger’s MyCode Community Health Initiative, a biobank of biospecimens from consented patients seen by providers in the Geisinger integrated health system serving central and northern Pennsylvania (PMID:26866580). These biospecimens are linked to the patients’ electronic health records (EHRs), and these linked data can be accessed for research purposes. As of September 2015, ~30,000 patient DNAs had genome-wide genotype data, which contributed to multiple genome-wide association studies for common clinical conditions ranging from cataracts (PMID:25982363) to resistant hypertension (PMID:28222112).
Dr. Ritchie also reported at the time that ~50,000 patient DNAs had whole-exome sequence (WES) data. The sequencing efforts represent a collaboration between Geisinger Health System and Regeneron Genetics Center, a wholly owned subsidiary of Regeneron Pharmaceuticals. This collaboration, known as DiscovEHR, began in 2014 with MyCode participants consented for broad genomic research, re-contact, and return of clinically actionable results. Following her Cleveland presentation in 2015, Dr. Ritchie announced the availability of DiscovEHR for research at the 2016 American Society of Human Genetics (ASHG) meeting in Vancouver, Canada (Figure). Soon after ASHG, the first set of clinically relevant analyses in DiscovEHR was published in companion Science articles (PMID:28008009 and PMID:28008010) describing exome-based discovery efforts for HDL-C, LDL-C, triglycerides, and cholesterol levels; the frequency of potential loss of function (pLoF) rare variants; and the frequency of actionable or clinically returnable genetic variants, the latter of which included the frequency of mutations that cause familial hypercholesterolemia.
On the surface, these reports may seem nothing more than standard genome-wide association studies and counting variants in a large population. However, both the scale and the depth of the data set this resource apart from others. In comparison to DiscovEHR’s sample size of ~50,000, there are larger DNA sample collections, including EHR-linked biobanks such as the Veterans Administration’s Million Veteran Program (~400,000; PMID:26441289), Vanderbilt University Medical Center’s BioVU (~225,000), and Kaiser Permanente’s Genetic Epidemiology Research on Adult Health and Aging (~110,000; PMID:26092718); epidemiologic cohorts such as the UK Biobank (~500,000); and commercial juggernauts 23andMe (~1 million) and Ancestry.com (>1 million). While these biobanks dwarf DiscovEHR in the number of DNA samples collected, some, such as BioVU, are limited in the number of samples with genomic data, and all are primarily limited to genome-wide genotype data. Furthermore, the commercial biobanks are currently limited to self-reported health and lifestyle data. The exquisitely phenotyped UK Biobank allows for exome sequencing but is only doing so on a project-by-project basis, and the Million Veteran Program has consented participants for both whole-exome and whole-genome sequencing, but it is not clear when those data will be generated.
As of today, DiscovEHR is the only game in town that offers sequence-level data linked to clinically collected health outcomes for precision medicine research. The value of this resource cannot be overstated. The one-two punch of sequencing data and EHR data in DiscovEHR provides a much-needed catalog of the phenotypic consequences of DNA changes at the individual as well as the population level. Data from electronic health records are far from perfect, but their depth and longitudinal potential, given a relatively stable patient population, allow investigators such as Dr. Ritchie to ask seemingly simple questions that have remained inadequately answered due to lack of data. The initial DiscovEHR analyses demonstrate potential in identifying drug targets via patients homozygous for pLoF rare variants (PMID:26933753) and in characterizing the penetrance of rare variants in disease-associated genes. Near-term data mining will likely extend these analyses to sort clinically actionable variants from variants of unknown significance. Longer-term data mining will likely include further genomic discovery studies with an emphasis on potential drug targets, pharmacogenomics, and the consequences of pleiotropy.
Perhaps one major deficiency of DiscovEHR is the lack of diversity in the study population with respect to race/ethnicity. Approximately 98% of the study participants are European American, consistent with the demographics of the ascertainment sites (93% European American). It is interesting to note that 3% of Geisinger’s patient population is African American while only 1% of DiscovEHR is African American. Regeneron recently partnered with Mount Sinai in New York to perform WES on ~33,000 consented participants from BioMe, an EHR-linked biobank known for its diverse study population. We aim to keep up with Dr. Ritchie and her colleagues as they further mine DiscovEHR in parallel with other WES-EHR efforts in more diverse settings. These large-scale WES-EHR efforts and their anticipated discoveries will undoubtedly dominate and influence the discussion and direction of precision medicine for at least the next year.