It has been a few months since my last post, so you might be asking, “Girl, what have you been up to lately?” Well, let me tell you….
The fall semester is always crazy busy. Classes begin. Our annual symposium is in full swing. And, the conferences. So many conferences.
One unexpected meeting to hit my calendar this fall was the Million Veteran Program (MVP) Phenotype Workshop (Emerging Innovations and Future Directions for Use of Electronic Health Data) held September 12-13, 2016, in Washington, DC (Figure 1). This two-day workshop was a meeting of all alpha and beta study sites that will be accessing the MVP dataset for the first rounds of genetic association studies for precision medicine research.
For those of you not familiar with the MVP, it is one of the largest existing biobanks linked to electronic health records (EHRs) in the United States (PMID:26441289). Participants in MVP are Veterans who access the Department of Veterans Affairs health care system. The MVP biobank is large, diverse (by gender, race/ethnicity, geography, and other demographics), and rich in longitudinal EHR data. The latter attribute is especially exciting—the VA is the largest integrated national health system in the United States, and its electronic health records span at least 20 years.
My colleague Dr. Sudha Iyengar (Figure 2) and I were invited to the MVP Phenotype Workshop as representatives and Case Western Reserve University (CWRU) co-investigators of the beta site “Genetic Risk for AMD in Diverse Veteran Populations,” a grant awarded to Drs. Neal Peachey and Eric Konicki of the Louis Stokes Cleveland VA Medical Center. Our group was quickly dubbed “the eye group” because our phenotype was unique among the funded MVP projects. Despite our differences in disease of interest, all groups funded to access the MVP have one need in common: the need for data to be extracted from the EHR.
Oh, boy, this is easier said than done. I know first-hand from several experiences that EHR data extraction for research purposes is very, very difficult. As an investigator in phases I and II of the electronic Medical Records & Genomics (eMERGE) Network and one of the PIs of the Coordinating Center, I know it takes a village to first define the variable, let alone to devise a strategy for its extraction and validation. And, in my previous work as a PI of the Population Architecture using Genomics and Epidemiology (PAGE) I study, we found that our EAGLE BioVU EHR dataset often suffered from missingness and lack of structure and standardization, particularly for behavior/lifestyle and exposure data. Data extraction and data cleaning in EHRs are time- and resource-intensive endeavors, to say the least.
Based on these ongoing experiences with EHRs in genomic research, I was quite curious to see how the MVP as a group was going to approach these problems. I have to say that after a day and a half of the MVP Phenotype Workshop, I was pleasantly surprised, nay, impressed! The MVP Phenotype Workshop organizers were obviously well versed in the problems and potential of using the VA’s EHRs for research, and they assembled a great group of experts who presented data and approaches on a host of topics, ranging from the various perspectives on phenotyping, innovations in electronic and automated phenotyping, and demonstrations of existing tools and methods to discussions of platforms and infrastructures available and on the horizon.
I especially enjoyed the presentation given by Dr. Katherine Liao on the Applications of Electronic Medical Record Phenotype Algorithms for Clinical Research in the Cross-Cutting Innovations in Phenotyping session moderated by Dr. Kelly Cho. Dr. Wendy Chapman also gave a very nice presentation via telephone on Deep Phenotyping Leveraging Expert Knowledge, where she touched on work to extract social support from clinical free text. Dr. Chapman’s accounts of extracting these important determinants of health are very much in agreement with our recent experiences in extracting socioeconomic status from the clinical notes (PMID:27896978). The highlight of Day 1 was the breakout sessions. I attended the Challenges in Phenotyping in Disease Domains session, where I sympathized with Dr. David Gagnon, Chief Data Scientist & Biostatistician of the Boston VAMC, as he stepped through the process of extracting and cleaning a single lab value from the EHR. By far, the highlight of Day 2 of the workshop for me was Dr. Georgia Tourassi’s Phenotype Informatics @Scale: The DOE Experience. Dr. Tourassi quickly stepped through examples where big data mining approaches could be applied to large-scale and cost-effective phenotyping efforts and association studies [such as mining obituaries on the internet to replicate the association between parity and cancer risk (PMID:26615183)].
All in all, the meeting was informative, if a bit overwhelming (the VA jargon alone requires another workshop, in my opinion!). I was very much encouraged, though, by the overall discussions. Many involved in the MVP realize that resources and infrastructure must be made available to investigators for the MVP dataset to reach its fullest potential. These will necessarily include phenotyping and the creation of a database of clean MVP EHR variables. I cannot tell you how big a deal this is for the individual investigator. I have worked with biobanks linked to EHRs that did not provide this infrastructure. In this kind of setting, it’s every investigator for himself or herself, which effectively translates into wasting resources on duplicated efforts. Kudos to the MVP for thinking ahead! Of course, there’s much work to be done before the resources are a reality. But I think this group can do it (Figure 3).