The National Health and Nutrition Examination Survey (NHANES) is a population survey conducted by the Centers for Disease Control and Prevention (CDC) to monitor the health of the United States population; its data are publicly available in hundreds of files. This Data Descriptor describes a single unified and universally accessible data file that merges 255 separate files and stitches together data from 4 surveys, encompassing 41,474 individuals and 1,191 variables. The variables consist of phenotype and environmental exposure information on each individual, specifically:

  • Demographic information
  • Physical exam results (e.g., height, body mass index)
  • Laboratory results (e.g., cholesterol, glucose, and environmental exposures)
  • Questionnaire items
The Data Descriptor also describes a dictionary that enables analysts to find variables by category and human-readable description.
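The merging and dictionary lookup described above can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the actual pipeline or schema of the unified file: the three small tables, the variable names, the dictionary entries, and the helper functions `merge_files` and `find_variables` are hypothetical. The only NHANES-specific detail is that records are keyed by the respondent sequence number (SEQN).

```python
from collections import defaultdict

# Hypothetical miniature "files": each maps a participant ID (SEQN) to variables.
demographics = {1: {"age": 34, "survey": "1999-2000"},
                2: {"age": 61, "survey": "2001-2002"}}
body_measures = {1: {"bmi": 24.1}, 2: {"bmi": 29.8}}
lab_results = {1: {"glucose": 92}, 3: {"glucose": 110}}


def merge_files(*files):
    """Outer-join several {seqn: {variable: value}} tables into one table."""
    merged = defaultdict(dict)
    for table in files:
        for seqn, row in table.items():
            merged[seqn].update(row)  # participants missing from a file keep gaps
    return dict(merged)


# Hypothetical dictionary entries: variable name -> (category, description).
data_dictionary = {
    "age": ("demographics", "Age in years at screening"),
    "bmi": ("body measures", "Body mass index (kg/m^2)"),
    "glucose": ("biochemistry", "Fasting glucose (mg/dL)"),
}


def find_variables(dictionary, category=None, keyword=None):
    """Return variable names matching a category and/or a description keyword."""
    matches = []
    for name, (var_category, description) in dictionary.items():
        if category is not None and var_category != category:
            continue
        if keyword is not None and keyword.lower() not in description.lower():
            continue
        matches.append(name)
    return matches


merged = merge_files(demographics, body_measures, lab_results)
glucose_vars = find_variables(data_dictionary, keyword="glucose")
```

In this sketch, a participant who appears in only one file (SEQN 3) still gets a row in the merged table, with the other variables simply absent.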

The datasets are available on DataDryad, and a hands-on analytics tutorial is available on GitHub. Through a new big data platform, the BD2K Patient Centered Information Commons (http://pic-sure.org), we provide a way to browse the dataset in a web browser and an application programming interface (API) for programmatic access.


The integration of clinical and biomedical data hosted in multiple distributed repositories faces two significant challenges: i) correctly linking information pertaining to the same patient across repositories, for example, linking lab results with bedside observations; and ii) making data available for analysis at different locations across a collaboration network. These problems are exacerbated in rare disease research, given the very limited availability of data sets and data standards.

We propose to develop the NCAT Global Repository for Rare Diseases Research (GRDR) on the BD2K PIC-SURE platform to address these challenges. The NCAT GRDR will be a scalable, secure, and flexible integration architecture for clinical and biomedical datasets. By extending the successful i2b2/tranSMART platform, it will allow data providers to share their data easily with the wider research community without requiring them to subscribe to proprietary vocabulary standards or to develop complex mapping protocols. Using federated data access and querying methods that retrieve relevant data from different locations before combining them, GRDR will make it possible to run comparative analysis methods on the integrated datasets. By assigning generic identifiers (after de-identification) to related data across locations, GRDR will ease the difficulties of linking data while conforming to patient data privacy requirements and other security regulations.
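As an illustration of the two ideas above (generic identifiers after de-identification, and combining data retrieved from several locations), here is a minimal sketch. It is not the GRDR or PIC-SURE design: the keyed-hash approach, the key, the field names, and the functions `generic_id`, `deidentify`, and `federated_link` are all assumptions made for the example. It also assumes that every site can derive the same stable input string for a given patient, which in practice is the hard part.

```python
import hashlib
import hmac

# Hypothetical secret shared by participating sites; not for real use.
SHARED_KEY = b"example-key-not-for-production"


def generic_id(local_patient_id, key=SHARED_KEY):
    """Derive a pseudonymous linking ID with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, local_patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(records, key=SHARED_KEY):
    """Replace each record's local patient_id with its generic ID ('gid')."""
    cleaned = []
    for record in records:
        rest = {k: v for k, v in record.items() if k != "patient_id"}
        cleaned.append({"gid": generic_id(record["patient_id"], key), **rest})
    return cleaned


def federated_link(sites):
    """Collect records from every site and combine those sharing a 'gid'."""
    linked = {}
    for site_records in sites:
        for record in site_records:
            linked.setdefault(record["gid"], {}).update(record)
    return linked


# Two hypothetical repositories holding different data on overlapping patients.
lab_site = deidentify([{"patient_id": "MRN-001", "hemoglobin": 13.2}])
bedside_site = deidentify([
    {"patient_id": "MRN-001", "heart_rate": 72},
    {"patient_id": "MRN-002", "heart_rate": 80},
])

linked = federated_link([lab_site, bedside_site])
```

Because both sites compute the same generic ID for the same patient, the lab and bedside records can be joined without the local identifier ever leaving its site.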


The Exposome Data Warehouse (EDW) is a unified database of environmental information that enables quick data linkage between geolocated environmental information and individual-level data (i.e., from electronic health records). Currently, the EDW contains EPA air data, NOAA weather data, and American Community Survey socioeconomic and demographic data.
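The kind of linkage described above can be sketched as a join on location and date. This is a minimal illustration, not the EDW schema: the use of ZIP code as the geographic unit, the field names, the sample values, and the function `link_exposures` are all hypothetical.

```python
from datetime import date

# Hypothetical environmental table keyed by (ZIP code, date).
environment = {
    ("02115", date(2016, 7, 1)): {"pm25_ug_m3": 8.4, "max_temp_c": 29.0},
    ("02115", date(2016, 7, 2)): {"pm25_ug_m3": 12.1, "max_temp_c": 31.5},
}

# Hypothetical individual-level records, e.g. from electronic health records.
visits = [
    {"patient": "A", "zip": "02115", "visit_date": date(2016, 7, 2)},
    {"patient": "B", "zip": "99999", "visit_date": date(2016, 7, 2)},
]


def link_exposures(visits, environment):
    """Attach same-place, same-day environmental values to each visit."""
    linked = []
    for visit in visits:
        key = (visit["zip"], visit["visit_date"])
        exposure = environment.get(key, {})  # no match: visit kept without exposures
        linked.append({**visit, **exposure})
    return linked


linked_visits = link_exposures(visits, environment)
```

Visits with no matching environmental record are kept unchanged, so the analyst can see which individuals lack exposure data.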

Please visit https://github.com/hms-dbmi/exposomeDW_public for tutorials and other EDW-related information.