n2c2 NLP Research Data Sets

Description

Temporarily Unavailable

The n2c2 datasets are temporarily unavailable. If you are trying to access data from the 2019 Challenge, tracks 1 (Clinical Semantic Textual Similarity) and 2 (Family History Extraction) are available directly through Mayo Clinic.

The majority of these Clinical Natural Language Processing (NLP) data sets were originally created at a former NIH-funded National Center for Biomedical Computing (NCBC) known as i2b2: Informatics for Integrating Biology and the Bedside. Links to key citations are provided below.

2006 - Deidentification & Smoking
- Evaluating the state-of-the-art in automatic de-identification
- Identifying patient smoking status from medical discharge records
2008 - Obesity
- Recognizing Obesity and Co-morbidities in Sparse Data
2009 - Medication
- Extracting Medication Information from Clinical Text
- Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge
2010 - Relations
- 2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text
2011 - Coreference
- Evaluating the state of the art in coreference resolution for electronic medical records
2012 - Temporal Relations
- Evaluating temporal relations in clinical text: 2012 i2b2 Challenge
- Annotating temporal information in clinical narratives
2014 - Deidentification & Heart Disease
- Creation of a new longitudinal corpus of clinical narratives
- Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1
- Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus
2018 (Track 1) - Clinical Trial Cohort Selection
- Cohort selection for clinical trials: n2c2 2018 shared task track 1
2018 (Track 2) - Adverse Drug Events and Medication Extraction
- 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records

Based at Partners HealthCare System in Boston from 2004 to 2014, under the leadership of Principal Investigator Isaac Kohane, MD, PhD, and Executive Director Susanne Churchill, PhD, the i2b2 Center was a passionate advocate for the potential of existing clinical records to yield insights that directly improve healthcare. Recognizing the value locked in unstructured text, i2b2 provided sets of fully deidentified notes from the Research Patient Data Registry at Partners for a series of NLP Shared Task challenges and workshops, which were designed and led by Co-Investigator Özlem Uzuner, MEng, PhD, originally at MIT CSAIL and subsequently at SUNY Albany. Those notes were then made available to the community for general research purposes and have since enabled hundreds of journal and conference articles.

These data sets remain under the stewardship of the Department of Biomedical Informatics at Harvard Medical School, where Drs. Kohane and Churchill are Chair and Executive Director, respectively.

The NLP Shared Task challenges and workshops continue to be directed by Dr. Uzuner, now Department Chair and Associate Professor of Information Sciences and Technology in the Volgenau School of Engineering at George Mason University. Since 2018, they have been officially known as n2c2 (National NLP Clinical Challenges), a name that pays tribute to their i2b2 origins.

The software development component of the former i2b2 Center is now under the direction of the i2b2 tranSMART Foundation, a member-driven non-profit foundation developing an open-source / open-data community around the i2b2, tranSMART and OpenBEL translational research platforms.

Registration for Unstructured notes from the Research Patient Data Registry at Partners Healthcare (originally developed during the i2b2 project) is not open at this time.