The Center’s data warehouse in Health Care Policy maintains one of the most comprehensive data archives of population-based health care information for research purposes held within Harvard Medical School and Harvard University. We currently house: administrative billing claims at the national, state, and private payer levels; international, national, and regional survey data; clinical data at the procedure level in clinical registries; and linked billing and electronic health record data. In addition, we use resources housed in data centers at other institutions through contractual agreements to expand the breadth of our research.

Many projects use Centers for Medicare & Medicaid Services (CMS) data stored at the Center, as well as data accessed through the CMS Virtual Research Data Center (VRDC), a virtual research environment, and the National Bureau of Economic Research (NBER) center. The Center is in the process of acquiring multiple years of CMS data to be housed at the Markley Data Center enclave. Below is an overview of the data assets available for the Center's projects. Additional data set codebooks and manuals for data stored at HCP are available on the HCP Harvard Intranet CHDA site.

  • CMS Medicare Data

    The Center has projects accessing CMS Medicare and Medicaid data on remote enclaves through VRDC/ResDAC and NBER, as well as data purchased and stored on Health Care Policy servers. Medicare is the primary health insurance program for people age 65 or older, people under age 65 with disabilities, and people of all ages with End-Stage Renal Disease (ESRD) or Lou Gehrig’s disease (amyotrophic lateral sclerosis; ALS). These data cover over ten years of fee-for-service claims, starting in 2008. The CMS data under an HCP DUA can only be accessed by HCP faculty and programming staff in our level 4 data enclave at the Markley Data Center.

    Master Beneficiary Summary File (MBSF)

    This file includes the base Medicare A/B/D segment, which contains beneficiary enrollment information such as the beneficiary unique identifier, state and county codes, ZIP code, date of birth, date of death, sex, race, age, monthly entitlement indicators (A/B/C/D), reasons for entitlement, and monthly managed care indicators (yes/no). As of 2006, it includes variables specific to enrollment in Part D, and as of 2017 Part C was added. The MBSF has four segments: 1) Beneficiary Summary File or Medicare enrollment, 2) Chronic Conditions, 3) Cost & Utilization, and 4) NDI death information with ICD-10 cause of death through 2008.

    Institutional Claims Files

    These files contain claims from institutional providers and/or settings that are covered by Medicare Part A. In addition, claims for institutional-based services covered by the Medicare Part B benefit (e.g., home health, hospital outpatient) appear in the institutional claims file.

    Outpatient Claims File

    This file contains final action, fee-for-service claims data submitted by institutional outpatient providers. Examples of institutional outpatient providers include hospital outpatient departments, rural health clinics, renal dialysis facilities, outpatient rehabilitation facilities, comprehensive outpatient rehabilitation facilities, and community mental health centers.

    Carrier File

    This is also known as the Physician/Supplier Part B claims file and contains final action fee-for-service claims submitted on a CMS-1500 claim form. Most of the claims are from non-institutional providers, such as physicians, physician assistants, clinical social workers, and nurse practitioners; the file also includes claims from free-standing facilities.


    Medicare Provider Analysis and Review (MedPAR)

    This file contains inpatient hospital and skilled nursing facility (SNF) final action stay records. Each MedPAR record represents a stay in an inpatient hospital or SNF. Each MedPAR record may represent one claim or multiple claims, depending on the length of a beneficiary's stay and the amount of services used throughout the stay.

    Skilled Nursing Facilities

    This file contains information on services provided to Medicare beneficiaries residing in skilled nursing facilities. The SNF public use file contains information on utilization, payment (allowed amount, Medicare payment and standard payment), submitted charges and beneficiary demographic and chronic condition indicators organized by CMS Certification Number (6-digit provider identification number), Resource Utilization Group (RUG), and state of service.

    Durable Medical Equipment

    This claim file contains final action claims data submitted by Durable Medical Equipment (DME) suppliers. The information in this file includes diagnosis (ICD-9 diagnosis codes), services provided (Healthcare Common Procedure Coding System (HCPCS) codes), dates of service, reimbursement amount, DME provider number, and beneficiary demographic information. Each observation in this file is at the claim level. The data also include the Home Health, Hospice, Inpatient, Outpatient, and Skilled Nursing Facility claims files, as well as MedPAR.


    Hospice

    This file contains final action claims submitted by hospice providers. Once a beneficiary elects hospice, all hospice-related claims will be found in this file, regardless of whether the beneficiary is in Medicare fee-for-service or in a Medicare managed care plan.

    Home Health Agency (HHA)

    This file contains data from the Healthcare Cost Report Information System (HCRIS) Data Set and includes final action fee-for-service claims submitted by home health agencies. 

  • Linked Medicare Data

    CALGB Data

    The analytical dataset comprises two linked data sets: CMS data and Cancer and Leukemia Group B (CALGB) data from Duke University.

    Health and Retirement Study (HRS) Medicare Linked Data

    HRS endeavors to obtain information about health care costs and diagnoses from Medicare records. The HRS asks all respondents who are eligible for Medicare to provide their identification numbers; over 80% of them consent to do so. The current CMS data linkage includes HRS respondents interviewed through the 2012 wave who have consented to the Medicare data linkage. The HRS Medicare Claims and Summary Data Cross-Reference file is required to merge the CMS files created by Acumen LLC with HRS survey data. The data are linked with the Beneficiary Summary File, Carrier file, and DME file.

    SEER-Medicare Linked Database

    The current SEER-Medicare linkage is updated biennially. As of November 2014, the data includes all Medicare eligible persons appearing in the SEER data who were diagnosed with cancer through 2011, and their Medicare claims through 2013. There are a large number of people and records per person in the SEER-Medicare data. Given the vast amount of data, the term "SEER-Medicare data" actually refers to a series of files. One file includes SEER data, while the remaining files are the Medicare files for specific types of service, e.g. hospital, physician, outpatient, etc. 

    There are two cohorts of people included in the SEER-Medicare data -- persons with cancer and a random sample of Medicare beneficiaries who do not have cancer. The "non-cancer" group is drawn from a random 5 percent sample of Medicare beneficiaries residing in the SEER areas. Persons in the 5 percent sample who also appear in the SEER data are removed, leaving a sample of non-cancer cases. Medicare claims are available for the non-cancer cases in the same format as for the cancer cases.
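    The construction of the non-cancer group described above amounts to removing from the 5 percent sample anyone who also appears in SEER. The following is a minimal sketch of that step only, not the actual SEER-Medicare processing; the function name and the beneficiary identifiers are hypothetical and chosen purely for illustration.

```python
def build_non_cancer_sample(five_percent_ids, seer_ids):
    """Return members of the 5 percent sample who do not appear in SEER.

    Both arguments are iterables of beneficiary identifiers
    (hypothetical values here; real identifiers are restricted).
    """
    seer = set(seer_ids)  # set lookup keeps the filter linear in sample size
    return [bene_id for bene_id in five_percent_ids if bene_id not in seer]


# Illustrative, made-up identifiers
five_percent_sample = ["B001", "B002", "B003", "B004"]
seer_cases = ["B002", "B004", "B999"]

non_cancer = build_non_cancer_sample(five_percent_sample, seer_cases)
print(non_cancer)
```

    In this toy example, the two beneficiaries who appear in both lists are dropped, and the remaining members of the 5 percent sample form the non-cancer comparison group.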

    The PEDSF and SUMDENOM files are entitlement data, while the MEDPAR, Carrier, Outpatient, Home Health Agency, Hospice, Durable Medical Equipment, and Part D Event files are claims files. 

  • State Medicaid, Registry and All Payers Claims Data (APCD)

    Arkansas All Payer Claims Database

    The Arkansas APCD is a dynamic tool that enables the state to further its transparency objectives by collecting health care data from public and private sources and empowering Arkansans with information to better understand how and where healthcare is being delivered and how much is being spent.

    Colorado All Payers Claims Database

    These data were received through the Center for Improving Value in Health Care (CIVHC), a not-for-profit organization. Through services, health data, and analytics, CIVHC partners with change agents to drive towards three aims for all Coloradans: improving palliative care, transitions of care, and payment reform.

    Florida Center for Health Information and Policy Analysis

    The Emergency Department data cover hospitals that report emergency department data; data collection began with calendar year 2005. The Ambulatory Surgery data contain detailed patient data beginning in 1997 from freestanding ambulatory surgical centers, hospitals with outpatient services, cardiac catheterization centers, and lithotripsy centers.

    Massachusetts Cardiac Surgery and Intervention Registry

    The overall goal of this data collection is to enhance transparency and monitor the quality of cardiac services in Massachusetts. This reporting has created a mechanism to expand access to cardiac services across Massachusetts, and has positioned the Commonwealth as a leader in gathering and reporting on percutaneous coronary intervention (PCI).

    Massachusetts Center for Health Information and Analysis (CHIA)

    CHIA collects detailed financial and patient-level data from Massachusetts payers and providers. CHIA data include the All Payers Claims Database (APCD) and acute hospital case mix data, which comprise hospital inpatient administrative data, outpatient observation room data, and hospital emergency department data.

    New Jersey Department of Health and Senior Services/HCQA

    New Jersey data files for the years 2008-2013, released by Health Care Quality Assessment (HCQA). The files contain New Jersey Discharge Data Collection System records, including all relevant National Provider Identifiers.

    New York State Department of Health Medicaid Data 

    Medicaid confidential data from NYSDOH. These data include all information about a recipient or applicant, including enrollment information, eligibility data, and protected health information. Medicaid provider enrollment data include Participating Provider Network Reports.

  • Private Healthcare Data Sources

    FAIR Health, Inc

    The FAIR Health limited data subset is from the FH NPIC (National Private Insurance Claims) database. This database is a collection of privately billed medical and dental procedures based on data contributions from payors nationwide. FAIR Health collects all data fields reported on medical and dental claims, including diagnoses, procedures, dates and places of service, NDC codes, billed charges, allowed amounts reimbursed, and other information. FAIR Health licenses de-identified, claim-line-level data. The first data pull covers 2013 and 2014 for ten (unidentifiable) people in large geozips, with quarterized dates of birth and common diagnoses and procedures for the age bracket. The second pull covers 2013-2014 for three specific (unidentifiable) providers in large geozips that belong to differing large specialties, covering common diagnoses and procedures, with de-identified patient IDs and quarterized dates of birth.

    Truven Health Analytics MarketScan

    The MarketScan data warehouse is a family of databases that contain individual-level healthcare and dental claims, lab test results, health-risk assessments, absence, short-term disability, workers’ compensation, and hospital discharge information from large employers, managed care organizations, hospitals, and Medicare and Medicaid programs.  


    These data were provided by Anthem on behalf of Health Core, Inc. The data relate to CalPERS plans and Anthem plans from July 1, 2012 to June 30, 2015. Data elements include unique patient identifier, gender, age, CalPERS member flag, ZIP code of residence, copay, diagnosis codes, type of service, provider ID, and provider NPI.

  • Additional Data Resources

    Below is a sampling of additional resources used by the Center. More details for all data assets are available to Center researchers on the HCP internal site. A list of additional resources is at the bottom of this section.

    CAHPS Survey

    The Consumer Assessment of Healthcare Providers and Systems (CAHPS) survey is a multi-year initiative of the Agency for Healthcare Research and Quality (AHRQ) to support and promote the assessment of consumers' experiences with health care. The surveys ask consumers and patients to report on and evaluate their health care experiences, covering aspects of quality that consumers are best qualified to assess. 

    The Center is an integral participant in the CAHPS Medicare Advantage (MA) Plan survey for CMS and developed the CAHPS Analysis Program (SAS macro) for AHRQ. The MA survey collects information about Medicare beneficiaries’ experiences with, and ratings of, Medicare Advantage plans, Medicare Advantage Prescription Drug (MA-PD) plans, and stand-alone Medicare Prescription Drug Plans (PDP) via surveys of beneficiaries who have been enrolled in their plans for six months or longer. Although all three versions have a nearly identical set of core questions, each version also includes additional questions and response categories related to the enrollees' experiences in their own particular plan type. The health plan survey has been conducted annually since 1998, and the drug plan surveys were added in 2007.

    Cross-Wave Geographic Information (Detail)

    The University of Michigan Health and Retirement Study (HRS) is a longitudinal panel study that surveys a representative sample of approximately 20,000 people in America, supported by the National Institute on Aging (NIA U01AG009740) and the Social Security Administration. The Cross-Wave Geographic Information (Detail) data set contains state-level geographic information for respondents interviewed in 1992 through 2014, matching the 2014 tracker file. It is released in conjunction with four other files: the public Respondent Region and Mobility file and the restricted Cross-Wave Geographic Information (State), Parent State, and Child ZIP Code files. These five files contain all restricted geographic information available for HRS respondents as of 2014. The files are keyed on Household ID (HHID) and Person number (PN).
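    Because these files are keyed on HHID and PN together, records are matched on the pair rather than on either field alone. The following is a minimal sketch of such a composite-key join, assuming rows are held as dictionaries; apart from HHID and PN, the column names (age, state) and all values are hypothetical and chosen only for illustration.

```python
def merge_on_hhid_pn(survey_rows, geo_rows):
    """Inner-join two lists of dict rows on the composite key (HHID, PN)."""
    # Index the geographic rows by the (HHID, PN) pair for constant-time lookup.
    geo_by_key = {(row["HHID"], row["PN"]): row for row in geo_rows}
    merged = []
    for row in survey_rows:
        geo = geo_by_key.get((row["HHID"], row["PN"]))
        if geo is not None:  # keep only respondents present in both files
            merged.append({**row, **geo})
    return merged


# Made-up example rows; identifiers are kept as strings to preserve leading zeros.
survey = [
    {"HHID": "000001", "PN": "010", "age": 67},
    {"HHID": "000001", "PN": "020", "age": 64},
    {"HHID": "000002", "PN": "010", "age": 71},
]
geo = [
    {"HHID": "000001", "PN": "010", "state": "MA"},
    {"HHID": "000002", "PN": "010", "state": "NY"},
]

merged = merge_on_hhid_pn(survey, geo)
for row in merged:
    print(row)
```

    Matching on HHID alone would wrongly pair the two members of household 000001; using the pair keeps each person's record distinct.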

    Healthcare Cost and Utilization Project (HCUP)

    The HCUP data is a family of health care databases and related software tools and products developed through a Federal-State-Industry partnership and sponsored by the Agency for Healthcare Research and Quality (AHRQ). HCUP databases bring together the data collection efforts of State data organizations, hospital associations, private data organizations, and the Federal government to create a national information resource of encounter-level health care data (HCUP Partners). HCUP includes the largest collection of longitudinal hospital care data in the United States, with all-payer, encounter-level information beginning in 1988.

    HEDIS and Quality Measurement

    The Healthcare Effectiveness Data and Information Set (HEDIS) is a tool used by more than 90 percent of America's health plans to measure performance on important dimensions of care and service.

    MCBS Access to Care

    The Access to Care module contains information on beneficiaries' access to health care, satisfaction with care, and usual source of care. This module contains four panels of participants: the newly enrolled panel and the three continuing panels. The Access to Care module represents the always-enrolled population for a given year. This module is typically released within two years of the survey.

    MCBS Cost and Use

    The Cost and Use module contains complete expenditure and source of payment data on all health care services, including those not covered by Medicare. This module combines Medicare claims data with survey-reported health care events to produce a comprehensive picture of health services received, amounts paid, and sources of payment. This module contains the three continuing panels. The Cost and Use module represents the ever-enrolled population for a given year. This module is typically released within three years of the survey.

    Universidad del Desarrollo, Santiago, Chile

    De-identified data from a panel survey, conducted by the Chilean government, of Chileans who were exposed to the 2010 earthquake and tsunami.

    World Mental Health Survey

    De-identified data collected as part of the World Health Organization World Mental Health Initiative surveys. The surveys are being conducted in 27 countries and include nearly 200,000 respondents.

    Other Resources Used: