Preparing Research Data for Public Use

The completion of a research project typically involves producing reports for publication as well as creating public use files. Some academic journals require that a researcher’s data be made available to reviewers or to the general public so that the results can be replicated. In some cases, researchers may be required to create public use files from their data as a condition of receiving a grant.

Preparing research data files for public use means removing or recoding information that could be used to determine the identity of the research subjects. Research data are described as “de-identified” when all information that could be used, directly or indirectly, to identify an individual has been removed. In the Federal IRB regulations, de-identified research information is defined as “information recorded by the investigator in such a manner that subjects cannot be identified, directly or through identifiers linked to the subjects.” The HIPAA Privacy Rule for protected health information specifies eighteen categories of information that must be removed in order to de-identify data. These are:

  • Names
  • All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP Code, and their equivalent geographical codes, except for the initial three digits of a ZIP Code if, according to the current publicly available data from the Census Bureau:
       - The geographic unit formed by combining all ZIP Codes with the same three initial digits contains more than 20,000 people; and
       - The initial three digits of a ZIP Code for all such geographic units containing 20,000 or fewer people are changed to 000
  • All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
  • Telephone numbers
  • Facsimile numbers
  • Electronic mail addresses
  • Social security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Account numbers
  • Certificate/license numbers
  • Vehicle identifiers and serial numbers, including license plate numbers
  • Device identifiers and serial numbers
  • Web universal resource locators (URLs)
  • Internet protocol (IP) address numbers
  • Biometric identifiers, including fingerprints and voiceprints
  • Full-face photographic images and any comparable images
  • Any other unique identifying number, characteristic, or code, unless otherwise permitted by the Privacy Rule for re-identification
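
The ZIP Code, date, and age rules in the list above are the ones that call for recoding rather than simple deletion. A minimal Python sketch of that recoding follows. The function names and the population table are hypothetical and used only for illustration; real population counts would have to come from current Census Bureau data. The sketch does not cover the requirement to also suppress years that would reveal an age over 89.

```python
from datetime import date

# Hypothetical population counts per three-digit ZIP prefix; real values
# would come from current Census Bureau data.
ZIP3_POPULATION = {"100": 1_500_000, "036": 15_000}

POPULATION_THRESHOLD = 20_000


def generalize_zip(zip_code, zip3_population=ZIP3_POPULATION):
    """Return the three-digit ZIP prefix, or "000" for sparsely populated areas."""
    prefix = zip_code.strip()[:3]
    # Unknown prefixes are treated as population 0, so they fall back to "000".
    if zip3_population.get(prefix, 0) > POPULATION_THRESHOLD:
        return prefix
    return "000"


def generalize_age(age):
    """Aggregate ages over 89 into a single "90+" category."""
    return "90+" if age > 89 else str(age)


def generalize_date(d):
    """Keep only the year of a date tied to an individual."""
    return d.year
```

For example, generalize_zip("10027") keeps the prefix "100" because the assumed population for that prefix exceeds 20,000, while generalize_zip("03601") is recoded to "000".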

In addition to removing the personally identifiable information (PII) described above, researchers must also examine their data carefully to determine whether small cell sizes for certain variables might make it possible to infer the identity of individuals who have participated in their research study. In such cases, researchers should use “top” or “bottom” coding to collapse outlying high or low values into categories with sufficient numbers of cases.
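
As a rough illustration of top and bottom coding, the Python sketch below clamps outlying values to chosen bounds and then checks whether any value still occurs in too few cases. The function names, the example incomes, the bounds, and the minimum cell size of 2 are all assumptions made for this example; appropriate thresholds depend on the data set and on any applicable disclosure rules.

```python
from collections import Counter


def top_bottom_code(values, lower, upper):
    """Clamp values outside [lower, upper] to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def small_cells(values, min_cell_size):
    """Return the values that occur fewer than min_cell_size times."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n < min_cell_size}


# Hypothetical income data with one low and one high outlier.
incomes = [18_000, 22_000, 22_000, 25_000, 25_000, 25_000, 410_000]

# Before coding, the two outliers each sit alone in their own cell.
flagged_before = small_cells(incomes, min_cell_size=2)

# After coding, 22,000 stands for "22,000 or less" and 25,000 for
# "25,000 or more", and no cell has fewer than two cases.
coded = top_bottom_code(incomes, lower=22_000, upper=25_000)
flagged_after = small_cells(coded, min_cell_size=2)
```

In a released file, the clamped values would normally be labeled as open-ended categories so that users do not mistake them for exact figures.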
