Dataset Format Guidelines

Best Practices for Improving Readability of Data

We are unable to address data format issues, and may need to ask you to reformat improperly-formatted datasets. Please be sure to follow these guidelines when you format your dataset:

  • Single row for headings/column names. No repeated headings.
  • Keep headings short: use column headings of 1 or 2 words, and use a data dictionary to elaborate on each short heading. We’ll make sure the long version from the data dictionary makes its way onto figures, etc.
  • Include a separate document that defines values – a "data dictionary." See below for an example.
  • We cannot analyze "free form" or "text string" columns (such as "other," "explain," or "notes"), although you can leave them in the dataset for reference.
  • The computer ignores color, so don’t color-code data; any information conveyed only by color will be lost.
  • Stick to a coding convention. Entering "F" for one woman’s sex, "f" for another’s, and "Female" for another’s results in three types of females. Pick one convention and be consistent throughout a column. Capitalization matters!
  • No "special" characters, such as text accents.
  • File types that end with .xls, .xlsx, .csv, and .sas7bdat are good.
  • Include patient IDs, provider IDs, etc.
  • Do not include any Protected Health Information (PHI).
  • Missing data should be left blank, rather than coded as "99", "-99", ".", etc.
  • No characters in a numeric column/variable. If there are characters anywhere in a column (aside from the column name), the computer will treat the whole column as characters. Putting the word "missing" or "unknown" or the character "-" for missing values in a column will convert any numbers in that column to character expressions, which would be treated as categories, not numbers, in an analysis.
  • For numeric variables, don’t include units in the cell values, as they are characters. Include the units in the data dictionary instead, and we’ll put them on figures, tables, etc.
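
Several of the guidelines above can be checked before a dataset is submitted. The following is a minimal sketch in Python (standard library only), not a tool that CIDA provides; the sample data, the column names, and the helpers `is_number` and `check_columns` are hypothetical and chosen for illustration. It flags columns that mix numbers with characters and columns whose values differ only by capitalization.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data that violates several guidelines on purpose.
SAMPLE_CSV = """PtID,Sex,Weight
1,F,61.5
2,f,70 kg
3,Female,missing
4,M,
"""


def is_number(text):
    """Return True if the cell text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_columns(csv_text):
    """Return a list of (column, problem) pairs for a CSV given as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    for column in rows[0].keys():
        # Blank cells are the recommended way to mark missing data, so skip them.
        values = [(row[column] or "").strip() for row in rows]
        values = [v for v in values if v]
        numeric = [v for v in values if is_number(v)]
        text = [v for v in values if not is_number(v)]

        # A column with both numbers and characters will be read as characters.
        if numeric and text:
            problems.append((column, f"non-numeric values in numeric column: {text}"))

        # Values that differ only by capitalization count as separate categories.
        by_lowercase = defaultdict(set)
        for v in text:
            by_lowercase[v.lower()].add(v)
        for variants in by_lowercase.values():
            if len(variants) > 1:
                problems.append((column, f"inconsistent capitalization: {sorted(variants)}"))
    return problems


for column, problem in check_columns(SAMPLE_CSV):
    print(f"{column}: {problem}")
```

This sketch does not catch a synonym such as "Female" entered next to "F", nor numeric placeholder codes such as "99"; a frequency table of each column is still the simplest way to spot those.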

NOTE: This list is not exhaustive.

Data Dictionary Example

For a Ventricular Tachycardia Study
  • PtID: patient ID
  • Inst: institution ID
  • Gender: gender of patient
    • M=Male
    • F=Female
  • AblNum: ablation number
    • Numeric count
  • Fascic: tachycardia type
    • 1=Fascicular VT
    • 0=Other VT
  • Recur: tachycardia recurrence
    • 1=VT recurrent
    • 0=VT not recurrent
  • Follow_Up: follow-up time after this ablation
    • Time started with a successful ablation and ended when VT recurred (Recur=1) or
      when follow-up ended without recurrent VT (Recur=0)
  • Status: final patient status
    • 0=off meds, no VT
    • 1=off meds, intermittent VT
    • 2=on meds, no VT
    • 3=on meds, intermittent VT
    • 4=other
  • Other Variables…(add other variables here)
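
A data dictionary like the one above can also be kept in machine-readable form, which makes it possible to check each row of the dataset against the allowed codes. The sketch below is an illustration only, written in Python; the `DATA_DICTIONARY` layout and the `invalid_cells` helper are hypothetical, and only the coded variables from the example are included.

```python
# Allowed codes for the coded variables in the example data dictionary.
# Codes are stored as strings because cells read from a CSV file are strings.
DATA_DICTIONARY = {
    "Gender": {"M": "Male", "F": "Female"},
    "Fascic": {"1": "Fascicular VT", "0": "Other VT"},
    "Recur": {"1": "VT recurrent", "0": "VT not recurrent"},
    "Status": {
        "0": "off meds, no VT",
        "1": "off meds, intermittent VT",
        "2": "on meds, no VT",
        "3": "on meds, intermittent VT",
        "4": "other",
    },
}


def invalid_cells(row, dictionary=DATA_DICTIONARY):
    """Return {column: value} for cells whose value is not an allowed code.

    Blank cells are treated as missing data and are not reported.
    """
    bad = {}
    for column, codes in dictionary.items():
        value = row.get(column, "").strip()
        if value and value not in codes:
            bad[column] = value
    return bad


# Hypothetical row as read from a CSV file.
row = {"PtID": "17", "Gender": "f", "Fascic": "1", "Recur": "", "Status": "5"}
print(invalid_cells(row))  # {'Gender': 'f', 'Status': '5'}
```

The lowercase "f" is reported because the dictionary defines only "M" and "F", which matches the guideline that capitalization matters.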

Center for Innovative Design & Analysis (CIDA)

Formerly known as the Colorado Biostatistics Consortium (CBC)
13001 17th Place | Mail Stop B119 | Room 100, Building 406 | Aurora, CO 80045
303.724.2325

© The Regents of the University of Colorado, a body corporate. All rights reserved.