It is important to organize your data in
a way that facilitates transfer to our biostatisticians, other investigators, or other computers. Well-defined, well-organized data
minimize confusion and data errors.
You are encouraged to use REDCap for data collection to minimize data entry errors, reduce risks to patient confidentiality, and ease data transfer for statistical analysis.
Recommendations for Organizing Data
Our recommendations have proven effective for moving data from
point to point in a structured manner. A reasonable data organization scheme should minimize the amount of editing needed at the receiving end of your data transfer.
Table 1 illustrates three types of
variables in a structure that lends itself to simple data transfer and minimal editing:
- Identification (PatID) variables: uniquely identify aspects of an individual record (row of data), for instance, subject #, clinic #, or PatID.
- Time-stable variables: include characteristics that remain constant for an individual subject when observed over time, for instance, baseline demographics (age, sex, race) or study group (A, B).
- Longitudinal variables: potentially change over time, for instance, weight, adolescent height, muscle tone, lab values (cholesterol, blood sugar, etc.).
In this example,
the structure has one column for identifying an individual (Subject),
two columns for time-stable characteristics (Trt, Sex), and two columns for longitudinal
characteristics (Time, Weight). Note that
the values of Subject and Time together uniquely identify each row.
Other designs will require different data structures, but each measured response must
be uniquely associated with only one subject, visit, or test.
Most statistical software packages (e.g.
SAS, SPSS, S-PLUS, R, and Stata) require data in a
rectangular format, where each row is a unique observation and each column is a
separate variable. When organizing data
into a rectangular format, two rules apply. First, each row contains one (and only one) unique
observation. In the example, each row
contains a unique combination of subject, time, and treatment. Second, each
column contains one (and only one) variable or response.
Table 1: Example of a Rectangular Table
(in a separate worksheet):
Trt: Treatment, 0=Placebo, 1=Drug; Sex: 0= Women, 1=Men; Time: Time in Study in weeks; Weight: Body weight in pounds
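The rectangular layout and the unique-key rule described above can be checked mechanically before a file is handed over. The following is a minimal sketch using only the Python standard library. The data values are invented for illustration, and the function names (`load_rows`, `duplicate_keys`, `is_rectangular`) are hypothetical helpers, not part of any package mentioned here.

```python
import csv
import io
from collections import Counter

# Hypothetical long-format data laid out like Table 1.
# The values are invented for illustration only.
RAW = """Subject,Trt,Sex,Time,Weight
1,0,0,0,150
1,0,0,4,148
2,1,1,0,201
2,1,1,4,
"""


def load_rows(text):
    """Read CSV text into a list of dicts, one dict per observation (row)."""
    return list(csv.DictReader(io.StringIO(text)))


def duplicate_keys(rows, key_columns=("Subject", "Time")):
    """Return the key combinations that occur in more than one row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def is_rectangular(text):
    """True if every data line has as many fields as the header line."""
    lines = list(csv.reader(io.StringIO(text)))
    width = len(lines[0])
    return all(len(line) == width for line in lines[1:])


rows = load_rows(RAW)
print("rectangular:", is_rectangular(RAW))
print("duplicate keys:", duplicate_keys(rows))
```

An empty list from `duplicate_keys` means that Subject and Time together identify each row. Any key it returns points to rows that need to be reconciled before analysis.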
Please note the following points, many of which are illustrated in Table 1:
- The data table is rectangular: rows represent observations and columns
represent variables. Some columns identify the observation and others
contain a measured response. All data contained in one
- Patient ID numbers are used; Protected Health Information (PHI) is not
included. Names should not be included
in your analysis database, to avoid unnecessary risks
to patient confidentiality (see Table 2).
- The unique key for each row consists of two variables (columns): PatID and Time.
- Characters (A, AB, O) and numeric values (0, 1, 2) are not mixed within one column. Where possible, a number has been chosen in
place of a character. Definitions of numeric codes, units for
continuous data, and explanations of abbreviated variable names should be
provided separately in a codebook.
- Missing data: Note that none of the variables that uniquely identify the subject and the conditions under which measurements were taken (ID, Trt, Time) are missing. A character value (e.g. "missing",
"dk", "x") or the numeric value zero
(0) should not be used to indicate missingness for a continuous variable (e.g. the variable "Weight" in Table 1).
Table 2: Identifiable PHI Information
- Fax number
- Phone number
- E-mail address
- Account numbers
- Social Security number
- Medical Record number
- Health Plan number
- Certificate/license numbers
- IP address
- Vehicle identifiers
- Device ID
- Biometric ID
- Full face/identifying photo
- Other unique identifying number, characteristic, or code
- Postal address (geographic subdivisions smaller than a state)
- Date precision beyond year
- Before data collection begins, you should give special attention to how an assay value below the detection limit will be indicated in the
data, and how it should be treated in the statistical analysis. The same applies
to other left-censored or right-censored values.
- Column headers are variable names, not descriptions. Variable descriptions can be provided separately in a "codebook" (or a separate worksheet in the same workbook). In general, variable names must:
  - Be 8 characters or less in length
  - Consist of one word (i.e. no spaces)
  - Be unique (not duplicated across multiple columns)
  - Begin with a letter, not a number
  - Contain no special characters: commas, quotes, apostrophes, periods, underscores.
- Avoid using punctuation or spaces (e.g. commas, quotes, <, >).
- Avoid using special formatting such as colored text, highlighted columns, italics, bolding, superscripts
or subscripts, and the "comment" feature.
- Keep notes about patients in a separate column from the data used in
analysis (e.g. "scheduled to come in again for repeat lab"). If information in the notes needs to
be analyzed, it should be coded into one (or more) variable column(s).
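The variable-name rules and the missing-data guidance in the list above lend themselves to a quick automated check. The following is a minimal sketch in standard-library Python. The function names are hypothetical, and treating a blank cell as the missing-value marker is an assumption made for this example. Agree on the actual convention with your statistician.

```python
import re

# Letters and digits only, starting with a letter, at most 8 characters.
# Underscores are rejected because the list above names them as special
# characters.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{0,7}")


def check_variable_names(headers):
    """Return a list of (name, problem) pairs; empty if all names pass."""
    problems = []
    seen = set()
    for name in headers:
        if not NAME_PATTERN.fullmatch(name):
            problems.append((name, "invalid name"))
        if name in seen:
            problems.append((name, "duplicate name"))
        seen.add(name)
    return problems


def parse_continuous(cell):
    """Convert one cell of a continuous variable to a float.

    Assumption for this sketch: a blank cell means "missing" and is
    returned as None. Text placeholders such as "missing", "dk", or "x"
    raise ValueError so they are corrected at the source.
    """
    text = cell.strip()
    if text == "":
        return None
    return float(text)


print(check_variable_names(["PatID", "Trt", "Sex", "Time", "Weight"]))
print(check_variable_names(["Body Weight", "2ndVisit", "Trt", "Trt"]))
print([parse_continuous(c) for c in ["150", "", "148.5"]])
```

Note that a check like this cannot tell a true zero from a zero entered to mean "missing". That distinction has to be enforced at data entry.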
If considered in enough detail
before your data collection process begins, organizing experimental
data is relatively simple. Whether or
not you have questions about how to organize and
manage your data efficiently, consulting with a statistician before your experiment begins is a
good idea. These matters can usually be resolved
in a short time, with satisfactory results for all concerned. Biostatisticians often oversee the data
collection, storage, and retrieval systems for clinical studies. The study biostatistician can distinguish
between essential and non-essential data, and can therefore limit the data
collection systems to relevant information.
Limiting the amount of data collected
makes it easier to assure data quality, minimize missing data, and
pre-define the analysis data sets so that, upon study completion, data
analysis is straightforward. Developing
an effective data collection and management system is a key step in assuring
the ultimate integrity of your study. Dataset
planning can be iterative, involving meetings between the statistician,
investigator, and informatics manager.
Specific instances
in your planning phase where a statistician’s input would be valuable include:
- Design data collection forms
- Outline data collection/management systems
(include variable name, specify variable type, e.g. date, numeric,
- Design, implement, and conduct a data quality monitoring system for a study
- Outline how and when data abstraction should
occur for interim analyses
- Provide input on parameters that would help to
ensure data quality control
- All data should be securely stored, and access should be restricted to those individuals entering data.
- Dispose of paper and electronic files properly, keep paper copies in a locked cabinet, and store electronic files on a secure-access central server.
- Keep in mind the Minimum Necessary Principle of the Health Insurance Portability and Accountability Act (HIPAA) when deciding which variables to include in your database.
- Use or disclose only information necessary to the task. It is important to exclude unnecessary items that make information identifiable to ensure privacy, security and patient confidentiality.
- Identifiable information includes the items listed in Table 2. If identifiable information is necessary for research (e.g. birth date, visit date, physical address), take the necessary precautions to protect the database: strong passwords, anti-virus software, data backup, possibly encryption, and great caution with email.
- Refer to COMIRB and HIPAA for additional stipulations.
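As one illustration of the Minimum Necessary Principle, the sketch below removes identifying columns from an analysis copy of the data, replaces a record number with a study ID, and reduces dates to the year. It is a hedged example only. The column names (`MRN`, `Name`, `VisitDt`, and so on) and the `deidentify` function are hypothetical, dates are assumed to be in YYYY-MM-DD format, and the steps shown do not by themselves satisfy COMIRB or HIPAA requirements.

```python
# Hypothetical column names that would hold identifiers listed in Table 2.
PHI_COLUMNS = {"Name", "Phone", "Email", "MRN", "Address"}


def deidentify(rows, id_column="MRN", date_columns=("VisitDt",)):
    """Return (cleaned_rows, linking_key) for a list of row dicts.

    - The value in id_column is replaced by a sequential study ID (PatID).
    - Columns named in PHI_COLUMNS are dropped.
    - Dates (assumed YYYY-MM-DD) are reduced to the four-digit year.
    The linking key maps original IDs to PatIDs and must be stored
    separately from the analysis data, under restricted access.
    """
    key = {}
    cleaned = []
    for row in rows:
        # Reuse the PatID if this subject was seen before, else assign the next one.
        pat_id = key.setdefault(row[id_column], len(key) + 1)
        new_row = {"PatID": pat_id}
        for column, value in row.items():
            if column in PHI_COLUMNS:
                continue
            if column in date_columns:
                value = value[:4]
            new_row[column] = value
        cleaned.append(new_row)
    return cleaned, key


# Invented example records, for illustration only.
records = [
    {"MRN": "A17", "Name": "X", "VisitDt": "2020-03-05", "Weight": "150"},
    {"MRN": "B22", "Name": "Y", "VisitDt": "2020-04-01", "Weight": "201"},
    {"MRN": "A17", "Name": "X", "VisitDt": "2021-01-09", "Weight": "148"},
]
analysis_rows, linking_key = deidentify(records)
print(analysis_rows)
```

If the analysis needs visit timing, a derived variable such as time in study (in weeks) can be computed before the dates are reduced, so the full dates never need to enter the analysis file.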