Preparing your data for analysis

Preparing your data for analysis usually follows these steps:

  • creating a data dictionary; this is also important for reproducibility and reuse of data;
  • creating a working copy of the data set, while making sure that you keep the raw data intact;
  • cleaning your data in the working file, while documenting all cleaning steps in a separate file that you archive;
  • creating an analysis file, while keeping the cleaned data set intact for archiving purposes;
  • preserving your raw data and intermediate data sets.
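
As an illustration of the second step, the sketch below makes a working copy and protects the raw file from accidental edits. It assumes the raw data is a single file on disk. The function names `sha256_of` and `make_working_copy` are hypothetical and not part of any prescribed tooling. Storing the checksum with your documentation lets you verify later that the raw file has not changed.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(raw_path: Path, work_dir: Path) -> Path:
    """Copy the raw file into work_dir and leave the raw file read-only."""
    work_dir.mkdir(parents=True, exist_ok=True)
    working_path = work_dir / raw_path.name
    shutil.copy2(raw_path, working_path)
    # Remove write permission from the raw file so it is not edited by accident.
    raw_path.chmod(stat.S_IREAD)
    # copy2 also copies permission bits, so make sure the copy stays writable.
    working_path.chmod(stat.S_IREAD | stat.S_IWRITE)
    return working_path
```

Comparing `sha256_of(raw_path)` before and after cleaning is one simple way to confirm that all changes went to the working copy only.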

Frequently Asked Questions

When your data cannot be traced back to individuals (i.e., anonymized data), you can use any established statistical package as the management tool for your primary study source data. However, make sure that the entire process is well documented and that every manipulation of the data is recorded in libraries of syntax files.

The origin of all your stored data should be documented and verifiable. In addition, the (selection) processes that led to the data being stored should be verifiable and repeatable. This means:

  • Document every process that extracts data from your infrastructure for analysis purposes ('syntax files'), including the software used and its version;
  • Maintain libraries with definitions of derived data ('formulae', 'recodes');
  • Ensure uniformity across multiple users and sub-projects using the same primary data source.
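
A minimal sketch of how the first two points could look in practice, assuming extractions are run from Python scripts. The names `DERIVED`, `bmi`, and `extraction_record`, and the example variables, are hypothetical. The idea is that every user imports the same definitions of derived data and that each extraction writes a small record of what was extracted and with which software version.

```python
import json
import platform
from datetime import datetime, timezone


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index, rounded to one decimal place."""
    return round(weight_kg / height_m ** 2, 1)


# Shared library of derived-variable definitions (hypothetical example).
# Keeping the formula as text next to the function documents the recode.
DERIVED = {
    "bmi": {"formula": "weight_kg / height_m ** 2", "function": bmi},
}


def extraction_record(script, source, variables, derived):
    """Describe one extraction so that it can be verified and repeated."""
    return {
        "script": script,
        "source": source,
        "variables": list(variables),
        "derived": {name: DERIVED[name]["formula"] for name in derived},
        "python_version": platform.python_version(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


record = extraction_record(
    script="extract_bmi.py",          # hypothetical syntax file
    source="primary_study_data.csv",  # hypothetical primary data source
    variables=["id", "weight_kg", "height_m"],
    derived=["bmi"],
)
# The record is plain JSON, so it can be archived next to the extraction.
record_json = json.dumps(record, indent=2)
```

Because all sub-projects import the same `DERIVED` library, a derived variable such as BMI is computed the same way by every user.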

Do not provide more information in a data extraction than you need for a particular analysis.
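
One way to apply this rule is to name the required variables explicitly in each extraction, so that nothing else leaves the primary data source. The sketch below assumes the source is a CSV file with a header row. `extract_columns` is a hypothetical helper, not a function of any statistical package.

```python
import csv


def extract_columns(source, needed):
    """Return only the needed columns from a CSV file object, as a list of dicts."""
    reader = csv.DictReader(source)
    available = reader.fieldnames or []
    missing = [name for name in needed if name not in available]
    if missing:
        # Fail loudly instead of silently extracting something else.
        raise KeyError(f"columns not in source: {missing}")
    # Every other column (for example names or addresses) is left behind.
    return [{name: row[name] for name in needed} for row in reader]
```

For example, calling `extract_columns(handle, ["id", "score"])` on a file that also holds names and ages returns only the `id` and `score` values.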

Text in preparation