Implementing a data management infrastructure

An adequate data management infrastructure can help you work more flexibly, easily and quickly. It can also simplify version control and collaboration with research partners. Designing your data management infrastructure is part of your data management plan, so it should preferably take place before you start collecting data.

Your data management infrastructure must allow for:

  • the collection, storage, and analysis of your data; this is often called a 'database';
  • sufficient data protection measures (discussed in chapter 'Protecting your data');
  • accurate management and logging of data access (discussed in chapter 'Protecting your data' and chapter 'Giving access to your data'; a minimal logging sketch follows this list);
  • storage of metadata, process flow description, data provenance description, data extraction documentation, and data modification logs (described further in chapter 'Capturing metadata');
  • support for data interpretation.
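As an illustration of the access-logging requirement, here is a minimal sketch in Python; the log location and function names are assumptions for illustration, and a real infrastructure would write to tamper-evident, centrally managed storage rather than a local file:

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = "access_log.jsonl"  # hypothetical log location

def log_access(user_id: str, record_id: str, action: str) -> None:
    """Append one audit entry per data access: who, what, when."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "record": record_id,
        "action": action,
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def read_record(user_id: str, record_id: str) -> dict:
    """Fetch a record, logging the access first; retrieval is stubbed here."""
    log_access(user_id, record_id, "read")
    return {"id": record_id}  # placeholder for the actual database call
```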

Frequently Asked Questions

What is a data management infrastructure?

Selecting a 'data management infrastructure' concerns much more than simply selecting a particular database system or storage software. We use the word 'infrastructure' to denote the entire set of principles, process flows, design, architecture, implementation, user interfaces, and user (support) applications. Together these provide an environment to collect, process, monitor, improve, and extract data for scientific research purposes, subject to general principles such as the 'Code Goed Gedrag' (for observational studies) and the 'WMO' (e.g., for intervention studies), current privacy regulations, and current practices such as GCP and GRP.

Which data management infrastructure should I use?

There is no general answer to this question. However, it is usually not possible for a single researcher to set up and maintain a compliant, secure data management infrastructure by him- or herself. That is why every university offers infrastructural solutions for data management. Furthermore, the NFU facilitates collaboration between the UMCs (Data4Lifesciences). Experts at your UMC can help you choose an appropriate data management infrastructure for your study.

As soon as (in)direct identification of human study subjects is possible, you should use a professional data management system. The system and its environment should preferably be NEN 7510 (2011) or ISO 27001 certified. However, this is not always practically achievable yet. In the absence of such certification, you must at least be able to demonstrate that the essential underlying goals of NEN 7510 (i.e., protection, accountability, privacy, documentation, risk assessment, quality management) are met. Ask your UMC's experts for lists of requirements.

Note that for clinical trials, registering the reason for any modification made to a data element after initial data entry is mandatory. Your system should provide a means of storing this information, for example as sketched below.
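One possible shape for such an audit trail is sketched below in Python; the field names are illustrative assumptions, not a reference to any specific clinical data management system:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Modification:
    """One change to a data element after initial entry."""
    element: str    # e.g., "systolic_bp" (hypothetical element name)
    old_value: str
    new_value: str
    reason: str     # mandatory for clinical trials
    changed_by: str
    changed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def modify(trail: list, element: str, old: str, new: str,
           reason: str, user: str) -> None:
    """Append a modification, refusing any change without a stated reason."""
    if not reason.strip():
        raise ValueError("A reason is mandatory for every modification.")
    trail.append(Modification(element, old, new, reason, user))
```

Because the trail is append-only, every earlier value remains reconstructable.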

May I use a Word document or an Excel file as my database?

A Word document or an Excel file is not an appropriate structure in which to store your primary source data. The scientific quality of the data cannot be ensured in these formats, due to the unstructured nature of word-processing documents and the lack of data integrity protection in spreadsheets. Privacy protection is also problematic, because access to such simple structures is hard to restrict and audit. A spreadsheet can, however, be extremely useful to monitor, display, and analyse your primary source data.
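For example, the primary data can stay in a database while a read-only extract feeds the spreadsheet. A minimal sketch in Python, assuming a local SQLite file study.db with a measurements table (both names hypothetical):

```python
import sqlite3
import pandas as pd

# Primary source data stays in the database; the spreadsheet-style
# extract below is derived and disposable.
with sqlite3.connect("study.db") as conn:
    df = pd.read_sql_query("SELECT * FROM measurements", conn)

# Write a flat file for monitoring, display, and analysis only.
df.to_csv("monitoring_extract.csv", index=False)
```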

How should I name and organise my files?

Once you start creating and processing data, it can easily become disorganised. Naming and organising your files in a clever way can help you save time and prevent errors. The best time to think about this is at the start of your project.

A naming convention will help you:

  • provide consistency, which will make it easier to find and correctly identify your files;
  • prevent version control problems when working on files collaboratively;
  • prevent (human) errors.

By organising your files carefully, you will help yourself and your colleagues find what you need when you need it. This will save you both time and frustration, and it will prevent duplication and errors.
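One way to keep a convention consistent is to generate file names from their parts rather than typing them by hand. The sketch below encodes a hypothetical project_dataset_date_version convention; adapt the fields to your own data management plan:

```python
from datetime import date

def make_filename(project: str, dataset: str, version: int,
                  ext: str = "csv") -> str:
    """Build a name like 'projx_bloodpressure_2024-05-01_v02.csv'.

    ISO dates sort chronologically and zero-padded version numbers
    sort numerically in an ordinary alphabetical file listing.
    """
    stamp = date.today().isoformat()
    return f"{project}_{dataset}_{stamp}_v{version:02d}.{ext}"
```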

May I delete intermediate files?

You can consider deleting intermediate files produced during data processing, in order to save storage space and to reduce the risk of inadvertent privacy violations. You can also exclude them from a backup scheme, to save time on a possible restore after hardware failure. However, for traceability it may be useful to keep intermediate data.
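If you do delete intermediates, a small script makes the decision explicit and repeatable. A minimal sketch, assuming derived files live in a dedicated intermediate/ directory and primary source data is stored elsewhere:

```python
from pathlib import Path

def delete_intermediates(root: str = "intermediate") -> None:
    """Remove derived files under 'root'; primary data is never touched."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            path.unlink()
```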

What should I consider when collecting data for future reuse?

In order to allow others to reuse your data in the future, you should use a standardised (harmonised) protocol for data collection. This ensures that follow-up studies will have a homogeneous data set.

You will probably also need to record parameters that seem irrelevant to your own study; a sketch of such a record follows the list. For instance:

  • geographical area of data collection;
  • instruments used;
  • demographics;
  • time between collecting samples and performing measurements.
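Recording these parameters in a fixed structure rather than in ad hoc notes keeps them homogeneous across studies. A minimal sketch of such a record, with field names chosen for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CollectionContext:
    """Context that future reusers of the data will need."""
    region: str            # geographical area of data collection
    instrument: str        # instrument used
    demographics: dict     # e.g., {"age": 54, "sex": "F"}
    sampled_at: datetime   # when the sample was collected
    measured_at: datetime  # when the measurement was performed

    @property
    def sample_to_measurement(self) -> timedelta:
        """Time between collecting the sample and performing the measurement."""
        return self.measured_at - self.sampled_at
```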

Remember that data interpretation depends crucially on knowledge of the data collection process as well as methodological knowledge. Of course, the amount of documentation required varies between studies.

What should I document?

You need to properly document all steps of your research (a structured example follows this list):

  • agreements and decisions;
  • experimental notes (e.g., in lab notebooks/journals);
  • the operational workflow (i.e., you should document the origin of all stored data and make this verifiable, as well as the (selection) processes that led to the actual stored data);
  • all research data (and any changes to it);
  • metadata;
  • a reference to the data collection in a metadata catalogue;
  • the IT standards and codebook used;
  • the conditions for data sharing.
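Much of this documentation can be captured as structured metadata stored alongside the data. The sketch below writes a minimal provenance record as JSON; the field names mirror the list above and are illustrative, not a formal standard:

```python
import json
from datetime import datetime, timezone

provenance = {
    "dataset": "example_dataset",            # hypothetical name
    "origin": "site A, CRF export",          # where the stored data came from
    "selection": "records with a complete baseline visit",
    "standards": ["ISO 8601 dates", "codebook v1.2"],
    "sharing_conditions": "on request, after approval",
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

with open("provenance.json", "w", encoding="utf-8") as f:
    json.dump(provenance, f, indent=2)
```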

Note that for clinical trials, monitoring of data is mandatory as part of Good Clinical Practice (GCP). The method of monitoring should be specified in the trial protocol.

Text in preparation