Choosing file formats

Whereas databases are common in clinical research, data in the molecular life sciences is mostly stored in files. Ensuring that your data is FAIR requires care in selecting file formats. For instance, it is important to consider how your data can be accessed ten years from now: will software still exist that can read the information? You should think about such issues before you start collecting data.

The NFU recommends selecting data formats that are:

  • open (i.e., formats that anyone can always implement, so not '.doc' and '.xls' or instrument-specific data formats);
  • well-documented (i.e., rigorous, like 'xml' with a schema description, and not open to multiple interpretations, like '.csv' without a schema description);
  • flexible (i.e., self-describing formats which can adapt to future needs without breaking old data);
  • frequently used (i.e., for which conversion tools will be created and maintained if necessary).
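The difference between a '.csv' file with and without a schema description can be illustrated with a small sketch. The Python code below is only an illustration under stated assumptions, not an NFU recommendation: the column names, the types and the `SCHEMA` layout are hypothetical, and in practice the schema description would be stored in a separate file next to the data.

```python
import csv
import io
from datetime import date

# Hypothetical schema description: column name -> (parser, unit or note).
SCHEMA = {
    "subject_id": (str, "study-specific identifier"),
    "date_of_birth": (date.fromisoformat, "ISO 8601 date"),
    "body_weight": (float, "kg"),
}

def validate_csv(text):
    """Return a list of (row_number, column, problem) tuples for a CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    # Columns required by the schema but absent from the header line.
    missing = [c for c in SCHEMA if c not in (reader.fieldnames or [])]
    for column in missing:
        problems.append((1, column, "column missing from header"))
    # Data rows start on line 2, directly after the header.
    for row_number, row in enumerate(reader, start=2):
        for column, (parse, _note) in SCHEMA.items():
            if column in missing:
                continue
            try:
                parse(row[column])
            except (TypeError, ValueError):
                problems.append((row_number, column, f"cannot parse {row[column]!r}"))
    return problems

EXAMPLE = (
    "subject_id,date_of_birth,body_weight\n"
    "S001,1980-05-17,72.5\n"
    "S002,17/05/1980,seventy\n"
)
print(validate_csv(EXAMPLE))
```

Without the schema, a reader of the second data row cannot tell whether '17/05/1980' is a valid date or which unit the weight uses; with it, both problems are reported automatically.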

Frequently Asked Questions

Phenotypic data is typically collected and stored in relational databases, either as part of the healthcare process or in dedicated scientific research systems. When setting up studies in research database systems, data standards need to be considered at different levels:

'Form' standards

These standards describe which data elements 'logically' belong together and therefore can be grouped on one form. Examples of standard forms are:

  • demographics (e.g., date of birth and gender);
  • adverse events (e.g., date of disease onset; type and grade of disease).

CDASH from CDISC and NINDS are examples of standards that describe the basic recommended data collection fields for several domains. These standards are not (yet) widely adopted for investigator-driven trials.

Terminology standards define the meaning of the data elements and provide codes that allow these data elements to be processed by computers, for example to pool data from different collections and studies.

These standards may define and code concepts (e.g., the concept body weight), or define and code value lists or response options for a particular concept (e.g., option 'female' for concept 'gender').

Several collections of terminologies are available, SNOMED CT and LOINC being the most frequently used for phenotype (clinical) data. Mapping data elements to terminology concepts requires considerable expertise and effort. It is therefore recommended to reuse mapped data elements from existing studies in the same domain whenever possible (assuming these studies have properly mapped their study concepts to the terminologies). The standard terminologies tend to offer support for multiple languages, although Dutch is implemented only to a limited extent. Language is therefore still a relevant consideration at the start of your study.
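To make the distinction between coding a concept and coding its response options more concrete, the sketch below attaches codes to two data elements and to one value list. It is a minimal illustration in Python under stated assumptions: the code systems are named after SNOMED CT and LOINC, but every code value shown is a placeholder rather than a real identifier, and the `CodedConcept` structure and the local field names are hypothetical. Real codes should be looked up in the terminology itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CodedConcept:
    system: str   # terminology the code comes from
    code: str     # placeholder here, not a real identifier
    display: str  # human-readable label

# Concept-level coding: what each local data element means.
DATA_ELEMENTS = {
    "weight": CodedConcept("LOINC", "PLACEHOLDER-1", "body weight"),
    "gender": CodedConcept("SNOMED CT", "PLACEHOLDER-2", "gender"),
}

# Value-level coding: the response options for one concept.
VALUE_LISTS = {
    "gender": {
        "f": CodedConcept("SNOMED CT", "PLACEHOLDER-3", "female"),
        "m": CodedConcept("SNOMED CT", "PLACEHOLDER-4", "male"),
    },
}

def code_record(record):
    """Translate a local record into (concept, value) pairs using the mappings."""
    coded = []
    for name, value in record.items():
        concept = DATA_ELEMENTS[name]  # a KeyError signals an unmapped element
        options = VALUE_LISTS.get(name)
        # Coded value lists replace the local value; other values pass through.
        coded_value = options[value] if options else value
        coded.append((concept, coded_value))
    return coded

for concept, value in code_record({"weight": 72.5, "gender": "f"}):
    print(concept.display, "->", value)
```

Two studies that use different local names or value abbreviations can be pooled once both are translated to the same coded concepts in this way.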

Nictiz has a subsite dedicated to terminologies and classifications.

For more complex studies we advise you to consult an expert in data mapping.

  • Structured human readable tabular formats are preferred over more computer friendly formats like XML.
  • Use minutes from BioMedBridges workshop.
  • Use of identifiers and pitfalls (such as versions etc.).

Overview of classes: Tabular formats; Hierarchical formats XML JSON, RDF, binary?

Table 1: Recommended file formats for data types that are frequently used in biomedical studies

Data type: Phenotypic or clinical (sharing and exchanging)

Commonly used formats for storage and sharing:

  • CSV for the addition of smaller or less commonly used data sets;
  • more formal data exchange mechanisms from clinical care, such as HL7, for larger data sets;
  • CDISC ODM format for data exports from (and between) clinical research systems.

Further considerations and notes:

  • Since relational databases are impractical to share, data sharing between clinical research systems has its own set of file or message standards.
  • Formal data exchange interfaces such as HL7 tend to require a considerable implementation effort.
  • CDISC ODM is an XML format that is nowadays supported by most CRF systems. It can be used for sharing clinical data between partners, as well as for export to further data integration and analysis systems.

Data type: Genomics

Commonly used formats for storage and sharing:

  • BAM (compressed files) for raw sequencing data;
  • VCF for variants in comparison with a reference genome.

Further considerations and notes:

  • For raw data, alternatives with higher compression are being developed, such as CRAM. Since these formats are still under development, they should be used with caution if long-term preservation is required.
  • It is essential to make sure the reference genome is preserved along with the VCF.

Data type: Proteomics

Commonly used formats for storage and sharing:

  • mzML for the measured data;
  • mzQuantML for peptide and protein quantification data;
  • mzIdentML for peptide and protein identification data.

Further considerations and notes:

  • These formats have the disadvantage that they are much larger in volume than (machine-specific) binary formats. Processing pipelines therefore often operate directly on the binary files. The binary formats are, however, not suitable for long-term data preservation.

Data type: Metabolomics

Commonly used formats for storage and sharing: -

Further considerations and notes:

  • Standards are under development by the metabolomics standards initiative.

Data type: Imaging

Commonly used formats for storage and sharing:

  • DICOM.

Further considerations and notes:

  • DICOM is the standard format for medical images. Medical imaging equipment manufacturers use the DICOM format to distribute images (just as digital camera manufacturers distribute images in JPEG format). DICOM files contain the images along with details about the patient, the scan that generated the image and the characteristics of the image itself.
  • For research on images it is advisable to also store the 'raw image files' (i.e., the images before they are digitally processed for diagnostic purposes). This allows the images to be re-processed with other (future or improved) methods, which is important because different vendors use different techniques that produce different 'after processing' views of the same raw image files.
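The genomics note in Table 1, that the reference genome must be preserved along with the VCF, can be supported by a simple check at archiving time. The sketch below is an illustration in Python, not a prescribed procedure: it relies on the `##reference=` meta-information line that a VCF header may contain, and the file path in the example data is hypothetical.

```python
import io

def vcf_reference(lines):
    """Return the value of the '##reference=' meta-information line, or None."""
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines come first; stop at '#CHROM' or data
        if line.startswith("##reference="):
            return line.rstrip("\n").split("=", 1)[1]
    return None

# Small in-memory example; the reference path is hypothetical.
EXAMPLE_VCF = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "##reference=file:///data/reference/example_genome.fasta\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\t.\tA\tG\t50\tPASS\t.\n"
)

reference = vcf_reference(EXAMPLE_VCF)
if reference is None:
    print("No reference recorded in the header; document it separately.")
else:
    print("Archive this reference with the VCF:", reference)
```

The function accepts any iterable of text lines, so an open file handle can be passed instead of the in-memory example; a compressed VCF would first have to be opened in text mode with a suitable decompressor.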

  • experts within your UMC;
  • experts outside your UMC;
  • online help.
  • Biosharing.org works to map the landscape of community-developed standards in the life sciences.
  • The data management guide of the University of Cambridge includes a table with common image formats: what to use when.