Newsletter articles: Data4lifesciences news #2

November 18, 2016

In this second newsletter of the UMCs' programme Data4lifesciences, we would like to highlight the work package "Access to data and sample collections", coordinated by Dr Morris Swertz. Also in this newsletter: more information about the successful establishment of a data analysis pipeline infrastructure connecting high performance computing clusters, Health-RI on the KNAW ‘Agenda for Large-scale Research Facilities’, and the upcoming conference on personalized health and medicine on December 1 in Amersfoort.

Access to data and sample collections: reaping what we have sown

With around 200 biobanks, the Netherlands is a frontrunner in collecting large data and sample sets from population and patient cohorts. “Unfortunately, the actual use of this valuable material lags behind,” says Dr Morris Swertz. He is the leader of the Data4lifesciences work package 'Access to data and sample collections', which aims to make it easier for researchers to use biomaterials and data collected by others.

Swertz: “This Data4lifesciences project is coordinated by the Dutch biobanking infrastructure BBMRI-NL2.0. Our aim is to build a clever catalogue that assists researchers in locating and obtaining biomaterials and data collected by others.” Swertz is IT Lead at BBMRI-NL2.0 and associate professor ‘Big data in biomedicine’ at University Medical Center Groningen. “Complex biomedical research questions often call for large groups of study subjects to reach sufficient statistical power. And rare disease cases are often few and far between. Researchers will have to combine samples and data from multiple biobanks to obtain adequate sample sizes.”



With millions of samples stored in the Dutch biobanks, obtaining a sufficient number of samples seems feasible. In addition, patients and the general public expect researchers to use the available materials to improve healthcare, for instance by developing personalised medicine solutions. However, the relevant samples are typically scattered over dozens of laboratories, each with its own specialised database. Swertz: “A PhD student may spend half of her PhD project trying to locate the samples and integrate the data items between them, which is quite sad.”


Finding biobanks

“In 2010, we discovered that poor findability of biobanks was the first major obstacle to sample use. So we set out to develop a catalogue of Dutch biobanks. David van Enckevort from my research group in Groningen has been leading this work from the start.” The first version of the catalogue only contained basic information, such as the biobank’s name and topic (e.g., disease-specific or general population), contact details of the coordinator, type of biomaterials, and type of additional data available (e.g., demographic data, questionnaire data). “This catalogue was simple, but it transformed the Dutch biobanks from hidden gems into findable research resources. And it formed the stepping stone to build something smarter.”


Finding individual samples

Swertz continues: “Soon, several biobanks wanted to take the catalogue to the next level by adding information about individual samples. It is easy to see how this would make a researcher’s life easier. For instance, let us assume that a researcher wants to study the relationship between gene expression profiles and the outcome of chemotherapy for a specific type of cancer in females above 60 years old. The first version of the biobank catalogue would only reveal which Dutch biobanks contain expression data and tumour samples of the right type. However, it would save the researcher a lot of time if the catalogue would reveal how many tumour samples of females above 60 with known chemotherapy outcome a biobank could provide.”
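The kind of sample-level query described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sample` record, its field names, the biobank names and the example data are all hypothetical and do not reflect the actual catalogue's data model or software. The sketch returns counts per biobank only, not the individual records.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample-level catalogue record (field names are assumptions)."""
    biobank: str
    sex: str
    age: int
    tumour_type: str
    chemo_outcome_known: bool
    has_expression_data: bool


def count_matching_samples(samples, tumour_type, min_age):
    """Count, per biobank, the samples from females older than min_age
    with the given tumour type, known chemotherapy outcome and expression data."""
    counts = Counter()
    for s in samples:
        if (
            s.sex == "female"
            and s.age > min_age
            and s.tumour_type == tumour_type
            and s.chemo_outcome_known
            and s.has_expression_data
        ):
            counts[s.biobank] += 1
    return dict(counts)


# Made-up example data, for illustration only.
SAMPLES = [
    Sample("Biobank A", "female", 67, "breast", True, True),
    Sample("Biobank A", "female", 72, "breast", True, True),
    Sample("Biobank A", "female", 55, "breast", True, True),   # too young
    Sample("Biobank B", "female", 64, "breast", False, True),  # outcome unknown
    Sample("Biobank B", "female", 81, "breast", True, True),
    Sample("Biobank B", "male", 70, "breast", True, True),     # not female
]

if __name__ == "__main__":
    print(count_matching_samples(SAMPLES, tumour_type="breast", min_age=60))
```

With counts like these, a researcher could see at a glance which biobanks are worth approaching before submitting a formal request.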

The national pathology archive ‘PALGA’ was one of the first biobanks for which the team developed such an improved catalogue. “We built the PALGA catalogue on an open source software system called ‘MOLGENIS’, developed by six programmers. MOLGENIS provides the flexibility to quickly change the catalogue’s structure. The PALGA public catalogue enables researchers to assess whether they can answer their research question with the material in the Dutch pathology archives, while protecting the privacy of patients. In collaboration with PALGA, Erik van Iperen is now further automating the request workflow and assisting in the logistics, as part of the virtual portal for sample requests ‘Dutch National Tissue Portal’. PALGA is already receiving two to three times as many requests as before the launch of this sophisticated catalogue, so it seems to deliver on its promise to enhance sample reuse.”


Finding and harmonising data items

The original catalogue has also been expanded to enable searching for available data items. This was first done for the LifeLines biobank. The team recently developed a new tool called ‘MOLGENIS/connect’, a semi-automatic system to find, match and pool data from different biobanks. “Each biobank contains different data items and to make matters worse, they name and describe similar items differently. MOLGENIS/connect can automatically transform free text descriptions into consistent ‘ontology’ codes, speeding up the search process. In addition, it can identify related data items. For instance, one biobank may have asked its participants how many packs of cigarettes they smoke per week, whereas another may have recorded the number of cigarettes per day. The research community can use this tool to quickly obtain an overview of what data is being collected and how it is coded. Ideally, this would result in a standardised method of data collection and coding in the future.”
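To make the smoking example concrete, here is a minimal Python sketch of how two differently recorded data items could be converted to one common unit before pooling. It is not how MOLGENIS/connect works internally. The item names, the record layout and the figure of 20 cigarettes per pack are assumptions made purely for illustration.

```python
# Assumption: a pack holds 20 cigarettes. Real pack sizes vary by country and brand.
CIGARETTES_PER_PACK = 20
DAYS_PER_WEEK = 7


def packs_per_week_to_cigarettes_per_day(packs_per_week,
                                         cigarettes_per_pack=CIGARETTES_PER_PACK):
    """Convert 'packs of cigarettes per week' to 'cigarettes per day'."""
    return packs_per_week * cigarettes_per_pack / DAYS_PER_WEEK


# Hypothetical mapping from a biobank's data item name to a conversion
# into the common target unit (cigarettes per day).
CONVERTERS = {
    "packs_per_week": packs_per_week_to_cigarettes_per_day,
    "cigarettes_per_day": float,  # already in the target unit
}


def pool_smoking_data(records):
    """Harmonise (biobank, item_name, value) records to cigarettes per day."""
    pooled = []
    for biobank, item_name, value in records:
        convert = CONVERTERS[item_name]
        pooled.append((biobank, convert(value)))
    return pooled


if __name__ == "__main__":
    records = [
        ("Biobank A", "packs_per_week", 3.5),
        ("Biobank B", "cigarettes_per_day", 10),
    ]
    for biobank, value in pool_smoking_data(records):
        print(f"{biobank}: {value:.1f} cigarettes per day")
```

In practice, the hard part is recognising that two differently named items measure the same thing. The unit conversion shown here is the simple final step once that match has been made.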


Firm roots

With the incorporation of the project in Data4lifesciences, the Dutch UMCs have firmly rooted the clever catalogue in their research infrastructure. “Data4lifesciences is great. While much software development has been conducted by BBMRI-NL, data updates are a major issue. By involving employees of the UMCs, Data4lifesciences can load detailed descriptions of all UMC data and material collections to increase findability and reuse. In addition, uptake by the NFU ensures that the data and system will be maintained beyond project funding. This is a very important step forward.”


Going global

“Our next goal is to fill the system with broad and deep (meta)data from as many biobanks as possible, including at least all BBMRI and UMC biobanks, and CTMM-TraIT studies. In addition, we will further develop the MOLGENIS-based catalogue system that is now also used in many other European countries. And we will help develop the international ‘Minimum Information About BIobank data Sharing’ (MIABIS) standard to cover additional research domains such as imaging and rare disease studies. So it should be feasible to make all European samples findable and reusable. Eventually, we want to connect biobanks worldwide,” concludes Swertz.


Project Facts

Coordination: BBMRI-NL - national catalogue and request workflow working group

Contributions from: BBMRI-ERIC common services for IT, DTL Health care data, CTMM-TraIT, CORBEL, RD-Connect, LifeLines, PALGA, Parelsnoer, Radboud Biobank, and E-Rare.