The modern age’s passion for collecting has a name: Big Data. Scientists hope to gain new insights from the analysis of large amounts of data, but this requires certain rules and the right infrastructure. Two examples from Forschungszentrum Jülich show how this treasure trove of data can be unearthed.
People love collecting. For some, collecting and storing are even part of their profession. Scientists are among these. Not only do they accumulate knowledge, but they also produce and record vast amounts of data and information from which knowledge has still to emerge.
Modern technology facilitates the collection of data immensely and tempts people to hoard. Experts estimate that 90 per cent of the data available worldwide was collected in the past two years alone. In science, ever more precise experimental equipment, measurement systems and computer simulations are generating ever larger amounts of data. At CERN, the European Organization for Nuclear Research, experiments generate around 50 petabytes of data per year; the Jülich Supercomputing Centre (JSC) generates around 20 petabytes of data per year in simulations alone. Experts speak of big data – data that cannot be evaluated using previous manual and conventional methods.
Big data is far more than a passion for collecting. It offers the chance to reveal new relationships and patterns in large amounts of data that would not be noticeable in small samples. However, “big” in itself does not provide new findings. Ordering, filtering, evaluating, but also sharing and exchanging data – these are the great challenges in turning big data into smart data, that is, into data from which meaningful information can be obtained. These challenges place new demands on IT infrastructure, data handling and scientific collaboration. Two examples from Jülich show what big data means in everyday research.
Dr. Andreas Petzold and his colleagues from the Jülich Institute of Energy and Climate Research (IEK-8) are old hands in the big data business. “In the IAGOS project, our measuring instruments have been travelling around the world in commercial airliners for over 20 years. They take a measurement every four seconds during flights, for example, to record greenhouse gases such as carbon dioxide and methane, other reactive trace gases such as ozone, carbon monoxide and nitrogen oxides, but also particulate matter, ice and cloud particles. During that time, more than 1.4 billion measuring points on 320 million flight kilometres have been gathered.” It goes without saying that such amounts of data can no longer be evaluated using a calculator. The scientists had to develop tailor-made software packages, for example for calibrating the devices and for transferring and evaluating the data.
Big data everywhere
Today, large amounts of data accumulate in all Jülich departments. These include the classically data-intensive disciplines such as nuclear physics and climate research. However, electron microscopy, structural biology and automated image analysis in plant research also generate mountains of data that can no longer be analysed using conventional methods.
The climate data serve to identify long-term trends such as air pollution or greenhouse gases. Global data and findings like these are of interest to researchers worldwide. When it comes to the climate, it is particularly important to combine data from different sources – also because numerous factors are intertwined here: soils, plants, animals, microorganisms, bodies of water, the atmosphere and everything that humans do.
Making data comparable
Until now, such data has too often been collected in isolation and also turned into models separately. This is supposed to change. In early 2019, several large-scale infrastructure projects were launched on a European level with the aim not only of securing the individual data treasures in a well-structured, long-term manner, but also of making them comparable.
When researchers from different disciplines, institutions and countries pile up their mountains of data into even larger mountains of data, one thing is needed first: common standards. So far, these are too rare. Such rules are to range from survey methods in the field and quality assurance of measurements to the verifiability of the data. These standards already exist within projects: “In the IAGOS project, for example, we provide each measuring point with a whole series of metadata. These are effectively the keywords for every measurement: what, when, how and where, temperature, flight number and measuring device. This also enables external or subsequent researchers to understand what we measured, and how and where we measured it,” emphasises Petzold.
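The idea of attaching metadata to every measuring point can be sketched as a small data structure. The following Python example is only an illustration: the class, the field names, the completeness check and all values are hypothetical and are not taken from the IAGOS software. They simply mirror the keywords named in the quote (what, when, how, where, temperature, flight number, measuring device).

```python
from dataclasses import dataclass, asdict, replace
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class MeasuringPoint:
    """One measurement plus the metadata needed to interpret it later.

    All field names are hypothetical; they only mirror the keywords
    listed in the text.
    """
    quantity: str          # what was measured
    value: float           # the measured value
    unit: str              # unit of the value
    timestamp: datetime    # when it was measured
    method: str            # how it was measured
    latitude_deg: float    # where it was measured
    longitude_deg: float
    altitude_m: float
    temperature_k: float   # ambient temperature
    flight_number: str
    device_id: str         # measuring device


# Text fields that must not be empty for a record to be understandable.
REQUIRED_TEXT_FIELDS = ("quantity", "unit", "method", "flight_number", "device_id")


def missing_metadata(point: MeasuringPoint) -> List[str]:
    """Return the names of required text fields that were left empty."""
    record = asdict(point)
    return [name for name in REQUIRED_TEXT_FIELDS if not record[name].strip()]


# Made-up example values, purely for illustration.
example_point = MeasuringPoint(
    quantity="ozone", value=41.5, unit="ppb",
    timestamp=datetime(2019, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    method="UV absorption", latitude_deg=50.9, longitude_deg=6.4,
    altitude_m=10500.0, temperature_k=220.0,
    flight_number="XX0000", device_id="device-001",
)

# A copy of the record in which the measuring device was not documented.
incomplete_point = replace(example_point, device_id="")
```

Calling `missing_metadata(example_point)` yields an empty list, while `missing_metadata(incomplete_point)` reports the undocumented device. A shared standard would fix such field names and checks across projects rather than within a single one.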
Cross-project standards are now needed. This is exactly what ENVRI-FAIR, the European infrastructure project for environmental sciences, wants to introduce. ENVRI stands for Environmental Research Infrastructures because all established European infrastructures for earth system research are involved in the project – from local measuring stations and mobile devices such as IAGOS to satellite-based systems. FAIR describes the requirements as to how researchers are to collect and store the vast amounts of data in the future: findable, accessible, interoperable and reusable.
Petzold is coordinating this mammoth project, which will receive € 19 million in EU funding for four years. “ENVRI-FAIR will enable us to link and relate different data to each other – the basis for turning our big data into smart data that can be used for research, innovation and society,” he says. As with all other European infrastructure projects, open access via the European Open Science Cloud, which is currently being set up, is planned so that as many researchers as possible will be able to access the data troves.
In order to realise such ambitious plans, the experts need the support of IT specialists – for the upcoming expansion of IT infrastructures, for example, and for data management and computer centres. At Forschungszentrum Jülich, the Jülich Supercomputing Centre (JSC) is available as a partner with extensive expertise: among other things, it offers two supercomputers, suitable computing methods, enormous storage capacities of several hundred petabytes and around 200 experts on a wide variety of topics. The JSC supports ENVRI-FAIR, for example, in setting up an automated management system for the large data streams. One of the main topics in this context is data access. Today, in international projects with many cooperation partners, it is more and more important to ensure that large datasets – and the conclusions drawn from them – can be examined and verified by all participating research groups.
For this purpose, new computer architectures that can handle and evaluate big data particularly well, such as JUWELS and DEEP, are being developed at Jülich (see box “JUWELS and DEEP”). In order to improve the exchange between high-performance computing specialists and expert scientists, the JSC has also set up simulation laboratories in which the various experts work closely together. They support researchers in the general handling of big data and in evaluations – also with the help of machine learning.
“The experts for machine learning and the specialists for high-performance computers know how large amounts of data can be used for learning scientific models on supercomputers. Domain specialists such as biologists, physicians or materials scientists can in turn formulate meaningful questions about their specific data so that learning evolves in the direction relevant for the solution of a given problem. In such cooperation, adaptive models – such as deep neural networks – can be trained with the available data to predict processes in the atmosphere, in biological systems, in materials or in a fusion reactor,” explains Dr. Jenia Jitsev, an expert for deep learning and machine learning at the JSC.
One of the Jülich researchers working closely with the JSC is Dr. Timo Dickscheid, head of the Big Data Analytics working group at the Jülich Institute of Neuroscience and Medicine (INM-1). His institute also generates an enormous amount of data because it is concerned with the most complex human structure: the brain. “We are developing a three-dimensional model that takes both structural and functional organisational principles into account,” says the computer scientist.
He has already worked on BigBrain, a 3D model assembled from microscopic images of tissue sections of the human brain. In over 1,000 working hours, 7,404 ultra-thin sections were prepared and digitised by the Jülich brain researchers together with a Canadian research team.
Surfing through the brain
“This 3D brain model is about one terabyte in size,” says Dickscheid, “so it is already a challenge to display the image dataset smoothly on the screen – not to mention the complex image analysis methods that automatically process this dataset on the Jülich supercomputers to gradually generate precise three-dimensional maps of the different brain areas.” With these data sizes, it is no longer feasible for scientists to map these areas completely by hand. Dickscheid and his colleagues spent three years programming intensively and exchanging ideas with the JSC.
The result: despite the large database, the program makes it possible to navigate smoothly through the brain and zoom to the level of cell clusters. The trick is: “We don’t provide users with the entire dataset in full resolution, but only the small part that they are currently looking at,” explains Dickscheid. “And this is in real time,” he adds. The BigBrain model and the 3D maps are a prime example of shared big data. They can now be clicked on, rotated, zoomed in and marvelled at by anyone on the Internet.
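The principle of delivering only the part of the dataset that is currently on screen can be sketched with a tiled image pyramid, a common technique for very large images. The following Python example is a minimal sketch under stated assumptions and is not the actual BigBrain viewer: the function name, the tile size of 256 pixels, the halving of resolution per zoom level and the restriction to a single 2D section are all hypothetical.

```python
from typing import List, Tuple

TILE_SIZE = 256  # assumed tile edge length in pixels


def visible_tiles(
    view_x: int, view_y: int, view_w: int, view_h: int,
    level: int, full_w: int, full_h: int,
) -> List[Tuple[int, int, int]]:
    """Return (level, column, row) keys of the tiles overlapping the viewport.

    Level 0 is the full resolution; each higher level halves the resolution.
    The viewport is given in pixel coordinates of the chosen level.
    """
    scale = 2 ** level
    # Image size at this level, rounded up (ceiling division).
    level_w = -(-full_w // scale)
    level_h = -(-full_h // scale)

    # Clamp the viewport to the image bounds.
    x0, y0 = max(0, view_x), max(0, view_y)
    x1 = min(level_w, view_x + view_w)
    y1 = min(level_h, view_y + view_h)
    if x0 >= x1 or y0 >= y1:
        return []  # viewport lies completely outside the image

    first_col, last_col = x0 // TILE_SIZE, (x1 - 1) // TILE_SIZE
    first_row, last_row = y0 // TILE_SIZE, (y1 - 1) // TILE_SIZE
    return [
        (level, col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# Hypothetical image of 100,000 x 80,000 pixels viewed through a
# 1920 x 1080 window at full resolution.
tiles = visible_tiles(1000, 500, 1920, 1080, level=0,
                      full_w=100_000, full_h=80_000)
```

In this made-up case only a few dozen tiles have to be transferred instead of the more than one hundred thousand that make up the full-resolution image. Zooming out would switch to a coarser level, so the amount of data per view stays roughly constant.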
Scientists from all over the world are making use of it: the three-dimensional representation enables them to assess spatial relationships in the complicated architecture of the human brain much better than before – and gain new insights. Dutch scientists, for example, hope to use the atlas to better understand the human visual cortex at the cellular level and use this knowledge to refine neuroimplants for blind people.
JUWELS and DEEP
JUWELS is a highly flexible and modular supercomputer whose adaptable computer design was developed at Jülich and is aimed at an extended range of tasks – from big-data applications to computationally complex simulations. The abbreviation stands for “Jülich Wizard for European Leadership Science”. Jülich researchers are also developing new modular supercomputer architectures within the European DEEP projects, which can be used for scientific applications even more flexibly and efficiently than previous systems.
“Making results such as our different brain maps accessible to all is a cornerstone of science,” says Professor Katrin Amunts, Director at the Institute of Neuroscience and Medicine and Dickscheid’s supervisor. Making the underlying data publicly available, however, calls for a paradigm shift in research: “Publications of scientific studies currently play a much more important role than publications of data. Within the research community, we must agree that the authors of the data should be named and cited on an equal footing with the authors of a scientific publication. Here, too, FAIR Data is a very central point – data should be findable, accessible, interoperable and reusable – an approach that the Human Brain Project is actively promoting,” emphasises Amunts, as publications are the currency with which research is traded and careers are made.
Sharing as an opportunity
Astrophysicists are regarded as a shining example. “Here, it has historically evolved that data are shared impartially,” says Dr. Ari Asmi from the University of Helsinki, who is a colleague of Andreas Petzold and co-coordinator of ENVRI-FAIR. The most recent example is the sensational photo of the black hole. “This was only possible because, firstly, the global scientific community in radio astronomy is extremely closely networked and, secondly, a radio telescope may only be used if the data obtained with it are subsequently disclosed.”
From Asmi’s point of view, sharing big data is a great opportunity for research: “It will be really exciting if we manage to use the new methods to combine data from different disciplines, such as our environmental calculations with data from the social, political and economic sciences. If this succeeds, we will be able to obtain viable models to, for example, understand climate change in its entirety and to draw up action plans for the future.” And then collecting will become not only big, but really smart.