Events

November 5

10:30

Haus Ungarn, Karl-Liebknecht-Straße 9, 10178 Berlin

HIDA Datathon

    *UPDATE: We’ve postponed the HIDA Datathon, originally planned for April 2-3, 2020. It will now take place on November 5-6, 2020.*

    Data scientists, save our world!

    Are you a data scientist or a data scientist in training who wants to fight climate change? We’re looking for you! Be part of the first HIDA Berlin Datathon on November 5-6, 2020 and help us fight climate change using real-world data.

    The Helmholtz Association, with its 19 research centers, offers unique capabilities to research our climate and find solutions to save it, from regional climate models to an Earth observation network that provides a bird’s-eye view of the long-term effects of global change on various environmental systems. Become part of this extensive effort at the HIDA Datathon and get to know our scientists and the important work they do.

    Work on climate data together with our experts and find solutions to climate challenges

    Choose from interesting challenges that Helmholtz scientists need your support on. Work on well-prepared datasets from satellites, weather stations and the deep sea. Get creative: employ your analytics, scientific computation, machine learning and visualization skills and come to surprising solutions together with a team of data scientists from research centers, universities and companies. Apart from great food, entertainment and a location in the heart of Berlin for everyone, our partners also have exciting and climate-friendly prizes waiting for the winners.

      Our six challenges

      Map local city climate from satellite data automatically 

      Change is made on the local level. Contribute by developing a machine learning algorithm that creates reliable maps for urban climatologists. 

      The bigger problem: By 2050, Berlin summers could be as hot as in Canberra, Australia. Pankow, a district in the city’s north, declared a climate emergency in 2019 and is planning ahead: it is planting Mediterranean trees that withstand the heat and has run computer simulations of sunshine and cold-air corridors for the construction of 1,200 new apartments. A few changes, like replacing heat-storing asphalt and concrete with greenery that soaks up water and provides shade, can make a difference on the local scale. And many such local changes add up to a difference on the bigger scale.

      The data science problem: To understand local climate in cities, and to make it comparable, scientists have developed the Local Climate Zone classification scheme. It differentiates between 17 zones, based mainly on surface structure (e.g. building and tree height and density) and surface cover (green, pervious soils versus impervious grey surfaces). There are algorithms that calculate these maps from freely available satellite imagery, but there is still room for improvement by adapting or developing suitable, advanced Convolutional Neural Network (CNN) architectures that generalize well. Here’s where you come in: in this challenge, Xiaoxiang Zhu and her team from the German Aerospace Center ask you to come up with machine learning models that fully automatically create reliable Local Climate Zone maps from satellite imagery as provided by the Sentinel missions.

      The data: The So2Sat LCZ42 dataset is publicly available for download (here) under an open-access CC-BY license and consists of three analysis-ready parts: training, validation and test data. The first two parts come with manually created, high-quality labels, while the third one is for predictions and has no labels.


      Helpful skills to solve this challenge:  Data analytics, Machine learning
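      For orientation, the sketch below shows the general shape of such a model: a small convolutional network that maps one multispectral image patch to one of the 17 Local Climate Zone classes. It is a minimal, hypothetical starting point, not the team’s method; the 32x32 patch size, the 10 Sentinel-2 bands and the tensor layout are assumptions to check against the actual So2Sat LCZ42 documentation.

```python
import torch
import torch.nn as nn

class LCZNet(nn.Module):
    """Tiny CNN that classifies multispectral patches into 17 LCZ classes (sketch only)."""
    def __init__(self, in_channels=10, num_classes=17):   # assumed: 10 Sentinel-2 bands
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 16x16 -> 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # global average pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Example forward pass with a random batch of four 10-band 32x32 patches.
model = LCZNet()
logits = model(torch.randn(4, 10, 32, 32))   # -> shape (4, 17)
```

      From here, the actual work lies in training on the labelled patches, choosing an architecture that generalizes across cities, and evaluating on the unlabelled test split.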

      Long hot summers, or long hot decades? Also a classic bias-variance trade-off question

      Tomorrow's world will be warmer, but the structure of climate variability means that some decades will be warmer than others. Just how hot those hot decades will be is the important question this challenge aims to help answer.

      The bigger problem: Ships cross the oceans along long, straight routes, measuring sea surface temperature (SST) as they go. As a result, there is comprehensive SST data along a gigantic virtual grid that spans all oceans at different points in time. Given the large amount of data and its granularity, the question is how closely you need to look to spot larger trends. Averages help, but how far apart in time and space should you pool data together? Andrew Dolman and Thomas Laepple’s group at the Alfred Wegener Institute (AWI) are trying to find the best method to answer this question, and you can help them by participating in this challenge.

      The data science challenge: In this challenge, Andrew and his team are challenging you to find the best method to average and come up with the most insightful global map of the power-spectrum of SST. Power-spectra show how variance, for instance in SST, is distributed across timescales; for example, how much of the total variation in SST is due to differences between years and how much due to differences between decades. 

      There are (at least) two problems that make this challenge tricky. First, estimates of power-spectra from time-series are noisy. However, points close together in space are likely to have similar power spectra and nearby time-scales are also likely to have similar power/variance. Thus, some spatial averaging, and perhaps averages across frequencies, to reduce some of the noise may help to produce more insightful maps of SST power spectra. The second problem is that the further you go back in time, the less SST data there is.  

      The data: Work with time-series of sea-surface-temperature (SST) for grid boxes on the global ocean. These data originally come from temperature measurements on research vessels and have been compiled by the Hadley Centre. Andrew and his team have estimated the power spectrum from the SST time-series for each grid box.

      Helpful skills to solve this challenge: Data analytics, Machine learning, Data visualization, Geographic Information Systems, Spatial Statistics, R, Python 3.x 

      Prize for the winning team: The winning team will be acknowledged (e.g. as co-authors) in resulting publications and is invited to the AWI in Potsdam to present their solution.
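      As a rough illustration of what "power spectrum per grid box, then averaging" can look like, here is a minimal Python sketch using scipy. It is not the AWI group’s workflow: the monthly resolution, the gap-free synthetic data and the simple 3x3 block average are all assumptions for demonstration.

```python
import numpy as np
from scipy.signal import periodogram

def box_spectrum(sst_series, fs=12.0):
    """Power spectrum of one de-meaned SST series (fs=12 samples/year -> cycles per year)."""
    freqs, power = periodogram(sst_series - sst_series.mean(), fs=fs)
    return freqs, power

# Synthetic stand-in: 50 years of monthly SST anomalies for a 3x3 block of grid boxes.
rng = np.random.default_rng(0)
sst = rng.normal(size=(600, 3, 3))

# One noisy spectrum estimate per grid box ...
freqs, _ = box_spectrum(sst[:, 0, 0])
spectra = np.array([[box_spectrum(sst[:, i, j])[1] for j in range(3)] for i in range(3)])

# ... then a simple spatial average over the block to reduce estimator noise.
smoothed = spectra.mean(axis=(0, 1))
```

      The open questions of the challenge start exactly here: how large the averaging regions should be, whether to also smooth across frequencies, and how to handle the sparser data further back in time.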

      Show the flow of fish larvae in a warming ocean

      Help explore human-caused changes at the beginning of the food chain that have large potential repercussions for the ocean.

      The bigger problem: In the future, when fish larvae hatch in the Bay of Biscay on the Atlantic coast of France, they might grow up faster than previous generations and travel in the opposite direction. As a result, birds and other fish further up the food chain may wait in vain for their prey at old hunting grounds. Warming oceans trigger a cascade of changes around the globe: warmer temperatures may change the ocean dynamics that carry organisms like fish larvae, and they also affect how these organisms themselves grow and swim.

      To understand these changing ocean dynamics, Willi Rath and the Ocean Dynamics team at Geomar perform numerical "experiments" with regional to global ocean models that simulate small-scale to global-scale ocean currents and reproduce the observed oceanic variability over the past half century. In this challenge, the Geomar team asks you to show what the changing ocean dynamics, and the fish larvae they carry, look like.

      The data science challenge: The challenge is to create an interactive visualization of 15 million trajectories of simulated fish larvae. This includes visualizing large amounts of unstructured data, including the larvae’s positions as well as other factors like ambient ocean temperature. Explore the best possible ways of presenting the data in a way that leverages state-of-the-art parallel I/O and analysis stacks.

      The data: Work with approximately 15 million trajectories, each with 1,000 time steps and 8 variables (position, temperature, etc.) sampled along the way. The trajectory data is stored in the Zarr format.


      Helpful skills to solve this challenge: Data analytics, Data visualization
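      One possible starting point, sketched below, is to rasterise the trajectory positions with datashader so that only an image, not millions of lines, ever reaches the screen. The Zarr path and the variable names ("lon", "lat") are hypothetical, and for the full dataset you would read the store lazily (e.g. with dask) and aggregate in chunks rather than flattening everything into memory.

```python
import xarray as xr
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Hypothetical store and variable names; adapt to the actual trajectory layout.
traj = xr.open_zarr("larvae_trajectories.zarr")

# Flatten (trajectory, obs) positions into one long table.
# For the full 15M x 1000 dataset, do this chunk-wise with dask instead.
df = pd.DataFrame({
    "lon": traj["lon"].values.ravel(),
    "lat": traj["lat"].values.ravel(),
})

canvas = ds.Canvas(plot_width=900, plot_height=450)
agg = canvas.points(df, "lon", "lat", ds.count())  # point density per pixel
img = tf.shade(agg, how="log")                     # log colour scale for heavy tails
img.to_pil().save("larvae_density.png")
```

      Interactivity (zooming, filtering by temperature or release site) would come from wrapping such an aggregation in a dashboard tool rather than from drawing the raw trajectories.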

      Finding water with cosmic rays  

      Car-borne cosmic-ray neutron detectors can measure root-zone soil moisture across the country without even touching the ground. This image segmentation challenge can help to significantly improve the data quality by considering land use and road-type effects.

      The bigger problem: Weather and climate models are vitally important for science and society to predict floods, droughts, heatwaves and agricultural yield. Unfortunately, all those models are highly uncertain about the water content in the soil, although soil moisture is a dominant factor in the hydrological cycle: it controls evaporation, infiltration, root uptake, and runoff. Hence, real observations of soil moisture at the regional scale are desperately needed to support those models. While point measurements represent only a few meters, and satellites cannot see below the surface, cosmic rays are here to help! 

      Every minute, cosmic radiation from outer space hits the Earth’s atmosphere and partly turns into neutrons. These particles are almost unstoppable: they can penetrate the soil down to one meter and are reflected back into the atmosphere. Only hydrogen can efficiently stop and absorb neutrons, due to their similar masses, so we find more neutrons over dry soil and fewer neutrons over wet soil. NASA makes use of this principle to detect water on Mars by counting reflected neutrons with satellites.

      Meanwhile, Martin Schrön and his team from the Helmholtz Centre for Environmental Research (UFZ) drive cosmic-ray neutron detectors in cars and trains across the country to detect soil moisture on Earth. However, dry roads and hydrogen-rich vegetation do influence the signal significantly, so they also take pictures every 20 seconds along the way to provide additional information, for instance whether they’re travelling on a paved road or passing a dense forest. This data can also be used to validate remote-sensing products that aim to detect vegetation status and land use. Since the measurement data should be processed automatically to provide near-real-time information for weather and climate models, the efficient interpretation of the drive-by photography is a challenge of a higher level. And this is where you come in.

      The data science challenge: Martin and his team challenge you to develop a self-improving computer model that reliably detects landscape features in a set of images they’ve collected on their rides through Germany. Your image segmentation model should be able to identify objects in front of the car that are closer than about 50 m, such as bare soil, rocks, vegetation types including fields, forests, grassland, or swamps, different types of roads (paved, unpaved, narrow or wide) and buildings. For each image, the output should identify feature classes, their size (in polygon shape), their approximate distance to the car and on which road side they were recorded. It should also provide uncertainty for these values. 

      The data: 5000 photographic images in JPG format, 1000 of which are labelled for training. 

      Helpful skills to solve this challenge: Data analytics, machine learning, cluster analysis, computational models, high-performance computing, image processing 
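      As a hedged illustration of a possible starting point, the sketch below runs an off-the-shelf pretrained semantic segmentation model from torchvision on a single drive-by photo. The file name is hypothetical and the pretrained classes are generic; a real solution would fine-tune a model on the 1,000 labelled images with the project’s own feature classes and then derive polygon sizes, distances, road side and uncertainties from the masks.

```python
import torch
from torchvision.io import read_image
from torchvision.models.segmentation import deeplabv3_resnet50
from torchvision.transforms.functional import convert_image_dtype, normalize

# Pretrained model as a generic starting point (torchvision >= 0.13 assumed).
model = deeplabv3_resnet50(weights="DEFAULT").eval()

# Hypothetical file name for one of the 5000 drive-by photos.
img = convert_image_dtype(read_image("frame_0001.jpg"), torch.float)    # (3, H, W)
img = normalize(img, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

with torch.no_grad():
    out = model(img.unsqueeze(0))["out"]   # (1, num_classes, H, W) logits

# Per-pixel class index; polygons, distances and uncertainties are later steps.
mask = out.argmax(dim=1).squeeze(0)
```

      Estimating the distance to objects and the road side from a single camera is the harder part and would need additional assumptions or models beyond this sketch.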

      Spot the mistake in ~50 million data points, cleverly

      Sensors measure climate data around the clock, but they’re not foolproof. In this challenge, you help scientists find those mistakes by developing an automated troubleshooting method for big data.

      The bigger problem: Four years ago, torrential rainfall caused a mudslide in the small town of Altenroda, south of the Harz mountains in Saxony-Anhalt, Germany. Up to 100 litres of rain fell within a very short time, causing the slope above a riverbed to break away and shoot through the town as a mudslide. Property was damaged, but thankfully no one was hurt.

      An hour’s drive north, scientists from the Umweltforschungszentrum Leipzig are studying the hydrological processes that lead to such extreme weather events, and how climate change may influence the interplay of atmosphere, soil and water. The team around Corinna Rebmann has equipped a 10,000 square meter patch of forest with sensors to measure a multitude of parameters, among them soil moisture, air temperature and precipitation. But sensor batteries run out and other mistakes happen, and the scientists need to more or less manually weed through huge amounts of data to find them. In this challenge, you can help Lennart Schmidt find a clever method for Corinna to automatically flag suspicious and bad data points in the measurements at Hohes Holz and, depending on just how clever your solution is, at many other observational sites.

      The data science challenge: The challenge is to develop a machine-learning model that performs automatic quality control of soil moisture and climate data from Hohes Holz. Quality control means assigning quality flags, i.e. "good", "suspicious" or "bad", to each of the data points and identifying erroneous values. The model should be able to reproduce the verdicts of the experts who have been flagging the data manually over the past few years.

      Generally, there are two ways to approach this problem. As manual flags are available, it could be treated as a supervised multi-class classification task: you could use the manual flags to train a model to reproduce exactly these. However, we require an unsupervised classification/anomaly detection model for two reasons. First, this allows us to use the algorithm at sites other than Hohes Holz that haven’t been manually flagged. Second, this offers the possibility of future retraining to adapt to changing climatic conditions. Adding to this, three aspects make this classical machine-learning problem interesting: 1) All data needs to be included at once, i.e. multiple sensors need to be flagged simultaneously based on their spatial and temporal correlation. 2) As the sensors are located at different depths, the model should be able to map time-lags between them. 3) The model has to be robust against missing values, e.g. if one of the sensors completely stops sending signals.

      The data: 5-year-long time series of soil moisture, soil temperature and battery voltage from 240 sensors at 40 locations at the TERENO site "Hohes Holz", with approximately 40 million data points, plus 5-year-long time series of climatic conditions at the site, i.e. air temperature, precipitation and wind velocity, with approximately 10 million data points. This data is open and free to use after the challenge.

      Helpful skills to solve this challenge: Data analytics, Machine learning, Data visualization, High-performance computing.


      Prize for the winning team: The winning team will be acknowledged (e.g. as co-authors) in possible resulting publications.
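      To make the unsupervised route concrete, here is a minimal sketch using scikit-learn’s IsolationForest on synthetic stand-in data. The column names, time resolution and contamination rate are assumptions; a real solution would handle all sensors jointly, respect the depth-dependent time-lags and cope with missing values far more carefully.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for a Hohes Holz export: soil moisture at two depths plus
# battery voltage at 10-minute resolution (all names hypothetical).
idx = pd.date_range("2016-01-01", periods=5000, freq="10min")
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sm_10cm": 30 + rng.normal(0, 1, len(idx)).cumsum() * 0.01,
    "sm_30cm": 32 + rng.normal(0, 1, len(idx)).cumsum() * 0.01,
    "battery_v": 12.5 + rng.normal(0, 0.05, len(idx)),
}, index=idx)

# Add lagged copies so the model sees short-term temporal context
# (a crude nod to point 2, the time-lags between sensor depths).
features = pd.concat([df, df.shift(1).add_suffix("_lag1")], axis=1).dropna()

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(features)            # -1 = anomalous, 1 = normal
features["flag"] = np.where(labels == -1, "suspicious", "good")
```

      The manual flags would then serve as a benchmark: a good unsupervised model should agree with the experts on the Hohes Holz data while remaining usable at unflagged sites.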

      Developing reliable forecasts for drought 

      Work with simulations spanning several thousand years to predict drought for the following winter season.

      The bigger problem: During the severe summer drought of 2017, cereal production yields in Italy and Spain dropped to their lowest levels in 20 years, and almonds and olives shriveled away on their branches. The previous winter had been drier than usual; water reservoirs weren’t as full and didn’t last through the warmer months.

      This example showcases why correctly predicting winter precipitation is important for a region that, according to the Intergovernmental Panel on Climate Change (IPCC), is likely to be stricken by drought more often in the future. Help Eduardo Zorita and his team from the Helmholtz Zentrum Geesthacht develop a model that looks at spring and summer and reliably predicts rain and snow for fall and winter.

      The data science challenge: The challenge is to develop a prediction method for precipitation from November through March for Southern Europe, based on data from April to October. Attack this problem in the virtual world of simulations with Earth System Models, which provide comprehensive datasets spanning several thousand years and ample opportunity to train machine learning methods. Identify large-scale climate patterns of sea-surface temperature, sea-ice and snow cover, and soil moisture through the summer and autumn seasons, and predict the accumulated winter precipitation as simulated by the model. Use your preferred machine learning approach to develop predictions and test them against the simulated values.

      The data: Sea-surface temperatures, atmospheric circulation, sea-ice cover and snow cover from very long simulations (typically 1,000 years) with Earth System Models, covering the seasons from April to October of each year. The data are freely available from the data repository of the IPCC and are stored in NetCDF format. Some familiarity with handling this format (e.g. reading it into numpy arrays) is desirable but not a prerequisite.


      Helpful skills to solve this challenge: Data analytics, Machine learning, Data visualization, some interest in climate dynamics.
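      A minimal sketch of one possible baseline is shown below: summer (April to October) SST fields as predictors and accumulated winter (November to March) precipitation over a Southern Europe box as the target, fitted with ridge regression. The file name, the variable names ("sst", "pr"), the region box and the naive year alignment across the New Year boundary are all assumptions to replace with the actual simulation output.

```python
import xarray as xr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical single file with a datetime "time" axis and variables "sst", "pr".
ds = xr.open_dataset("esm_simulation.nc")
month = ds.time.dt.month

# Predictors: April-October mean SST field, flattened to one vector per year.
summer = ds["sst"].where((month >= 4) & (month <= 10), drop=True)
X = summer.groupby("time.year").mean("time").stack(cell=("lat", "lon")).values

# Target: accumulated November-March precipitation over a Southern Europe box.
# NOTE: grouping by calendar year is a crude alignment across the New Year
# boundary; a real solution should pair each summer with the following winter.
winter = ds["pr"].where((month >= 11) | (month <= 3), drop=True)
winter = winter.sel(lat=slice(35, 46), lon=slice(-10, 25))
y = winter.groupby("time.year").sum("time").mean(("lat", "lon")).values

# Ridge regression baseline with a simple hold-out evaluation.
n = min(len(X), len(y))
X_tr, X_te, y_tr, y_te = train_test_split(X[:n], y[:n], test_size=0.2, random_state=0)
print("R^2 on held-out years:", Ridge(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te))
```

      From this kind of baseline, the interesting work is in finding the large-scale predictor patterns (SST, sea ice, snow cover, soil moisture) that actually carry predictive skill.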

      HIDA Datathon details

      When? November 5-6, 2020

      Where? Berlin, Haus Ungarn, Karl-Liebknecht-Straße 9, 10178 Berlin

      Who? Data scientists and data scientists in training with working knowledge in one or more of the following or other relevant skills: data analytics, scientific computation, machine learning, data visualization

      Please find our preliminary agenda here.

      Do you have questions or anything else you would like to share with us in advance? We’re happy to hear from you. Please write us a note to: hida@helmholtz.de

      Registration is closed due to the high number of registrations. Click here for the waiting list!

      https://bit.ly/34gHGYV

      We are aware that some of you are prohibited from travelling for business due to COVID-19. If you’re signed up and cannot take part in the Datathon for this or for any other reason, please let us know as soon as possible at hida@helmholtz.de. Thank you!


        Thank you to the Jülich Supercomputing Centre and the KIT Steinbuch Centre for Computing for supporting us with HPC resources.


        The following hotels offer accommodation at discounted rates:

        download