Spot the mistake in ~50 million data points, cleverly
Sensors measure climate data around the clock, but they’re not foolproof. In this challenge, you help scientists find mistakes by developing an automated troubleshooting method from big data.
The bigger problem: Four years ago, torrential rain caused a mudslide in the small town of Altenroda, south of the Harz mountains in Saxony-Anhalt, Germany. Up to 100 litres of rain fell within a very short time, causing the slope above a riverbed to break away and shoot through the town as a mudslide. Property was damaged, but thankfully no one was hurt.
An hour’s drive north, scientists from the Umweltforschungszentrum Leipzig are studying the hydrological processes that lead to such extreme weather events, and how climate change may influence the interplay of atmosphere, soil and water. The scientists have equipped a 10,000 square meter patch of forest with sensors that measure a multitude of parameters, among them soil moisture, air temperature and precipitation. But sensor batteries run out and other mistakes happen, and Lennart Schmidt’s team has to weed through huge amounts of data more or less manually to find them. In this challenge, you can help Lennart find a clever method to automatically flag suspicious and bad data points in the measurements at Hohes Holz and, depending on just how clever your solution is, at many other observational sites.
The data science challenge: The challenge is to develop a machine-learning model that performs automatic quality control of soil moisture and climate data from Hohes Holz. Quality control means identifying erroneous values by assigning a quality flag, i.e. "good", "suspicious" or "bad", to each data point. The model should be able to reproduce the verdicts of the experts who have been flagging the data manually over the past few years.
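One way to check how well a model reproduces the experts' verdicts is to compare the two sets of flags class by class. The following is only a sketch: the function name agreement_report and the choice of per-class precision and recall are assumptions for illustration, not an evaluation scheme prescribed by the challenge.

```python
from collections import Counter

# The three quality flags named in the challenge description.
FLAGS = ("good", "suspicious", "bad")


def agreement_report(expert_flags, model_flags):
    """Compare model flags with expert flags, one entry per data point.

    Hypothetical helper: returns, for each flag, the recall (share of the
    expert's flags the model reproduced) and the precision (share of the
    model's flags the expert agrees with). A value is None when the flag
    never occurs on the relevant side.
    """
    if len(expert_flags) != len(model_flags):
        raise ValueError("both flag sequences must have the same length")

    # Count every (expert, model) combination once.
    pairs = Counter(zip(expert_flags, model_flags))

    report = {}
    for flag in FLAGS:
        hits = pairs[(flag, flag)]
        n_expert = sum(c for (e, _), c in pairs.items() if e == flag)
        n_model = sum(c for (_, m), c in pairs.items() if m == flag)
        report[flag] = {
            "recall": hits / n_expert if n_expert else None,
            "precision": hits / n_model if n_model else None,
        }
    return report


expert = ["good", "good", "bad", "suspicious", "good", "bad"]
model = ["good", "good", "bad", "good", "good", "suspicious"]
report = agreement_report(expert, model)
```

Per-class figures are used here because erroneous points are presumably much rarer than good ones, so a single overall accuracy could look high even if most bad values were missed.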
Generally, there are two ways to approach this problem. As manual flags are available, it could be treated as a supervised multi-class classification task: you could use the manual flags to train a model to reproduce exactly these. However, we require an unsupervised classification/anomaly detection model, for two reasons. First, this allows us to use the algorithm at sites other than Hohes Holz that haven’t been manually flagged. Second, it offers the possibility of future retraining to adapt to changing climatic conditions. In addition, three aspects make this classical machine-learning problem interesting: 1) All data needs to be included at once, i.e. multiple sensors need to be flagged simultaneously based on their spatial and temporal correlation. 2) As the sensors are located at different depths, the model should be able to map time-lags between them. 3) The model has to be robust against missing values, e.g. if one of the sensors completely stops sending signals.
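To make the unsupervised idea concrete, here is a minimal baseline sketch, not the expected solution. It flags all sensors at one time step together by comparing each reading with the median of its neighbours, and it tolerates missing values. Everything specific in it is an assumption: the function name flag_timestep, the thresholds, the min_scale floor, and the premise that the sensors passed in are directly comparable (for example, installed at the same depth). It does not model time-lags between depths.

```python
from statistics import median


def flag_timestep(readings, suspicious_z=3.5, bad_z=7.0, min_scale=1e-3):
    """Flag the readings of several comparable sensors at one time step.

    readings: dict mapping sensor id -> value, or None if the value is missing.
    Returns a dict mapping sensor id -> "good", "suspicious", "bad",
    or None for a missing value. Thresholds are illustrative assumptions.
    """
    present = [v for v in readings.values() if v is not None]

    # With fewer than three readings there is nothing to compare against,
    # so the remaining values cannot be verified.
    if len(present) < 3:
        return {s: None if v is None else "suspicious"
                for s, v in readings.items()}

    center = median(present)
    # Median absolute deviation, scaled to match the standard deviation
    # of normally distributed data; floored to avoid division by zero.
    mad = median([abs(v - center) for v in present])
    scale = max(1.4826 * mad, min_scale)

    flags = {}
    for sensor, value in readings.items():
        if value is None:
            flags[sensor] = None  # missing value: nothing to flag
            continue
        z = abs(value - center) / scale
        if z >= bad_z:
            flags[sensor] = "bad"
        elif z >= suspicious_z:
            flags[sensor] = "suspicious"
        else:
            flags[sensor] = "good"
    return flags


# Made-up soil moisture values; "s6" has stopped sending data.
step = {"s1": 0.30, "s2": 0.31, "s3": 0.29, "s4": 0.32, "s5": 0.95, "s6": None}
flags = flag_timestep(step)
```

The median and the median absolute deviation are used instead of mean and standard deviation so that a single faulty sensor does not distort the reference it is compared against. No manual flags are needed, which is what would allow such a method to be moved to other sites.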
The data: A 5-year time series of soil moisture, soil temperature and battery voltage from 240 sensors at 40 locations at Hohes Holz, with approximately 40 million data points. A 5-year time series of climatic conditions at the site, i.e. air temperature, precipitation and wind velocity, with approximately 10 million data points. This data is open and free to use after the challenge.
Helpful skills to solve this challenge: Data analytics, Machine learning, Data visualization, High-performance computing.
Prize for the winning team: The winning team will be acknowledged (e.g. as co-authors) in possible resulting publications.