Welcome to the website of the JeDaSS poster @ESWC_2021!

Towards Scientific Data Synthesis Using Deep Learning and Semantic Web

We aim to develop a new approach that combines semantic web and machine learning techniques to:

- categorize a given dataset into a domain topic, and
- extract hidden links between its data attributes and the data attributes of other datasets.

Motivation

The Collaborative Research Center (CRC) AquaDiva is a large collaborative project spanning a variety of domains including biology, geology, chemistry, and computer science with the common goal to better understand the Earth’s critical zone.
Datasets collected within AquaDiva, like those of many other cross-institutional, cross-domain research projects, are complex and difficult to reuse since they are highly diverse and heterogeneous.
This limits dataset accessibility to the few people who were either involved in creating the datasets or have spent a significant amount of time trying to understand them.
Furthermore, working out the major theme of an unfamiliar dataset takes additional time.

We believe that dataset analysis and summarization can be used as an elegant way to provide a concise overview of an entire dataset.

Definitions

A dataset (DS) is defined as a tuple <PD, MD> of primary data (PD) and metadata (MD) organized for a specific purpose.
PD represents the actual data, organized according to a specific structure called the data structure (DT).
DT consists of a set of data attributes, i.e. DT = {da1, da2, …, dan}.
Each data attribute has a name, datatype, (optional) unit, description, as well as annotation based on a domain ontology.
Each tuple in the primary data is a collection of data cells containing data values (called data points).
MD contains information about, e.g., the data owner, the data curators, and the methodology used to produce the primary data.
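The definitions above can be read as a small data model. The following is a minimal sketch of that reading, not the project's actual implementation: the class names, field names, and the sample values (for example the owner string) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataAttribute:
    """One data attribute of the data structure (DT)."""
    name: str
    datatype: str
    description: str = ""
    unit: Optional[str] = None        # the unit is optional
    annotation: Optional[str] = None  # concept from a domain ontology


@dataclass
class Dataset:
    """A dataset DS = <PD, MD>."""
    structure: list[DataAttribute]   # DT = {da1, ..., dan}
    primary_data: list[tuple]        # PD: one tuple of data cells per row
    metadata: dict[str, str] = field(default_factory=dict)  # MD

    def data_points(self, attribute_name: str) -> list:
        """Return the data values stored under one data attribute."""
        index = [da.name for da in self.structure].index(attribute_name)
        return [row[index] for row in self.primary_data]


# Hypothetical sample values, for illustration only.
example = Dataset(
    structure=[
        DataAttribute("Airtemperaturemean", "decimal", unit="Celsius"),
        DataAttribute("site", "string"),
    ],
    primary_data=[(11.5, "A"), (12.0, "B")],
    metadata={"owner": "example owner"},
)
print(example.data_points("Airtemperaturemean"))
```

Keeping the data attributes separate from the tuples mirrors the split between DT and PD: the column descriptions are stored once, and the data points of any attribute can be read out by position.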

In our implementation, almost all data attributes of available datasets are annotated using the AquaDiva ontology (ADOn) as the domain-specific ontology.

Methodology

Developing a new data analysis and summarization approach that
- combines semantic web and machine learning approaches,
- semantically classifies the data attributes of scientific datasets (tabular data), and
- uses this classification to summarize individual datasets and to link them to others.

The proposed approach has two main phases:
- off-line: training and building a classification model using supervised deep learning with convolutional layers, and
- on-line: using the pre-trained model to classify datasets into the learned categories.

Data preparation

Proposing a new structure that captures several features of the dataset in a single container.
As data attributes are the most important parts of a dataset, gathering all information related to them.
Considering, for each data attribute, its name, datatype, unit, attached data points, and semantic annotation.
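One way to picture such a container is a list with one record per data attribute, each holding the descriptive fields plus the column of data points. The sketch below assumes that layout; the function name, the dictionary keys, and the sample annotation are hypothetical and not taken from the JeDaSS code.

```python
def build_attribute_container(attributes, rows):
    """Gather, per data attribute, its descriptive fields and its data points.

    attributes: list of dicts with the keys name, datatype, and optionally
                unit and annotation, in column order
    rows:       list of tuples holding one value per attribute
    """
    container = []
    for index, attribute in enumerate(attributes):
        container.append({
            "name": attribute["name"],
            "datatype": attribute["datatype"],
            "unit": attribute.get("unit"),
            "annotation": attribute.get("annotation"),
            # The column of data values attached to this attribute.
            "data_points": [row[index] for row in rows],
        })
    return container


# Hypothetical input, for illustration only.
attributes = [
    {"name": "Airtemperaturemean", "datatype": "decimal",
     "unit": "Celsius", "annotation": "air temperature"},
    {"name": "site", "datatype": "string"},
]
rows = [(11.5, "A"), (12.0, "B"), (12.4, "A")]
container = build_attribute_container(attributes, rows)
print(container[0]["data_points"])
```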

### Image generation

Proposing a method to transform the newly constructed structure into a number of images, where a set of images is generated for each data attribute.
As shown in the figure, the "Airtemperaturemean" data attribute from the "Weather and soil data monitoring" dataset is represented together with its annotation, data type (decimal), unit (Celsius), and 30 data points.
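The page does not specify how attribute information is encoded as pixels, so the following is only an illustrative sketch under stated assumptions: the data points are split into groups of 30 (matching the example above), each group is joined with the attribute's name, datatype, unit, and annotation into one string, and the UTF-8 bytes of that string fill a square grayscale grid. The byte-based encoding, the 32×32 grid size, the function names, and the sample annotation are all assumptions, not the JeDaSS method.

```python
def chunk(values, size=30):
    """Split a list of data points into consecutive groups of `size`."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def text_to_pixels(text, side=32):
    """Encode text as a side x side grid of grayscale values (0-255)."""
    data = text.encode("utf-8")[: side * side]   # truncate if too long
    data = data + bytes(side * side - len(data))  # pad with zeros
    return [list(data[r * side:(r + 1) * side]) for r in range(side)]


def attribute_to_images(name, datatype, unit, annotation, data_points,
                        points_per_image=30, side=32):
    """Produce one image per group of data points for a single attribute."""
    header = "|".join([name, datatype, unit or "", annotation or ""])
    images = []
    for part in chunk(data_points, points_per_image):
        body = ",".join(str(value) for value in part)
        images.append(text_to_pixels(header + "|" + body, side))
    return images


# Hypothetical attribute with 65 data points: yields groups of 30, 30, and 5.
images = attribute_to_images(
    "Airtemperaturemean", "decimal", "Celsius", "air temperature",
    list(range(65)),
)
print(len(images), len(images[0]), len(images[0][0]))
```

A fixed grid size keeps every image the same shape, which is what a convolutional classifier expects as input; any real encoding would need to be chosen so that similar attributes produce visually similar images.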

Classification

Currently using the ResNet18 convolutional neural network to build the classification model
Building and testing the proposed approach using datasets of the AquaDiva data portal.
- Using 114 datasets representing different domain topics, such as weather monitoring, groundwater hydrochemistry, gene abundance, or soil physical parameters
- 70% of these were used for training and 30% for evaluation
- The total number of data attributes is 1,300
- The number of data points within a dataset ranges from 300 (5 data attributes × 60 tuples) to 12,000,000
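The 70/30 split can be sketched as a seeded shuffle followed by a cut. This assumes the split is made at the dataset level and that the cut point is rounded to the nearest integer; the page does not state either detail, so the resulting counts are illustrative. The function name and the seed are hypothetical.

```python
import random


def split_datasets(dataset_ids, train_fraction=0.7, seed=0):
    """Shuffle dataset identifiers and split them into train and evaluation sets."""
    ids = list(dataset_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# 114 datasets, identified here simply by the integers 0..113.
train, evaluation = split_datasets(range(114))
print(len(train), len(evaluation))

# Smallest dataset size quoted above: 5 data attributes x 60 tuples.
print(5 * 60)
```

Under these assumptions, 114 datasets split into 80 for training and 34 for evaluation, and the smallest dataset holds 5 × 60 = 300 data points.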

## Preliminary results

Availability

The resources related to the development of the proposed approach can be found on GitHub.

People

Alsayed Algergawy, Heinz-Nixdorf Chair for Distributed Information Systems, University of Jena
Hamdi Hamed, Heinz-Nixdorf Chair for Distributed Information Systems, University of Jena
Birgitta König-Ries, Heinz-Nixdorf Chair for Distributed Information Systems, University of Jena

Acknowledgement

This work has been funded by the Deutsche Forschungsgemeinschaft (CRC AquaDiva, Project 218627073).
