Building and sharing Scientific Knowledge Graphs for Open Science

SciLake will develop, support, and offer customisable services to the research community following a two-tier service architecture.

TIER I

Offer a comprehensive, open, transparent, and customisable scientific data-lake-as-a-service, empowering and facilitating the creation, interlinking, and maintenance of SKGs both across and within different scientific disciplines.

TIER II

Build and offer a set of customisable, AI-assisted services that facilitate the navigation of scholarly content following a scientific merit-driven approach, focusing on two merit aspects which are crucial for the research community at large: impact and reproducibility.

Overall methodology

Co-design of services with use case partners

SciLake addresses the current and anticipated needs of researchers by elaborating and formalizing case studies early in the project. These case studies drive requirements elicitation, refinement of the designed KPIs, and testing, validation and evaluation. SciLake encompasses real-life case studies from four scientific domains (neuroscience, cancer, transport and energy research) to ensure that the developed services are designed, configured and customized to meet the needs and requirements of several large scientific communities, including cross-disciplinary research and collaboration. During the project, the use cases will be further elaborated, enriched and adjusted to account for new feedback from stakeholders.

Agile prototyping

SciLake will adopt an early/rapid prototyping methodology with short development cycles (release early/often), constant evaluation by domain experts, flexible experimentation with various ideas and directions, and increased collaborative activities. Design sketches and static mock-ups will be produced rapidly to test new concepts and receive critical feedback. Working prototypes with limited functionality (e.g., baseline algorithms, smaller data) will be rapidly developed to convey a more realistic impression of their operation. After feedback, the prototypes will be iteratively improved until the design is finalized and the underlying algorithms are configured, customized and optimized. Finally, all software and interfaces will be integrated early in their lifecycle to provide a coherent version of the entire platform.

Pilot testing and feedback elicitation

We will elicit and welcome continuous feedback from domain experts through two main channels:

Pilots conducted in the project: These will offer the opportunity to validate and demonstrate services in real scenarios that also involve researchers outside the Consortium, thus gathering feedback from a wider range of stakeholders and avoiding internal bias.
Publicly available online demos: These will solicit feedback from a larger number and range of external stakeholders through electronic means (feedback form, issue tracker, email), as well as during presentations and demonstrations at various events.

Rely on existing infrastructure

SciLake will leverage the expertise and technology capacities of the consortium partners, extending the capabilities and improving the technology readiness level (TRL) of existing software and resources. All SciLake components will start at a high TRL and are expected to reach TRL7 or higher by the end of the project. The following table summarises the most important components:

OpenAIRE Graph	The OpenAIRE Graph (ORG) is a service that populates and provides access to an SKG that includes metadata and links between scientific products (e.g. literature, datasets, software, "other research products"), organizations, funders, funding streams, projects, communities, and (provenance) data sources.
AvantGraph	AvantGraph is a next-generation knowledge graph analytics engine, building on state-of-the-art technologies in main-memory scale-up processing. Its innovations include optimization support for recursive analytics in industry-standard query languages (e.g., openCypher, SPARQL, GQL); vectorized and compiled execution; worst-case optimal join processing; factorized intermediate result processing; and temporal graph analysis.
R2PG-DM	R2PG-DM is the first direct mapping solution from the relational database model to the property graph data model, facilitating the modeling and representation of graph-structured data collections for sophisticated graph analytics.
gMark	A basic component of a mature framework for knowledge graph analytics is tooling for benchmarking and experimental design. gMark is a domain- and query language-independent platform targeting highly tunable generation of both graph instances and graph query workloads based on user-defined schemas.
BIP! Toolbox	BIP! Toolbox is a suite of advanced software components and resources that aims to assist the discovery of valuable publications by leveraging multi-perspective impact analysis of scientific publications.
SciNeM	SciNeM is an open source, publicly available, scalable analysis tool for metapath-based knowledge discovery in Heterogeneous Information Networks.
VocTagger	VocTagger is a Python library that indexes textual corpora in accordance with given controlled vocabularies, in a flexible and scalable manner.
SciNoBo	Fields of Study Classification System (FoS) is a suite of tools that automatically assigns publications to scientific fields of study. The service is publicly available via OpenAIRE.
RADS	The Research Artifacts Discovery System (RADS) is an automated Information Extraction (NLP/IE) system analysing scientific literature with the aim of spotting and isolating (descriptions of) research outputs (datasets, tools, software, protocols etc.). The system is smoothly integrated with the Fields of Science (FoS) classification and Citation Analysis systems.
ELG services	European Language Grid (ELG) is the primary platform for language technologies in Europe giving access to a multitude of related datasets and language processing services (among them machine translation services).
DSR	Document Structure Recogniser: a tool that parses and analyses scientific articles to identify and mark up their individual parts and sections. The structure annotations can be used for downstream processing applications and use cases (a corresponding scientific publication is currently in progress).
SciTagIT	A suite of domain-specific classifiers that automatically assign publications to categories, leveraging domain ontologies and classification schemes (e.g. ICD-11, the Glossary for Transport Statistics, UN SDGs).

Open Science practices

SciLake follows an “Open Science by design” implementation approach. SciLake’s outputs (publications, data, software, services, etc.) will be shared according to Open Access and FAIR principles, in line with the EOSC Interoperability Framework (OpenAIRE Guidelines for research products, EOSC-Enhance Service Description Templates for services).