Smart reproducibility assistance service
This service assesses the reproducibility of published research based on textual information provided by article authors. By analyzing data, metadata, and additional research objects associated with the publications, this service provides quick feedback to authors about the replication potential of their work.
The service has the following components:
Research object recognition in textual data
Using Natural Language Processing (NLP) and information extraction tools, this component will automatically identify and isolate research artifacts from a collection of publications. The focus is on the datasets and software mentioned in the publications: tools, services, applications, plugins, etc., but the list will grow in response to specific use cases. The service will automatically extract additional metadata, including the URL of the repository hosting the resource, the name of the development team/organization, acronyms, and licenses. There will be records linking research objects to publications. In addition, the service will further refine its NLP technologies to identify concepts relevant to the different use cases (e.g., specific diseases or gene names) from textual documents included in the lake, based on relevant vocabulary, taxonomies, and ontologies.
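As an illustration of the metadata-extraction step, the sketch below uses simple patterns to pull repository URLs and license names out of a passage. The function name, the patterns, and the license list are hypothetical; a production system would rely on trained NER and information-extraction models rather than regular expressions.

```python
import re

# Hypothetical pattern-based extractor (a stand-in for trained NLP models).
URL_PATTERN = re.compile(r"https?://\S+")
LICENSE_PATTERN = re.compile(r"\b(MIT|Apache-2\.0|GPL(?:v\d)?|BSD)\b")

def extract_artifact_metadata(text):
    """Return candidate repository URLs and license names found in a passage."""
    return {
        "urls": URL_PATTERN.findall(text),
        "licenses": LICENSE_PATTERN.findall(text),
    }

passage = ("The tool is available at https://github.com/example/tool "
           "under the MIT license.")
meta = extract_artifact_metadata(passage)
```

For the example passage, `meta` holds the repository URL and the detected license, which could then be attached to the record linking the research object to its publication.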
Research object link recommendation
This component consists of an algorithm that recommends links between different research objects (e.g., publications, datasets, software), encoded in a Scientific Knowledge Graph (SKG). Some links may exist but not be recorded in the collected data (e.g., some software or datasets may have been released after publication, so the paper does not reference them). Additionally, links that do not currently exist may be formed in the future (e.g., replicating the results of a paper using a different but similar dataset). This tool will identify and recommend such links automatically, relying on algorithms that consider both the content of the research objects and their topology and context in the knowledge graph. In the former case, multiple similarity measures will be considered and combined for different types of entities and attributes; in the latter, graph embeddings will be used to analyze heterogeneous networks.
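The combination of content-based and topology-based signals can be sketched as a weighted score over two similarity measures. The function names, the feature vectors, and the mixing weight `alpha` are illustrative assumptions; the actual service would combine several similarity measures per entity type and use learned graph embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def link_score(content_a, content_b, emb_a, emb_b, alpha=0.5):
    """Blend content similarity (e.g., text features) with similarity of
    graph embeddings that capture each object's topology and context."""
    return alpha * cosine(content_a, content_b) + (1 - alpha) * cosine(emb_a, emb_b)
```

A pair whose content vectors and embeddings both align would score near 1.0 and be recommended as a candidate link; `alpha` controls how much weight the content signal receives relative to the graph signal.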
Article segmentation for multi-lingual articles
This component enables the segmentation of scientific articles into their individual sections/chapters, allowing advanced information and knowledge extraction methodologies to identify the relevant information at the right place in a paper. For example, information about evaluation metrics is expected in the “Evaluation” section, though this section may have a different name. The tool will generalize over the wide range and variety of section-level titles that appear in the literature, and will process articles in several languages, with a focus on English.
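One simple way to picture the generalization over section titles is a mapping from observed (possibly non-English) heading variants to canonical section labels. The variant lists below are illustrative assumptions; the real component would learn such generalizations from multilingual corpora rather than enumerate them by hand.

```python
# Hypothetical canonical labels with a few heading variants each,
# including non-English examples for the multilingual setting.
SECTION_VARIANTS = {
    "evaluation": {"evaluation", "experiments", "results",
                   "empirical study", "evaluación", "auswertung"},
    "methods": {"methods", "methodology", "approach",
                "materials and methods"},
}

def canonical_section(title):
    """Map a raw section heading to a canonical section label."""
    t = title.strip().lower()
    for canonical, variants in SECTION_VARIANTS.items():
        if t in variants:
            return canonical
    return "other"
```

With such a mapping, a heading like “Experiments” resolves to the same canonical "evaluation" label as “Auswertung”, so downstream extractors can look for evaluation metrics in one well-defined place.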
Citation-context assisted replication assessment
This component will develop tools for the collection and analysis of citation statements, referred to as "citances": the text spans or sentences surrounding citations. Its aim is to differentiate and quantify the reusability of research artifacts, such as datasets, software, and methods, along the different aspects of reproducibility (i.e., generalization, robustness, replicability) according to Whitaker's matrix of reproducibility. The tool will cater to the multilingual nature of scientific content as well as to situations where existing approaches to citance extraction are ineffective (e.g., language-dependent methods) or perform poorly (e.g., due to a lack of training data). To accomplish this, it will incorporate machine translation (MT) into the content processing pipeline.
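A minimal sketch of citance collection is shown below: split a passage into sentences and keep those containing a citation marker. The sentence splitter and the two marker patterns (numeric brackets and author-year parentheticals) are simplifying assumptions; the citation "(Smith et al., 2020)" is a made-up example, and real pipelines would use more robust, language-aware extraction, backed by MT for non-English text.

```python
import re

# Hypothetical citation-marker patterns: "[12]" or "(Author et al., 2020)".
CITATION_MARKER = re.compile(r"\[\d+\]|\([A-Z][a-z]+ et al\., \d{4}\)")

def extract_citances(text):
    """Naively split text into sentences and keep those citing something."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if CITATION_MARKER.search(s)]

paragraph = ("We reuse the dataset of [3]. The results were strong. "
             "Our method follows (Smith et al., 2020).")
citances = extract_citances(paragraph)
```

Each extracted citance can then be classified against the dimensions of Whitaker's matrix (e.g., whether it reports a replication with the same or a different dataset and method).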