


SciLake Technical components

Domain-Specific Machine Translation for SciLake

By Stefania Amodeo

In a recent webinar for SciLake partners, Sokratis Sofianopoulos and Dimitris Roussis from Athena RC presented their cutting-edge Machine Translation system, which will be integrated into the Scientific Lake Service.

The presentation emphasized the importance of gathering domain-specific parallel data to train machine translation systems, focusing on the French-to-English, Spanish-to-English, and Portuguese-to-English language pairs. It also included a demonstration of several models developed so far.

This blog post recaps the key points from the webinar.

Challenges

Machine translation is the automated translation of text or speech from one language to another. The current state of the art relies on deep learning approaches such as sequence-to-sequence models and transformers. However, challenges arise when applying machine translation to specialized domains such as medical or scientific documents, largely because of technical terminology. The solution is to collect in-domain data and use it to fine-tune existing state-of-the-art general-domain machine translation models for the specific language pair of interest.
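
To make the fine-tuning idea concrete, here is a minimal sketch using Hugging Face Transformers and a publicly available Spanish-to-English model; this is only an illustration of the general technique, not the team's actual training pipeline or data.

```python
# Minimal sketch: adapt a general-domain Spanish->English model to a few
# in-domain sentence pairs (illustrative data, not the SciLake corpus).
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "Helsinki-NLP/opus-mt-es-en"  # general-domain baseline
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy in-domain parallel data; in practice this would be the mined corpus.
pairs = [
    {"src": "El tratamiento redujo la progresión tumoral.",
     "tgt": "The treatment reduced tumour progression."},
    {"src": "La dosis se ajustó según la respuesta clínica.",
     "tgt": "The dose was adjusted according to the clinical response."},
]
dataset = Dataset.from_list(pairs)

def preprocess(batch):
    # Tokenize source and target sentences for sequence-to-sequence training.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="mt-es-en-indomain",
    learning_rate=3e-5,
    num_train_epochs=1,
    per_device_train_batch_size=8,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The same pattern scales to millions of mined parallel sentences and to the other language pairs.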

Data collection

The team has focused on collecting domain-specific parallel data from the four SciLake pilot domains: Cancer Research, Energy Research, Neuroscience, and Transportation Research. Additionally, they have gathered general-purpose scientific data to ensure comprehensive coverage.

The data collection process involved downloading approximately 9.3 million records from 62 repositories, including theses, dissertations, and various scientific documents. The team meticulously parsed and classified the records, extracting metadata such as authors' names, abstracts, and titles. The result is a vast collection of parallel and monolingual sentences in English, Spanish, French, and Portuguese.
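
As a rough illustration of how bilingual records can yield parallel data, the snippet below pairs the English and original-language fields of a record; the field names are hypothetical and the real parsing pipeline is considerably more involved.

```python
# Illustrative only: turn bilingual records (hypothetical field names) into
# document-level parallel pairs for later sentence alignment and filtering.
records = [
    {"title_en": "Neural machine translation for oncology texts",
     "title_es": "Traducción automática neuronal para textos oncológicos",
     "abstract_en": "We study domain adaptation for clinical abstracts.",
     "abstract_es": "Estudiamos la adaptación de dominio para resúmenes clínicos."},
]

parallel_pairs = []
for rec in records:
    for field in ("title", "abstract"):
        src, tgt = rec.get(f"{field}_es"), rec.get(f"{field}_en")
        if src and tgt:  # keep only records where both languages are present
            parallel_pairs.append((src, tgt))

print(parallel_pairs[0])
```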

To ensure the quality of the machine translation systems, the team created benchmark test sets for each domain. These consist of 1000 parallel sentences, split into development and test sets. Additionally, a larger test set of 3000 parallel sentences was created for the general academic domain. These test sets allow the fine-tuned models to be evaluated.

Translation Models

To enhance the performance of the machine translation models, the team used a combination of in-domain data and general academic data. Since in-domain data was limited, the team incorporated as much relevant data as possible to improve the performance of the general-purpose models. For French-to-English and Spanish-to-English, the team employed the Transformer-Big architecture, with approximately 213 million parameters. For Portuguese-to-English, a Transformer-Base architecture with fewer parameters (approximately 65 million) was used.

The initial results, reported in the table below, show that the current models (fine-tuned with in-domain data alone, and with a combination of in-domain and general academic data) perform reasonably well. The evaluation scores are based on two well-established machine translation metrics, BLEU and COMET, which are computed by comparing the machine-generated translations against reference translations. Notably, the French-to-English system reported the lowest scores, likely due to the limited amount of data available for this language pair.
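
For reference, BLEU and COMET scores are commonly computed with the open-source sacrebleu and unbabel-comet packages; the sketch below shows the typical calls, although the specific COMET checkpoint is an assumption and not necessarily the one used by the team.

```python
# Requires: pip install sacrebleu unbabel-comet
import sacrebleu
from comet import download_model, load_from_checkpoint

sources    = ["La méthode améliore la précision de la classification."]
hypotheses = ["The method improves the classification accuracy."]
references = ["The method improves classification accuracy."]

# Corpus-level BLEU: n-gram overlap between hypotheses and references.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMET: a learned metric that also takes the source sentence into account.
ckpt = download_model("Unbabel/wmt22-comet-da")   # assumed checkpoint
comet_model = load_from_checkpoint(ckpt)
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print("COMET:", comet_model.predict(data, batch_size=8, gpus=0).system_score)
```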

The results also show a significant improvement in scores after fine-tuning with both in-domain and academic corpus data. On average, mixing the data improved scores by roughly 2.5 points: the in-domain data contributed more than 1.5 points, while the general academic data added almost one more.


How can we improve the results?

These findings are based on the first year of working with the fine-tuning process. The team has numerous ideas for further improving the results, including exploring multi-domain adaptation, finding more data sources, and using machine translation for back-translation to generate additional data. Additionally, they plan to integrate the machine translation models with the SciLake infrastructure and collaborate with project partners to maximize the potential benefits.

Demo

The team presented a web interface for translation requests and showcased the capabilities of the platform. Currently, this system is an internal prototype used for evaluating machine translation models and experimenting with various processes, such as generating text for future post-editing, and exploring new ideas. As an example, a Firefox plugin has been developed to allow users to request webpage translation using the Spanish model. This plugin is useful for translating Spanish articles while browsing.

Conclusions

Overall, the presentation offered valuable insights into the process of fine-tuning and evaluating machine translation models for specialized domains. With ongoing research and improvements, integrating domain-specific machine translation into the SciLake infrastructure holds great potential for enhancing scientific communication and collaboration.





SCILAKE TECHNICAL COMPONENTS

SciNoBo: Science No Borders

By Stefania Amodeo

In a recent webinar for SciLake partners, Haris Papageorgiou from Athena RC presented the SciNoBo toolkit for Open Science and discussed its benefits for science communities. SciNoBo, which stands for Science No Borders, is a powerful toolkit designed to facilitate open science practices. 

In this blog post, we recap the key points from the webinar and explore the different functionalities offered by SciNoBo.

The Toolkit

The SciNoBo toolkit provides a comprehensive range of modules and functionalities to support researchers in their scientific endeavors. Let's take a closer look at each of these modules and their benefits:

  • Publication Analysis

    Processes publications in PDF format and extracts valuable information such as tables, figures, text, affiliations, authors, citations, and references.

  • Field of Science (FoS) Analysis

    Uses a hierarchical classifier to assign one or more labels to a publication based on its content and metadata. The hierarchy consists of 6 levels, with the first 3 levels being standard in the literature. This approach adheres to well-established taxonomies in the scientific literature while also capturing the dynamics of scientific development at levels 5 and 6, where new topics emerge and others fade out (see image below).

  • Collaboration Analysis

    Analyzes collaborations between fields and identifies multidisciplinary papers. Provides insights and indicators to help researchers understand the interdisciplinarity of a publication and joint efforts of researchers from different disciplines.

  • Claim/Conclusion Detection

    Detects claims and conclusions in scientific publications, providing insights to analyze disinformation and misinformation. Helps identify if news statements are grounded in scientific terms and can collect claims and conclusions from different papers.

  • Citation Analysis

    Aggregates conclusions from various sources, aiding researchers in conducting surveys on citation analysis. Facilitates a comprehensive understanding of how the scientific community adopts or builds upon previous findings.

  • SDG Classification

    Categorizes publications and artifacts based on the Sustainable Development Goals (SDGs). It is a multi-label classifier that assigns multiple labels to a publication, allowing researchers to align their work with specific SDGs (a generic multi-label sketch is given after this list).

  • Interdisciplinarity

    Explores research classification at various levels and highlights interdisciplinary aspects. Helps identify collaborations across different fields.

  • Bio-Entity Tagging

    Extracts and annotates health publications based on bio entities like genes or proteins. Helps identify and analyze relevant biological information.

  • Citance Semantic Analysis

    Analyzes statements on previous findings in a specific topic. Assesses the scientific community's adoption or expansion of these conclusions, helping researchers understand the endorsement or acceptance of previous research.

  • Research Artifact Detection

    Extracts mentions and references of research artifacts from publications. The type of artifact depends on the specific domain, such as software (computer science), surveys (social sciences, humanities), or genes and proteins (biomedical sciences). The goal is to accurately extract all mentions together with the metadata that co-occur in the publication; from there, a database or knowledge graph of these artifacts can be built.
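
As a generic illustration of the multi-label classification idea behind the SDG module (and not SciNoBo's actual model or training data), the following scikit-learn sketch assigns several SDG labels to toy abstracts.

```python
# Generic multi-label text classification sketch: each abstract can receive
# several SDG labels at once (toy data, not SciNoBo's classifier).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

abstracts = [
    "Solar microgrids improve rural electrification and reduce emissions.",
    "A new screening protocol improves early detection of cervical cancer.",
    "Affordable photovoltaic systems support health clinics in remote areas.",
]
labels = [["SDG7", "SDG13"], ["SDG3"], ["SDG3", "SDG7"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)          # one binary column per SDG label

clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(abstracts, Y)

pred = clf.predict(["Wind and solar power for sustainable development."])
print(mlb.inverse_transform(pred))     # predicted label set for the new abstract
```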

Connection to SciLake Communities

The SciNoBo toolkit aims to remove barriers in the scientific community by providing a collaborative and intuitive assistant. It utilizes powerful language models and integrates with various datasets, including OpenAIRE, Semantic Scholar, and citation databases. Researchers can interact with the assistant using natural language, asking scientific questions and receiving insights from the available modules.

One of the main features of SciNoBo is its ability to help users find literature related to specific research areas or topics within a given domain. The platform provides a list of entries ranked by significance, along with their associated metadata. This allows researchers to easily access relevant publications and explore the research conducted in their field.

Once researchers have identified publications of interest, SciNoBo offers a wide range of functionalities to support their analysis. Users can explore the conclusions, methodology, and results of specific papers, and even read the full paper. The platform also enables users to analyze research artifacts mentioned in the papers, such as databases, genes, or medical ontologies. By examining the usage, citations, and related topics of these artifacts, researchers can gain a deeper understanding of the research landscape in their chosen field.

Each pilot project utilizes a different branch of the hierarchy to narrow down the publications that users may want to further analyze. Here are some examples of possible applications:

  • The Cancer pilot can create a Chronic Lymphocytic Leukemia (CLL) specific knowledge graph (see image below).
  • The Transportation pilot can identify publications that examine "automation" in transportation domains.
  • The Energy pilot can identify publications that examine "photovoltaics" and "distributed generation".
  • The Neuroscience pilot can identify publications that examine "Parkinson's disease" and "PBM treatment".

Six levels of the Field of Science classification system for the Cancer pilot use case

The platform equips researchers with the tools and functionalities to ask any type of question and receive insights based on their collected data. By using retrieval augmentation and feeding language models with the collection of publications, SciNoBo ensures accurate and relevant results.
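
The retrieval-augmentation pattern mentioned above can be sketched in a few lines; the embedding model and the ask_llm() helper below are generic placeholders rather than SciNoBo components.

```python
# Bare-bones retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant publications, then pass them as context to a language model.
from sentence_transformers import SentenceTransformer, util

publications = [
    "Ibrutinib improves progression-free survival in CLL patients.",
    "Photobiomodulation shows mixed results in Parkinson's disease models.",
    "Distributed photovoltaic generation reduces grid losses.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")      # placeholder model
doc_emb = embedder.encode(publications, convert_to_tensor=True)

def retrieve(question, k=2):
    # Rank publications by semantic similarity to the question.
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=k)[0]
    return [publications[h["corpus_id"]] for h in hits]

def ask_llm(question, context):
    # Hypothetical call to whichever language model backs the assistant.
    return f"[LLM answer to {question!r} grounded in {len(context)} passages]"

question = "What treatments are being studied for CLL?"
print(ask_llm(question, retrieve(question)))
```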

Furthermore, SciNoBo allows users to create their own collections of publications and save their results. This feature enables researchers to build their own knowledge graph and share their findings with the scientific community. By collaborating and expanding on each other's work, users can collectively develop a comprehensive understanding of their respective fields.

Conclusion

In conclusion, the SciNoBo platform is a valuable resource for science communities engaged in open science practices. With its wide range of tools and functionalities, researchers can explore and analyze publications, classify research fields, detect claims and conclusions, and analyze citations. By leveraging the power of large language models and access to diverse data sources, SciNoBo provides an intuitive and immersive platform for researchers to interact with the scientific community and gain valuable insights from the scientific literature.





Consortium meeting

SciLake 2nd Plenary Meeting

By Stefania Amodeo

The SciLake team met in Barcelona and online on November 9-10, 2023. The meeting, hosted by SIRIS Academic, provided an opportunity to review the progress made in the past year and plan future work.

This blog post gives a summary of the important topics discussed during the meeting, including the main goals and vision of the project, the challenges for the upcoming months, and the expectations for the pilot projects.

SciLake Main Motivation

The main motivation behind the SciLake project is to address the challenges of combining domain knowledge with open Scientific Knowledge Graphs (SKGs) and build valuable added-value services tailored for specific domains. This combination is hampered by various issues related to the ways domain-specific knowledge is organized (e.g., fragmentation, heterogeneous formats, multilingual texts, and interoperability issues with domain-agnostic SKGs).

"The project goal is to overcome these challenges and create a seamless integration between domain knowledge and open SKGs, ultimately empowering researchers and fostering a more interconnected and efficient scientific community." - Thanasis Vergoulis, SciLake coordinator

SciLake Vision

SciLake is developing a user-friendly “Scientific-Lake-as-a-Service” that is open, transparent, and customizable. This service, built upon the OpenAIRE Graph, will host both domain-specific and general knowledge, making it easier for communities to create, connect, and maintain their own SKGs, while also offering a unified way to access and search the respective information. The project is also developing two specialised services on top of the Scientific Lake: one to assist users in navigating the respective vast knowledge space by exploiting indicators of scientific impact, and another to improve research reproducibility in specific research domains. Finally, real-world pilot tests will be conducted to customise, test, and showcase these services in practice.

Challenges for Next Months

In the upcoming months, the project will focus on understanding the specific needs of the pilots in order to tailor the SciLake services effectively. Roadmaps will be developed for each component of the SciLake services, leading up to their alpha release in June 2024. The release will include comprehensive documentation and demos of each component.

Pilots Role

Pilots in the SciLake project play a crucial role in identifying relevant datasets, texts, knowledge bases/graphs, and ontologies for their domains. They also provide feedback on graph querying, knowledge discovery, and reproducibility requirements. Each pilot will create and update a domain-specific knowledge graph, while demo use cases will be used to test and evaluate the SciLake components for further refinement and improvement.

The SciLake plenary meeting in Barcelona was a productive gathering where the team reviewed their progress and outlined the future plans for the project. Overall, the SciLake project is making significant strides in bridging the gap between domain knowledge and open SKGs, bringing us closer to a more interconnected and efficient scientific community. 
Stay tuned for more updates on the progress of SciLake!




SCILAKE TECHNICAL COMPONENTS

AvantGraph: the Next-Generation Graph Analytics Engine

By Stefania Amodeo

In a webinar for SciLake partners, Nick Yakovets, Assistant Professor at the Department of Mathematics and Computer Science, Information Systems WSK&I at Eindhoven University of Technology (TU/e), introduced AvantGraph, a next-generation knowledge graph analytics engine. Yuanjin Wu and Daan de Graaf, graduate students in Nick's research group, presented a demo of the tool. 

Developed by TU/e researchers, AvantGraph aims to provide a unified execution platform for graph queries, supporting everything from simple questions to complex algorithms. In this blog post, we will delve into the philosophy behind AvantGraph, its query processing pipeline, and its impact on graph analytics.

 

The Philosophy: Questions over Graphs

The fundamental purpose of a database is to answer questions about data. For a graph database like AvantGraph, the focus is on asking questions over graphs. We can categorize these questions based on their expressiveness, and the degree to which databases can optimize their execution. Expressiveness refers to the richness and difficulty of the questions being asked, while optimization refers to how easy or difficult it is for databases to answer these questions. Based on this categorization, the range of questions that can be asked over graphs varies in complexity, as shown in the graphic below, from simple local look-ups to general algorithms that introduce iterations:

  • Local look-ups (e.g., the properties of data associated with a full text)
  • Neighborhood look-ups
  • Subgraph isomorphism (matching specific patterns of the graph)
  • Recursive path queries (introducing connectivity)
  • General algorithms (introducing iterations, e.g., PageRank)

Optimization level as a function of questions’ complexity.

AvantGraph aims to cover this full spectrum of questions, allowing users to optimize the execution of their queries and explore the richness of their data. It utilizes cutting-edge technologies to enable efficient processing of very large graphs on personal laptops.
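
To make the lower rungs of this ladder concrete, here are illustrative Cypher patterns for each category, wrapped in a small Python script; the :Paper/:Author schema is invented for the example and is not taken from an actual SciLake graph.

```python
# Illustrative Cypher patterns (as plain strings) for the complexity ladder
# above; labels such as :Paper, :Author, :CITES are hypothetical.
QUERIES = {
    "local_lookup":
        "MATCH (p:Paper {doi: '10.1234/abcd'}) RETURN p.title",
    "neighborhood_lookup":
        "MATCH (p:Paper {doi: '10.1234/abcd'})-[:CITES]->(q:Paper) RETURN q.title",
    "subgraph_isomorphism":
        "MATCH (a:Author)-[:WROTE]->(p:Paper)<-[:WROTE]-(b:Author) "
        "WHERE a <> b RETURN a.name, b.name, p.title",
    "recursive_path":
        "MATCH path = (p:Paper {doi: '10.1234/abcd'})-[:CITES*1..4]->(q:Paper) "
        "RETURN q.title, length(path)",
    # General algorithms such as PageRank go beyond what plain Cypher expresses;
    # this is the gap Graphalg (discussed below) is meant to fill.
}

for name, cypher in QUERIES.items():
    print(f"{name}:\n  {cypher}\n")
```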

AvantGraph Query Processing Pipeline

AvantGraph query processing pipeline, adapted from DOI:10.14778/3554821.3554878 

AvantGraph employs a standard database pipeline. It supports query languages like Cypher and SPARQL, and it features three additional main components to enable the execution of complex questions like algorithms:

  • the QuickSilver execution engine, a multi-threaded execution system that enables efficient query parallelization and hardware utilization;
  • the Magellan Planner, a query optimizer that returns efficient execution plans tailored to each query, taking into account the recursive and iterative nature of graph queries;
  • the BallPark cardinality estimator, a cost model that determines the best execution plan for different circumstances, optimizing query performance.

In addition, AvantGraph supports secondary storage, utilizing both memory and disk effectively. This allows it to process very large graphs on laptops without requiring excessive amounts of RAM.

Preparations for SciLake Pilots

As part of the SciLake project, AvantGraph is being extended with powerful data analytics capabilities and novel technologies to support research communities in defining graph algorithms.

Why do we need it?

Graph query languages such as Cypher or SPARQL are specifically designed for "subgraph matching". This makes them highly effective when you need to retrieve information such as "get me the neighbors of a specific node" or "find the shortest path between two nodes in the graph". However, these query languages are too limited for complex graph analytics such as PageRank.

Traditional solutions to this issue involve the database vendor providing a library of built-in algorithms that can be applied to the graph. While this works well if the library includes the algorithm needed to solve the problem, it cannot accommodate simple variations or fully custom algorithms.

What AvantGraph offers

AvantGraph introduces Graphalg, a programming language designed specifically for writing graph algorithms. Graphalg is fully integrated into AvantGraph, meaning, for example, that it can be embedded into Cypher queries.

Graphalg is based on linear algebra, which makes its syntax and operations easy to learn. The goal is for Graphalg to be a high-level language that is both user-friendly and efficiently executed by a database. This is achieved by transforming queries and Graphalg programs into a unified representation that can be optimized effectively, enabling optimizations that cross the boundary between query and algorithm and that would not otherwise be possible.
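
To see why a linear-algebra formulation maps so naturally onto graph algorithms, here is PageRank written as repeated matrix-vector products in plain NumPy; this is a generic sketch, not Graphalg syntax.

```python
# PageRank as linear algebra: repeated (sparse) matrix-vector products.
import numpy as np

# Adjacency matrix of a tiny 4-node graph: A[i, j] = 1 if node i links to j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

out_degree = A.sum(axis=1, keepdims=True)
M = (A / out_degree).T          # column-stochastic transition matrix
n, d = A.shape[0], 0.85         # number of nodes, damping factor

rank = np.full(n, 1.0 / n)
for _ in range(50):             # power iteration until approximate convergence
    rank = (1 - d) / n + d * M @ rank

print(rank / rank.sum())        # normalized PageRank scores
```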

AvantGraph supports the client-server model, which is commonly used by most modern database engines, including Postgres, MySQL, Neo4j, Amazon Neptune, Memgraph, and more. This allows AvantGraph databases to be queried through more than just a Command Line Interface.

As of now, AvantGraph databases can be queried from most major programming languages, including through a Python API, and the system will be extended in the future with more algorithms and functionality.

Conclusion

AvantGraph represents a significant advancement in knowledge graph analytics. By addressing the limitations of traditional graph query languages and introducing Graphalg, AvantGraph empowers users to perform complex graph analytics with ease. Its unified execution of everything from simple questions to general algorithms, coupled with its efficient query processing pipeline, makes it a valuable tool for researchers and data scientists. As AvantGraph continues to evolve and gain traction within the research community, we can expect to see exciting advances in graph analytics and a deeper understanding of complex data relationships.

Learn more

AvantGraph is presented in:

Leeuwen, W. V., Mulder, T., Wall, B. V., Fletcher, G., & Yakovets, N. (2022). AvantGraph Query Processing Engine. Proceedings of the VLDB Endowment, 15, 3698–3701. DOI:10.14778/3554821.3554878

For more information about AvantGraph and its publications, visit https://avantgraph.io/

AvantGraph will be released under an open license soon. To test its functionalities and perform graph queries, check out the docker container available on GitHub at https://github.com/avantlab/avantgraph/.





WORKSHOP

Defining the Roadmap for a European Cancer Data Space

By Stefania Amodeo

SciLake representatives participated in the EOSC4Cancer consultation to define a Roadmap for a European Cancer Data Space. This article recaps the key points from the workshop.

  

EOSC4Cancer, the European-wide foundation to accelerate data-driven cancer research, recently held a face-to-face workshop in Brussels to define a Roadmap for a European Cancer Data Space. SciLake representatives participated in a discussion with around 30 stakeholders from various sectors including research, industry, patient care, survivor groups, and EOSC. The discussion focused on key aspects for a sustainable cancer dataspace, such as access models, governance, data quality, security, and privacy.

The insights collected will be used in the creation of a roadmap, scheduled for publication in early 2025. This roadmap aims to shape the future of the European cancer dataspace, with policy recommendations to the European Commission.

EOSC4Cancer Objectives

EOSC4Cancer is a Europe-wide initiative that aims to accelerate data-driven cancer research. Launched in September 2022, this 30-month project will provide an infrastructure to exploit cancer data. It brings together comprehensive cancer centers, research infrastructures, leading research groups, and major computational infrastructures from across Europe.

The expected outcomes include a platform designed for storing, sharing, accessing, analyzing, and processing cancer research data. This involves interconnecting and ensuring interoperability of relevant datasets, as well as providing scientists with easy access to cancer research data and analysis systems. Additionally, the initiative will contribute to the Horizon Europe EOSC Partnership and other partnerships relevant to cancer research.

A Federated Digital Infrastructure for accelerating Cancer Research

The EOSC4Cancer Roadmap envisions a federated digital infrastructure to accelerate cancer research in the EU. It will rely on existing European and national structures and serve as the digital data infrastructure for the future Virtual European Cancer Research Institute, a platform enabling storage, sharing, access, analysis, and processing of research data. The structure will align with major European efforts for enabling the use and re-use of cancer-related data for research and will accommodate varying levels of maturity across member states. The federated infrastructure will include centralized components and capabilities, remote software execution, and a Cancer Research Commons repository. Each EU Member State is expected to have a National Data Hub for Cancer-related data and other relevant structures, such as national nodes and reference hospitals.

The plan is to develop in stages, first focusing on National Data Hubs, then involving Competence Centers and other participants. The National Data Hubs will reflect the structure of the Digital Hub. They will host a national database for cancer research data, allow the use of Research Environments, provide computing power, and coordinate outreach efforts.

Roundtable Discussions

The workshop held roundtable discussions on four topics: missing data types, missing data sources, requirements for national nodes, and any other element missing from the roadmap.

The first topic focused on identifying missing data types, such as synthetic data, specified clinical data, reference data, social data, patient-generated data, and the epidemiology of survivors. The discussion also emphasized the importance of preparing data for future use, connecting different types of data across various platforms, and ensuring that data comes from reliable sources by implementing quality checks and guidelines for data submission.

The second topic revolved around data sources: what is missing from the long list already identified, how trusted these sources are, and how they can be connected and prioritized. The key sources identified include structural biology data, patient-generated data, demographic data, complete clinical trial datasets, and biobanks.

The third topic concerned the requirements for national nodes. The discussion recognized that these nodes should be flexible and adaptable to the needs of different countries. It was agreed that coordinating existing initiatives, while not necessarily simple, is preferable to creating something new.

Finally, additional elements for the roadmap were identified. These relate to the patient journey, the treatment and product development journey, and non-tangible aspects such as data governance, clinical data usage, and cybersecurity. The roadmap should also consider incentives for scientists, communication strategies, and a plan for the platform's sustainability.

Conclusion

The workshop was a significant step towards defining the Roadmap for a European Cancer Data Space. The insightful discussions and collaborative efforts of the participants identified the requirements for comprehensive cancer data, trusted data sources, and functional national nodes.

Through these collaborative efforts, the EOSC4Cancer initiative is paving the way for a more data-driven and interconnected future in cancer research across Europe.
