
Unlocking insights in Cancer Research through Knowledge Graphs

Case Study

By Stefania Amodeo

In a recent meeting with the EOSC4Cancer Cancer Landscape Partnering (CLP), SciLake took center stage, introducing its vision and roadmap for unlocking insights in cancer research. Project coordinator Thanasis Vergoulis (Athena RC) and Leily Rabbani, bioinformatician at the Department of Molecular Medicine and Surgery at Karolinska Institutet, discussed the ongoing work towards creating a Cancer Research Knowledge Graph. This tool will provide context and connections for what is known about specific research questions, helping researchers design new experiments.

The SciLake Cancer Research pilot involves the Institute of Applied Biosciences (INAB-CERTH) in Greece and Karolinska Institutet in Sweden. Focused on the needs of researchers and clinicians, the project aims to harness the wealth of information available in public resources to address ongoing research questions.

The ultimate goal? To deepen our understanding of the molecular biology and immunopathology of Chronic Lymphocytic Leukemia (CLL) and study the potential effects of different mutations.

With the assistance of SciLake technical partners, members of the pilot project are using advanced algorithms to discover new insights from the knowledge graph. For example, one question they are exploring is: "How might a specific genetic mutation forecast a patient's overall health status, and what insights might related literature offer in this regard?"

Chronic Lymphocytic Leukemia (CLL)

  • Characterized by the accumulation of neoplastic B cells in the bone marrow, peripheral blood, and secondary lymphoid organs.

  • Patients can have a very diverse genetic landscape leading to heterogeneous clinical outcomes. This means progression rates and responses to drugs can vary greatly among patients.

  • The most common type of leukemia in adults.

  • Currently incurable.

Knowledge Graph: Benefits

The use of a scientific knowledge graph offers several benefits. It empowers research in precision medicine and diagnostics by facilitating the discovery of potential associations between identified biomarkers and other elements, such as genes, biological or functional pathways, and drugs. Furthermore, it is easily deployable and flexible, capable of integrating data from various sources, thereby offering a comprehensive view of the research landscape.


Knowledge Graph: Challenges

Developing a knowledge graph comes with its own set of challenges. The objective is to provide tools for creating and enriching the graph, and the primary concern is extracting the latent knowledge needed to populate it. Another significant challenge is establishing a common language among people with different expertise, such as clinicians and technical developers; this is crucial for effective communication and collaboration in developing and applying the knowledge graph. Finally, an important validation step involves manual curation to assess hidden associations and existing connections and ensure they are relevant to the specific biology experiment.

Where are we?

The development of the knowledge graph is progressing by leveraging several pre-existing state-of-the-art knowledge graphs. One is PrimeKG, which is used to query networks of genes or proteins connected to a specific disease. For example, the graph shows connections between CLL and TP53, a gene whose alterations can significantly increase the risk of various forms of cancer. Other, larger state-of-the-art knowledge graphs, based on various biomedical databases, are also being integrated alongside PrimeKG. This strategy capitalises on a broader set of databases and underlying connections, potentially uncovering missing links. One example is the revealed relation between CLL and the gene SOD1, which is known to be overexpressed in many human cancers.
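A disease-centred query of this kind can be sketched with a minimal, self-contained example. The triples below are illustrative placeholders, not actual PrimeKG content, although the CLL-TP53 and CLL-SOD1 associations mentioned above are represented:

```python
# Minimal sketch of querying a knowledge graph stored as (subject, relation,
# object) triples. The triples are illustrative placeholders, not real PrimeKG data.
TRIPLES = [
    ("CLL", "associated_with", "TP53"),
    ("CLL", "associated_with", "SOD1"),
    ("TP53", "participates_in", "apoptosis pathway"),
    ("SOD1", "overexpressed_in", "many human cancers"),
]

def neighbours(graph, node, relation=None):
    """Return objects linked to `node`, optionally filtered by relation type."""
    return [o for s, r, o in graph
            if s == node and (relation is None or r == relation)]

genes = neighbours(TRIPLES, "CLL", relation="associated_with")
print(genes)  # ['TP53', 'SOD1']
```

A real deployment would query a graph database rather than an in-memory list, but the pattern of filtering typed edges around a disease node is the same.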

A variety of knowledge graphs exist, each drawing from a different biomedical source, and we can collect more information through their combination. In fact, many details are unique to a particular graph and there is minimal overlap between them.

Our data flow involves using a variety of knowledge graphs, including those previously mentioned, along with different ontologies and other data sources. We utilize tools provided by SciLake to establish connections among them and generate a comprehensive cancer knowledge graph.
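Because the graphs overlap only minimally, combining them is close to a union of their facts. A minimal sketch of that idea, with all triples invented for illustration:

```python
# Sketch of combining several knowledge graphs with little mutual overlap.
# Each graph is a set of (subject, relation, object) triples; the union keeps
# every unique fact exactly once. All triples here are illustrative placeholders.
graph_a = {("CLL", "associated_with", "TP53")}
graph_b = {("CLL", "associated_with", "SOD1"),
           ("CLL", "associated_with", "TP53")}   # one overlapping fact
graph_c = {("TP53", "interacts_with", "MDM2")}

merged = graph_a | graph_b | graph_c
print(len(merged))  # 3 unique triples from 4 source entries
```

In practice the hard part is entity resolution (recognising that two sources mean the same gene or disease), which the SciLake tooling addresses; the set union only illustrates the end result.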

Dataflow towards a CLL KG


The ultimate vision is to create a comprehensive network of interconnected nodes and relationships. These nodes can represent entities such as institutions, grants, patents, publications, software, anatomical structures, diseases, drugs, compounds, gene targets, and many more. The relationships can take different forms, signifying connections such as mentions or associations. By creating this extensive web of connections, the network can be navigated and queried in a semi-automated manner to answer specific research questions. Moreover, the impact and reproducibility analysis services offered by SciLake can be utilized to prioritize findings.
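Navigating such a typed network in a semi-automated way might look like the following two-hop sketch, where all node names and edges are hypothetical placeholders:

```python
# Sketch of a semi-automated two-hop query over a typed network: which
# publications mention a gene that is associated with a given disease?
# Node names and edges are hypothetical placeholders.
EDGES = [
    ("paper:1", "mentions", "gene:TP53"),
    ("paper:2", "mentions", "gene:BRCA1"),
    ("gene:TP53", "associated_with", "disease:CLL"),
]

def targets(src, rel):
    """All nodes reachable from `src` via relation `rel`."""
    return {o for s, r, o in EDGES if s == src and r == rel}

def papers_about(disease):
    """Publications whose mentioned genes are associated with `disease`."""
    return sorted(
        p for p, r, g in EDGES
        if r == "mentions" and disease in targets(g, "associated_with")
    )

print(papers_about("disease:CLL"))  # ['paper:1']
```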

This approach will enhance the understanding of the molecular biology and immunopathology of Chronic Lymphocytic Leukemia. It will also assist in studying the potential effects of different mutations. The advantage of this method lies in its ability to incorporate information from multiple sources simultaneously, offering a comprehensive and insightful analysis.



Open Scholarly Communication: Insights from the EOSC Winter School 2024


By Stefania Amodeo

Open Scholarly Communication (OSC) is an important and evolving field that is gaining increasing attention in the scientific community. During the recent EOSC Winter School 2024, SciLake representatives and other leading experts gathered to discuss various aspects of OSC, from its definition to the challenges it faces, and its future trajectory. This blog post summarizes the key takeaways from this session.

Defining Open Scholarly Communication

The session started with a detailed exploration of the definition of open scholarly communication. The definitions provided by entities such as SPARC, DOAJ, OpenAIRE, and MIT Libraries were examined. OSC was collectively characterized as a process where research outputs are shared and disseminated openly and freely, without barriers, enabling the democratisation of knowledge. The discussion highlighted that the research lifecycle (researching, writing, publishing, assessing) is underpinned by infrastructures. Among these, OpenAIRE is a non-profit partnership of 50 organizations whose mission is to establish a permanent open scholarly communication infrastructure supporting European research. In addition to its infrastructure and services, the ongoing CRAFT-OA project aims to consolidate the Diamond Open Access publishing landscape and integrate it with EOSC and other large-scale aggregators. OPERAS is another infrastructure, spanning 27 domains, primarily covering the EU but also involving international collaborations, with the goal of becoming an ERIC by 2027.

Challenges in Open Scholarly Communication

While OSC holds great promise for the future of knowledge dissemination, it comes with challenges. These include the interdisciplinary nature of research, differing research methods, a variety of publication formats, the need for multilingualism, the existence of numerous open access initiatives, and the fragmentation of actors. The session also highlighted limits on the volume of publications that can be processed per month and the difficulty of achieving true interoperability. Despite technical standards being in place, real interoperability requires substantial manual work and collaboration between stakeholders, and resistance to change often hinders this process.

SciLake's Role

SciLake is a comprehensive project designed to assist scientific communities in constructing their own scientific knowledge graphs (SKGs). One of its primary challenges is enabling interoperability between domain-specific research and domain-agnostic databases such as the OpenAIRE Graph. The project is in line with the EOSC interoperability framework and actively collaborates with other scientific knowledge graphs to define core model entities through the Research Data Alliance (RDA) SKG Interoperability Framework.

The Future of Open Scholarly Communication

Looking ahead, the discussions pointed to the need for a clearer understanding of EOSC's scope. There was a suggestion to consider running the Diamond model as part of EOSC. This model, which combines open access, no article processing charges (APCs), and scholarly ownership, was seen as a potential route to facilitate open scholarly communication. The future of OSC also hinges on improved interoperability: even though technical solutions and platforms exist, human-resource capacity remains a challenge. The session concluded with a proposal to form a Task Force on Open Scholarly Communication, focusing on concrete actions that require resolution and cannot be addressed within a single project.


The EOSC Winter School 2024 provided a valuable platform for in-depth discussions on Open Scholarly Communication, fostering an understanding of its definition, evaluating the challenges it faces, and envisioning its future. It became clear that while OSC is a powerful tool for democratizing knowledge and enhancing research impact, it requires strategic interventions to overcome challenges, particularly in terms of interoperability. As the field continues to evolve, the scholarly community looks forward to continued dialogue and progress in OSC.

Read a full recap of the event here.



Domain-Specific Machine Translation for SciLake

SciLake Technical components

By Stefania Amodeo

In a recent webinar for SciLake partners, Sokratis Sofianopoulos and Dimitris Roussis from Athena RC presented their cutting-edge Machine Translation system, which will be integrated into the Scientific Lake Service.

The presentation emphasized the significance of gathering domain-specific parallel data to train machine translation systems. It specifically focused on the language pairs of French to English, Spanish to English, and Portuguese to English. Additionally, a demonstration of a few models created so far was showcased during the presentation.

This blog post recaps the key points from the webinar.


Machine translation involves the automated process of translating text or speech from one language to another. The current state of the art utilizes deep learning approaches like sequence-to-sequence models and transformers. However, challenges arise when applying machine translation to specialized domains such as medical or scientific documents due to technical terms. The solution is to collect in-domain data and use them to fine-tune existing state-of-the-art machine translation models of the general domain for the specific language pair of interest.

Data collection

The team has focused on collecting domain-specific parallel data from the four SciLake pilot domains: Cancer Research, Energy Research, Neuroscience, and Transportation Research. Additionally, they have gathered general-purpose scientific data to ensure comprehensive coverage.

The data collection process involved downloading approximately 9.3 million records from 62 repositories, including theses, dissertations, and various scientific documents. The team meticulously parsed and classified the records, extracting metadata such as authors' names, abstracts, and titles. The result is a vast collection of parallel and monolingual sentences in English, Spanish, French, and Portuguese.

To ensure the quality of the machine translation systems, the team created benchmark test sets for each domain. Each set consists of 1000 parallel sentences, divided into development and test sets. Additionally, a larger test set of 3000 parallel sentences was created for the general academic domain. These test sets allow for the evaluation of the fine-tuned models.
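The benchmark split described above can be sketched in a few lines. The 50/50 ratio and the synthetic sentence pairs are assumptions for illustration; the post does not state the actual proportions:

```python
import random

# Sketch of splitting a 1000-sentence benchmark into development and test sets.
# The 50/50 ratio and the synthetic data are assumptions for illustration only.
pairs = [(f"source sentence {i}", f"target sentence {i}") for i in range(1000)]

rng = random.Random(42)          # fixed seed for a reproducible split
rng.shuffle(pairs)
dev_set, test_set = pairs[:500], pairs[500:]
print(len(dev_set), len(test_set))  # 500 500
```

Fixing the random seed matters here: every model variant must be evaluated on exactly the same held-out sentences for the scores to be comparable.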

Translation Models

To enhance the performance of the machine translation models, the team utilized a combination of in-domain data and general academic data. Since in-domain data was limited, the team incorporated as much relevant data as possible to improve the performance of the general-purpose models. For language pairs such as French to English and Spanish to English, the team employed the Transformer-big architecture, a deep learning model with approximately 213 million parameters. For Portuguese to English, a Transformer-base architecture with fewer parameters (65 million) was used.

The initial results, reported in the table below, show that the current models (fine-tuned either with in-domain data only or with a combination of in-domain and general academic data) perform reasonably well. The evaluation scores are based on two well-established machine translation metrics, BLEU and COMET, which are computed by comparing machine-generated translations with reference translations. Notably, the French to English system reported the lowest scores, likely due to the limited amount of data available for this language pair.
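BLEU's core idea, clipped n-gram precision combined with a brevity penalty, can be illustrated in pure Python. This is a toy illustration of the principle, not the sacreBLEU implementation used in practice (and COMET, being a learned neural metric, cannot be sketched this way at all):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(hypothesis, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Illustration only, not sacreBLEU."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())           # clipped (min) counts
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * geo_mean

print(round(toy_bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```

A perfect match scores 1.0 (reported as 100 on the usual BLEU scale); any missing or spurious n-grams pull the score down.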

The results also show a significant improvement in scores after fine-tuning with both in-domain and academic corpus data. On average, mixing the data yielded a gain of about 2.5 points: the in-domain data contributed more than 1.5 points, while the general academic data added almost another point.

How can we improve the results?

These findings are based on the first year of working with the fine-tuning process. The team has numerous ideas for further improving the results, including exploring multi-domain adaptation, finding more data sources, and using machine translation for back-translation to generate additional data. Additionally, they plan to integrate the machine translation models with the SciLake infrastructure and collaborate with project partners to maximize the potential benefits.


The team presented a web interface for translation requests and showcased the capabilities of the platform. Currently, this system is an internal prototype used for evaluating machine translation models and experimenting with various processes, such as generating text for future post-editing, and exploring new ideas. As an example, a Firefox plugin has been developed to allow users to request webpage translation using the Spanish model. This plugin is useful for translating Spanish articles while browsing.


Overall, the presentation offered valuable insights into the process of fine-tuning and evaluating machine translation models for specialized domains. With ongoing research and improvements, integrating domain-specific machine translation into the SciLake infrastructure holds great potential for enhancing scientific communication and collaboration.



SciNoBo: Science No Borders


By Stefania Amodeo

In a recent webinar for SciLake partners, Haris Papageorgiou from Athena RC presented the SciNoBo toolkit for Open Science and discussed its benefits for science communities. SciNoBo, which stands for Science No Borders, is a powerful toolkit designed to facilitate open science practices. 

In this blog post, we recap the key points from the webinar and explore the different functionalities offered by SciNoBo.

The Toolkit

The SciNoBo toolkit provides a comprehensive range of modules and functionalities to support researchers in their scientific endeavors. Let's take a closer look at each of these modules and their benefits:

  • Publication Analysis

    Processes publications in PDF format and extracts valuable information such as tables, figures, text, affiliations, authors, citations, and references.

  • Field of Science (FoS) Analysis

    Uses a hierarchical classifier to assign one or more labels to a publication based on its content and metadata. The hierarchy consists of 6 levels, with the first 3 levels being standard in the literature. This approach allows the system to adhere to well-established taxonomies in the scientific literature while also capturing the dynamics of scientific developments at levels 5 and 6, where new topics emerge and others fade out (see image below).

  • Collaboration Analysis

    Analyzes collaborations between fields and identifies multidisciplinary papers. Provides insights and indicators to help researchers understand the interdisciplinarity of a publication and joint efforts of researchers from different disciplines.

  • Claim/Conclusion Detection

    Detects claims and conclusions in scientific publications, providing insights to analyze disinformation and misinformation. Helps identify if news statements are grounded in scientific terms and can collect claims and conclusions from different papers.

  • Citation Analysis

    Aggregates conclusions from various sources, aiding researchers in conducting surveys on citation analysis. Facilitates a comprehensive understanding of how the scientific community adopts or builds upon previous findings.

  • SDG Classification

    Categorizes publications and artifacts based on the Sustainable Development Goals (SDGs). It is a multi-label classifier that assigns multiple labels to a publication, allowing researchers to align their work with specific SDGs.

  • Interdisciplinarity

    Explores research classification at various levels and highlights interdisciplinary aspects. Helps identify collaborations across different fields.

  • Bio-Entity Tagging

    Extracts and annotates health publications based on bio entities like genes or proteins. Helps identify and analyze relevant biological information.

  • Citance Semantic Analysis

    Analyzes statements on previous findings in a specific topic. Assesses the scientific community's adoption or expansion of these conclusions, helping researchers understand the endorsement or acceptance of previous research.

  • Research Artifact Detection

    Extracts mentions and references of research artifacts from publications. The type of artifact extracted depends on the specific domain, such as software (computer science), surveys (social sciences, humanities), genes or proteins (biomedical sciences). The goal is to accurately extract all mentions and find all the metadata that coexist in the publication. From there, we can build a database or a knowledge graph that includes all of these artifacts.

Connection to SciLake Communities

The SciNoBo toolkit aims to remove barriers in the scientific community by providing a collaborative and intuitive assistant. It utilizes powerful language models and integrates with various datasets, including OpenAIRE, Semantic Scholar, and citation databases. Researchers can interact with the assistant using natural language, asking scientific questions and receiving insights from the available modules.

One of the main features of SciNoBo is its ability to help users find literature related to specific research areas or topics within a given domain. The platform provides a list of entries ranked by significance, along with their associated metadata. This allows researchers to easily access relevant publications and explore the research conducted in their field.

Once researchers have identified publications of interest, SciNoBo offers a wide range of functionalities to support their analysis. Users can explore the conclusions, methodology, and results of specific papers, and even read the full paper. The platform also enables users to analyze research artifacts mentioned in the papers, such as databases, genes, or medical ontologies. By examining the usage, citations, and related topics of these artifacts, researchers can gain a deeper understanding of the research landscape in their chosen field.

Each pilot project utilizes a different branch of the hierarchy to narrow down the publications that users may want to further analyze. Here are some examples of possible applications:

  • The Cancer pilot can create a Chronic Lymphocytic Leukemia (CLL) specific knowledge graph (see image below).
  • The Transportation pilot can identify publications that examine "automation" in transportation domains.
  • The Energy pilot can identify publications that examine "photovoltaics" and "distributed generation".
  • The Neuroscience pilot can identify publications that examine "Parkinson's disease" and "PBM treatment".

Six levels of the Field of Science classification system for the Cancer pilot use case
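A multi-level hierarchy of this kind can be modelled as paths in a simple tree. The labels below are hypothetical placeholders, not the real SciNoBo taxonomy:

```python
# Sketch of hierarchical Field of Science labels as paths in a tree.
# Label names are hypothetical placeholders, not the real SciNoBo taxonomy.
TAXONOMY = {
    "medical sciences": {
        "clinical medicine": {
            "oncology": {}            # deeper levels would nest here
        }
    }
}

def label_path(tree, target, path=()):
    """Depth-first search returning the full root-to-`target` label path."""
    for label, children in tree.items():
        new_path = path + (label,)
        if label == target:
            return new_path
        found = label_path(children, target, new_path)
        if found:
            return found
    return None

print(label_path(TAXONOMY, "oncology"))
# ('medical sciences', 'clinical medicine', 'oncology')
```

Assigning a fine-grained label thus implicitly assigns all of its ancestors, which is what lets the pilots filter publications by any branch of the hierarchy.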

The platform equips researchers with the tools and functionalities to ask any type of question and receive insights based on their collected data. By using retrieval-augmented techniques and feeding language models with the collection of publications, SciNoBo ensures accurate and relevant results.
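The retrieval step behind such question answering can be illustrated with a toy keyword-overlap ranker. Real retrieval-augmented systems use far richer representations, and the documents below are invented placeholders:

```python
# Toy sketch of retrieval: rank documents by keyword overlap with a query.
# Real retrieval-augmented systems use dense embeddings; this is a simplification,
# and the document collection is an invented placeholder.
DOCS = {
    "paper_a": "TP53 mutations in chronic lymphocytic leukemia",
    "paper_b": "photovoltaics and distributed generation in energy grids",
}

def rank(query, docs):
    """Order document ids by the number of query words they share."""
    q = set(query.lower().split())
    scores = {doc: len(q & set(text.lower().split())) for doc, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank("TP53 in leukemia", DOCS))  # ['paper_a', 'paper_b']
```

The top-ranked documents would then be passed to the language model as context, which is the essence of retrieval-augmented generation.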

Furthermore, SciNoBo allows users to create their own collections of publications and save their results. This feature enables researchers to build their own knowledge graph and share their findings with the scientific community. By collaborating and expanding on each other's work, users can collectively develop a comprehensive understanding of their respective fields.


In conclusion, the SciNoBo platform is a valuable resource for science communities engaged in open science practices. With its wide range of tools and functionalities, researchers can explore and analyze publications, classify research fields, detect claims and conclusions, and analyze citations. By leveraging the power of large language models and access to diverse data sources, SciNoBo provides an intuitive and immersive platform for researchers to interact with the scientific community and gain valuable insights from scientific literature.



SciLake 2nd Plenary Meeting

Consortium meeting

By Stefania Amodeo

The SciLake team met in Barcelona and online on November 9-10, 2023. The meeting, hosted by SIRIS Academic, provided an opportunity to review the progress made in the past year and plan future work.

This blog post gives a summary of the important topics discussed during the meeting, including the main goals and vision of the project, the challenges for the upcoming months, and the expectations for the pilot projects.

SciLake Main Motivation

The main motivation behind the SciLake project is to address the challenges of combining domain knowledge with open Scientific Knowledge Graphs (SKGs) and build valuable added-value services tailored for specific domains. This combination is hampered by various issues related to the ways domain-specific knowledge is organized (e.g., fragmentation, heterogeneous formats, multilingual texts, and interoperability issues with domain-agnostic SKGs).

"The project goal is to overcome these challenges and create a seamless integration between domain knowledge and open SKGs, ultimately empowering researchers and fostering a more interconnected and efficient scientific community." - Thanasis Vergoulis, SciLake coordinator

SciLake Vision

SciLake is developing a user-friendly “Scientific-Lake-as-a-Service” that is open, transparent, and customizable. This service, built upon the OpenAIRE Graph, will host both domain-specific and general knowledge, making it easier for communities to create, connect, and maintain their own SKGs, while also offering a unified way to access and search the underlying information. The project is also developing two specialised services on top of the Scientific Lake: one to assist users in navigating this vast knowledge space by exploiting indicators of scientific impact, and another to improve research reproducibility in specific research domains. Finally, real-world pilot tests will be conducted to customise, test, and showcase these services in practice.

Challenges for Next Months

In the upcoming months, the project will focus on understanding the specific needs of the pilots in order to tailor the SciLake services effectively. Roadmaps will be developed for each component of the SciLake services, leading up to their alpha release in June 2024. The release will include comprehensive documentation and demos of each component.

Pilots Role

Pilots in the SciLake project play a crucial role in identifying relevant datasets, texts, knowledge bases/graphs, and ontologies for their domains. They also provide feedback on graph querying, knowledge discovery, and reproducibility requirements. Each pilot will create and update a domain-specific knowledge graph, while demo use cases will be used to test and evaluate the SciLake components for further refinement and improvement.

The SciLake plenary meeting in Barcelona was a productive gathering where the team reviewed their progress and outlined the future plans for the project. Overall, the SciLake project is making significant strides in bridging the gap between domain knowledge and open SKGs, bringing us closer to a more interconnected and efficient scientific community. 
Stay tuned for more updates on the progress of SciLake!
