
Cancer

    • Georgios Gavriilidis
    • Daniel Hägerstrand

This pilot will support the identification of biomarkers with risk-stratifying or predictive value that are relevant to personalized treatment and care, i.e., precision medicine. It will use the Scientific Lake service to create a domain-specific knowledge graph (KG) that models the relevant knowledge space from the defined internal and external data sources. These include clinically relevant databases (such as ClinVar and COSMIC) as well as topic-specific databases (such as dbSNP, KEGG and OMIM). KI & CERTH will use this knowledge graph, together with an adapted version of the scientific merit-driven knowledge space navigation services, to identify and retrieve new connecting elements between the biomarkers.

Indicative scenarios include: (a) offering compiled evidence for a biomarker (such as target gene recommendations), based on a user-defined literature corpus (e.g., selected through keywords) and a user-defined dataset, taking into account the level of replication of the corresponding research studies and experiments as a quality indicator; (b) using a user-provided dataset of patient cohort molecular profiles to aggregate all relevant data points from internal and external databases and generate the highest-ranking path connecting heterogeneous research objects that can effectively stratify the patient cohort; and (c) leveraging the SciLake API to generate a service bundle that can be applied directly to sensitive data, thereby allowing for the direct application of federated machine learning approaches.
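Scenario (b) does not specify how a path is ranked. As a minimal sketch, assume that every link in the graph carries a confidence score between 0 and 1 and that a path is ranked by the product of its link scores. Under that assumption, the highest-ranking path can be found with Dijkstra's algorithm on the negative logarithm of the scores. The function name, node names and scores below are hypothetical placeholders, not SciLake data or the SciLake API.

```python
import heapq
import math


def best_path(edges, source, target):
    """Return (score, path) for the path with the highest product of edge scores.

    edges: iterable of (node_a, node_b, confidence), with 0 < confidence <= 1.
    Links are treated as undirected. Returns (0.0, []) if no path exists.
    """
    graph = {}
    for a, b, conf in edges:
        if not 0.0 < conf <= 1.0:
            raise ValueError(f"confidence must be in (0, 1], got {conf}")
        graph.setdefault(a, []).append((b, conf))
        graph.setdefault(b, []).append((a, conf))

    # Maximizing a product of scores equals minimizing the sum of -log(score).
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return math.exp(-cost), path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, conf in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost - math.log(conf), neighbor, path + [neighbor]))
    return 0.0, []


# Hypothetical toy graph: all names and scores are illustrative placeholders.
toy_edges = [
    ("patient_cohort", "variant_X", 0.9),
    ("variant_X", "gene_A", 0.8),
    ("gene_A", "drug_Z", 0.5),
    ("variant_X", "pathway_P", 0.6),
    ("pathway_P", "drug_Z", 0.9),
]
score, path = best_path(toy_edges, "patient_cohort", "drug_Z")
print(path, round(score, 3))
```

In this toy graph the route through pathway_P wins because its product of scores (0.9 × 0.6 × 0.9) is higher than the route through gene_A (0.9 × 0.8 × 0.5). A real ranking would depend on how the pilot defines link quality, for example through replication or impact indicators.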


Background

The cancer research community engages in biomarker discovery and validation, the clinical interpretation of variants that appear in the genomic sequences of tumors in cancer patients, the risk stratification of patients, and the prediction of therapy response, particularly with regard to targeted therapies. Cancer is a broad term encompassing a diverse range of diseases, in which several distinct entities can be discerned based on both cell-intrinsic and cell-extrinsic parameters, even within each cancer type. When several such parameters are considered together, the resulting subgroups with clinically relevant differences are most often too small to be studied with sufficient statistical power in a single institution or even at a national level. Methods that simplify data aggregation and comparative studies across countries are therefore needed. Chronic lymphocytic leukemia (CLL), the most common leukemia, is a representative cancer type for developing and showcasing such methods, since recent advances in biomarker discovery and targeted therapies have made it possible to categorise patients, to some extent, according to clinical course and response to therapy.

Domain-specific data/metadata

In this pilot, a combination of institutional and publicly available data will be used. Specifically, the services developed by the SciLake project will be interlinked with aggregated output that is continuously produced through the analysis of omics data (such as RNA-seq, amplicon-based sequencing and whole-genome sequencing) generated by national efforts (including Precision Medicine Initiatives) and cancer research projects. In addition, publicly available domain-specific data from sources such as PubMed (literature, information retrieval) and ClinVar (aggregation of information about genomic variation and its relationship to human health) will be taken into consideration.

Current needs & challenges

One of the main bottlenecks in cancer research is enriching the patient-specific data, as well as inferring potential associations of the identified biomarkers to additional elements (genes, pathways, drugs, etc.). Specifically:

  • Data and information from external, publicly available databases need to be integrated and linked with internal repositories and/or patient cohort data. The integration should be human-readable and support extension with additional data points.
  • Given a particular set of biomarkers from within the KG, the challenge is to retrieve and identify additional elements that support new associations or links between them. Of particular importance is the ability to offer new insights for novel biomarkers through (semi-)automated mining of the relevant literature.

Expected outcomes

  • Use of the Scientific Lake service to directly interface EOSC resources to the relevant databases and construct a domain-specific KG, representing the domain which relates to the respective research questions (incl. patient stratification based on disease prediction).

  • Use of the impact analysis service and reproducibility analysis service to prioritize reading in the context of the clinical interpretation of genomic variants that have been identified as interesting via NGS analysis.

Discover SciLake Cancer Pilot


The SciLake Cancer Knowledge Graph

SciLake is in full swing with its pilot programs in the fields of neuroscience, cancer research, transportation, and energy. These initiatives aim to create or enrich domain-specific Scientific Knowledge Graphs that capture valuable knowledge from each scientific field.

The SciLake Cancer Pilot is developing a first-of-its-kind cancer knowledge graph, with the aim to make public resources in biology and cancer more accessible to the research community.

The case study in focus is Chronic Lymphocytic Leukemia (CLL), the most prevalent adult leukemia. The cancer knowledge graph will assist in discovering essential biomarkers for personalised treatment and care, a critical step towards achieving precision medicine.

Leading the pilot are researchers from the Centre for Research and Technology Hellas (CERTH) in Greece and the Karolinska Institutet in Sweden.



Unlocking insights in Cancer Research through Knowledge Graphs


Case Study

By Stefania Amodeo

In a recent meeting with the EOSC4Cancer Cancer Landscape Partnering (CLP), SciLake took center stage as it introduced its vision and roadmap for unlocking insights in cancer research. Project coordinator Thanasis Vergoulis (Athena RC) and Leily Rabbani, bioinformatician at the Department of Molecular Medicine and Surgery at Karolinska Institutet, discussed the ongoing work towards creating a Cancer Research Knowledge Graph. This tool will provide context and connections for what is known about specific research questions, helping researchers as they design new experiments.

The SciLake Cancer Research pilot involves the Institute of Applied Bioscience (INAB-CERTH) in Greece and Karolinska Institutet in Sweden. Focused on meeting the needs of researchers and clinicians, the project aims to harness the wealth of information available in public resources to address ongoing research questions.

The ultimate goal? To deepen our understanding of the molecular biology and immunopathology of Chronic Lymphocytic Leukemia (CLL) and study the potential effects of different mutations.

With the assistance of SciLake technical partners, members of the pilot project are utilising advanced algorithms to discover new insights from the knowledge graph. For example, one interesting question they are exploring is, "how might a specific genetic mutation forecast a patient's overall health status and what insights might related literature offer in this regard?"


Chronic Lymphocytic Leukemia (CLL)

  • Characterized by the accumulation of neoplastic B cells in the bone marrow, blood, and secondary lymphoid tissues such as the lymph nodes.

  • Patients can have a very diverse genetic landscape leading to heterogeneous clinical outcomes. This means progression rates and responses to drugs can vary greatly among patients.

  • The most common type of leukemia in adults.

  • Currently incurable.

Knowledge Graph: Benefits

The use of a scientific knowledge graph offers several benefits. It empowers research in precision medicine and diagnostics by facilitating the discovery of potential associations between identified biomarkers and other elements, such as genes, biological or functional pathways, and drugs. Furthermore, it is easily deployable and flexible, capable of integrating data from various sources, thereby offering a comprehensive view of the research landscape.

Challenges

Developing a knowledge graph comes with its own set of challenges. The objective is to provide tools for creating and enriching the graph, and the primary technical concern is extracting the latent knowledge needed to build it. Another significant challenge is establishing a common language among people with different expertise, such as clinicians and technical developers, which is crucial for effective communication and collaboration in the development and application of the knowledge graph. Finally, validating the graph requires manual curation to assess both hidden associations and existing connections and to ensure they are relevant to the specific biological experiment.

Where we are

The development of the knowledge graph is progressing by leveraging several pre-existing state-of-the-art knowledge graphs. One is PrimeKG, which is used to query networks of genes or proteins connected to a specific disease. For example, the graph shows connections between CLL and TP53, a gene known for its potential to significantly increase the risk of various forms of cancer when altered. Other, larger state-of-the-art knowledge graphs, based on various biomedical databases, are also being integrated along with PrimeKG. This strategy aims to capitalise on a broader set of databases and underlying connections, potentially uncovering previously missing links. An example is the relation it revealed between CLL and the gene SOD1, known for being overexpressed in many human cancers.

A variety of knowledge graphs exist, each drawing from a different biomedical source, and we can collect more information through their combination. In fact, many details are unique to a particular graph and there is minimal overlap between them.
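The effect of combining graphs can be illustrated with a minimal sketch. It assumes each source graph can be exported as (head, relation, tail) triples and that associations are undirected; real integration would also require mapping identifiers across ontologies, which is not shown here. The graph names kg_a and kg_b, the relation label and the node GENE_X are hypothetical placeholders.

```python
def normalize(triple):
    """Make a triple comparable across sources: uniform case, sorted endpoints."""
    head, relation, tail = triple
    a, b = sorted((head.upper(), tail.upper()))
    return (a, relation.lower(), b)


def merge_graphs(graphs):
    """Merge triples from several graphs and record which sources report each one.

    graphs: dict mapping a source name to an iterable of triples.
    Returns a dict mapping each normalized triple to the set of source names.
    """
    merged = {}
    for name, triples in graphs.items():
        for triple in triples:
            merged.setdefault(normalize(triple), set()).add(name)
    return merged


def jaccard_overlap(triples_a, triples_b):
    """Share of triples common to both graphs, relative to their union."""
    a = {normalize(t) for t in triples_a}
    b = {normalize(t) for t in triples_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy exports; the contents are illustrative, not real graph dumps.
kg_a = [("CLL", "associated_with", "TP53"), ("CLL", "associated_with", "GENE_X")]
kg_b = [("tp53", "associated_with", "cll"), ("CLL", "associated_with", "SOD1")]

merged = merge_graphs({"kg_a": kg_a, "kg_b": kg_b})
overlap = jaccard_overlap(kg_a, kg_b)
unique_to_one_source = [t for t, sources in merged.items() if len(sources) == 1]
print(len(merged), round(overlap, 2), len(unique_to_one_source))
```

Keeping the set of sources per triple preserves provenance, so a curator can see whether a link is supported by several graphs or appears in only one of them.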

Our data flow involves using a variety of knowledge graphs, including those previously mentioned, along with different ontologies and other data sources. We utilize tools provided by SciLake to establish connections among them and generate a comprehensive cancer knowledge graph.

Dataflow towards a CLL KG

Goals

The ultimate vision is to create a comprehensive network of interconnected nodes and relationships. The nodes can represent entities such as institutions, grants, patents, publications, software, anatomical structures, diseases, drugs, compounds, gene targets, and many more. The relationships can signify different types of connections, such as mentions or associations. With this extensive web of connections in place, the network can be navigated and queried in a semi-automated manner to answer specific research questions. Moreover, the impact and reproducibility analysis services offered by SciLake can be used to prioritize findings.
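A minimal sketch of such a typed network is shown below. It is an in-memory illustration only: the class, its method names, the relation labels and the node publication_1 are assumptions made for this example and do not reflect the SciLake implementation or API.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph with typed nodes and labeled, undirected relations."""

    def __init__(self):
        self.node_types = {}            # node id -> type, e.g. "gene"
        self.edges = defaultdict(list)  # node id -> list of (relation, other node id)

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, head, relation, tail):
        # Store the link in both directions so it can be queried from either end.
        self.edges[head].append((relation, tail))
        self.edges[tail].append((relation, head))

    def neighbors(self, node_id, relation=None, node_type=None):
        """Return sorted neighbor ids, optionally filtered by relation and node type."""
        found = []
        for rel, other in self.edges.get(node_id, []):
            if relation is not None and rel != relation:
                continue
            if node_type is not None and self.node_types.get(other) != node_type:
                continue
            found.append(other)
        return sorted(found)


kg = KnowledgeGraph()
kg.add_node("CLL", "disease")
kg.add_node("TP53", "gene")
kg.add_node("SOD1", "gene")
kg.add_node("publication_1", "publication")  # hypothetical placeholder node
kg.add_edge("CLL", "associated_with", "TP53")
kg.add_edge("CLL", "associated_with", "SOD1")
kg.add_edge("publication_1", "mentions", "TP53")

genes_for_cll = kg.neighbors("CLL", relation="associated_with", node_type="gene")
papers_for_tp53 = kg.neighbors("TP53", node_type="publication")
print(genes_for_cll, papers_for_tp53)
```

Filtering by relation and node type is what turns a research question such as "which genes are associated with this disease, and which publications mention them?" into a pair of simple graph lookups.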

This approach will enhance the understanding of the molecular biology and immunopathology of Chronic Lymphocytic Leukemia. It will also assist in studying the potential effects of different mutations. The advantage of this method lies in its ability to incorporate information from multiple sources simultaneously, offering a comprehensive and insightful analysis.