Case Study


This pilot will support the identification of biomarkers with risk-stratifying or predictive value that are relevant to personalized treatment and care, i.e., precision medicine. It will use the Scientific Lake service to create a domain-specific knowledge graph (KG) that models the respective knowledge space from the defined (internal or external) data sources. These include clinically relevant databases (such as ClinVar and COSMIC) as well as topic-specific databases (such as dbSNP, KEGG and OMIM). KI & CERTH will use this knowledge graph, along with an adapted version of the scientific merit-driven knowledge space navigation services, to identify and retrieve new connecting elements between the biomarkers.

Indicative scenarios include: (a) offering compiled evidence for a biomarker (such as target gene recommendations), based on a user-defined literature corpus (e.g., through keywords) and a user-defined dataset, taking into account the level of replication of the corresponding research studies and experiments as a quality indicator; (b) using a user-provided dataset of patient cohort molecular profiles to aggregate all relevant data points from internal and external databases and generate the highest-ranking path connecting heterogeneous research objects that can effectively stratify the patient cohort; and (c) leveraging the SciLake API to generate a service bundle that can be applied directly to sensitive data, thereby allowing for the direct application of federated machine learning approaches.
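As a rough illustration of scenario (b), the sketch below ranks paths through a toy knowledge graph. It is not the SciLake implementation: the node names, the edge scores and the `best_path` helper are all hypothetical, and it assumes that a path is ranked by the product of per-edge evidence scores. Under that assumption, the highest-ranking path can be found with Dijkstra's algorithm applied to the negative logarithm of each score.

```python
import heapq
import math

# Toy directed knowledge graph. Node names and evidence scores are
# made-up placeholders, not real biomedical associations.
EDGES = {
    "variant_V": {"gene_A": 0.9, "gene_B": 0.6},
    "gene_A": {"pathway_P": 0.8},
    "gene_B": {"pathway_P": 0.95, "drug_D": 0.3},
    "pathway_P": {"drug_D": 0.7},
    "drug_D": {},
}


def best_path(edges, source, target):
    """Return (score, path) for the path with the largest product of edge scores."""
    # Maximising a product of scores in (0, 1] is the same as minimising
    # the sum of -log(score), so plain Dijkstra applies.
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return math.exp(-cost), path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, score in edges.get(node, {}).items():
            if neighbour not in settled:
                heapq.heappush(
                    queue, (cost - math.log(score), neighbour, path + [neighbour])
                )
    # No path connects source and target.
    return 0.0, []


score, path = best_path(EDGES, "variant_V", "drug_D")
print(path, round(score, 3))
```

In this toy graph the route through `gene_A` and `pathway_P` wins because its combined score is higher than that of either route through `gene_B`. A real ranking would presumably also weigh the merit and replication indicators mentioned in scenario (a).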


The cancer research community engages in biomarker discovery and validation, the clinical interpretation of variants that appear in the genomic sequences of tumors in cancer patients, the risk stratification of patients, and the prediction of therapy response, particularly with regard to targeted therapies. Cancer is a broad term encompassing a wide and diverse range of diseases, where several distinct entities can be discerned based on both cell-intrinsic and cell-extrinsic parameters, even within each cancer type. When several such parameters are considered collectively, the resulting subgroups with clinically relevant differences are most often too small to be studied with sufficient statistical power in a single institution or even at a national level. Methods that simplify data aggregation and comparative studies on an international basis are therefore needed. To develop and showcase such methods, chronic lymphocytic leukemia (CLL), the most common leukemia, serves as a representative cancer type, since recent advances in biomarker discovery and targeted therapies have enabled some categorisation of patients according to clinical course and response to therapy.

Domain-specific data/metadata

In the context of this proposal, a combination of institutional and publicly available data will be used. Specifically, the services developed by the SciLake project will be interlinked with aggregated output that is continuously produced through the analysis of omics data (such as RNA-seq, amplicon-based sequencing and whole-genome sequencing) generated by national efforts (including Precision Medicine Initiatives) and cancer research projects. In addition, publicly available domain-specific data from sources such as PubMed (literature, information retrieval) and ClinVar (aggregation of information about genomic variation and its relationship to human health) will also be taken into consideration.

Current needs & challenges

One of the main bottlenecks in cancer research is enriching the patient-specific data, as well as inferring potential associations of the identified biomarkers to additional elements (genes, pathways, drugs, etc.). Specifically:

  • Data and information need to be integrated and linked between external, publicly available databases and internal repositories and/or patient cohort data. The integration should remain human-readable and support extension with additional data points.
  • Given a particular set of biomarkers from within the knowledge graph (KG), the challenge is to retrieve and identify additional elements that support new associations or links between them. Of particular importance is the ability to offer new insights for novel biomarkers through (semi-)automated mining of the relevant literature.
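To make the second challenge concrete, the sketch below finds candidate connecting elements between biomarkers in a toy graph. It assumes the simplest possible notion of a "connecting element", namely a node directly linked to both biomarkers in a pair. The node names, the `LINKS` data and both helper functions are hypothetical placeholders and do not reflect the SciLake services.

```python
from itertools import combinations

# Toy undirected links between biomarkers and other research objects.
# All names are made-up placeholders.
LINKS = [
    ("biomarker_1", "gene_A"),
    ("biomarker_2", "gene_A"),
    ("biomarker_1", "pathway_P"),
    ("biomarker_2", "pathway_P"),
    ("biomarker_2", "drug_D"),
    ("biomarker_3", "drug_D"),
]


def build_adjacency(links):
    """Build an undirected adjacency map: node -> set of neighbouring nodes."""
    adjacency = {}
    for left, right in links:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


def connecting_elements(adjacency, biomarkers):
    """Map each biomarker pair to the sorted list of nodes linked to both."""
    result = {}
    for first, second in combinations(sorted(biomarkers), 2):
        shared = adjacency.get(first, set()) & adjacency.get(second, set())
        if shared:
            result[(first, second)] = sorted(shared)
    return result


ADJ = build_adjacency(LINKS)
SHARED = connecting_elements(ADJ, ["biomarker_1", "biomarker_2", "biomarker_3"])
for pair, elements in SHARED.items():
    print(pair, "->", elements)
```

Pairs with no shared neighbour, such as `biomarker_1` and `biomarker_3` here, are omitted from the result. Links mined from the literature could be added to `LINKS` in the same form, which is one way the graph could be extended with new data points.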

Expected outcomes

  • Use of the Scientific Lake service to directly interface EOSC resources with the relevant databases and construct a domain-specific KG that represents the domain of the respective research questions (incl. patient stratification based on disease prediction).

  • Use of the impact analysis service and reproducibility analysis service to prioritize reading in the context of the clinical interpretation of genomic variants that have been identified as interesting via NGS analysis.

Organisations Involved


Georgios Gavriilidis
Daniel Hägerstrand

Resources & Links

Coming soon...