
Discover SciLake CCAM Transport Pilot


SciLake Pilots

Guiding CCAM Research through Scientific Knowledge Graphs

SciLake is in full swing with its pilot programs in the fields of neuroscience, cancer research, transportation, and energy. These initiatives aim to create or enrich domain-specific Scientific Knowledge Graphs that capture valuable knowledge from each scientific field.

The SciLake CCAM Transport Pilot aims to consolidate the existing knowledge around Cooperative Connected Automated Mobility (CCAM) to enrich the Transportation Scientific Knowledge Graph (SKG) of the OpenAIRE Community.

Leading the pilot are researchers from I-SENSE, a Research Group of the Institute of Communication and Computer Systems (ICCS) of the National Technical University of Athens.

Read the press release




Discover SciLake Energy Pilot


SciLake Pilots

SciLake Regional Energy Planning Pilot for Sustainable Regions


The SciLake Regional Energy Planning (REP) pilot focuses on sustainable energy solutions tailored to regional contexts.

The pilot's goal is to improve the accessibility and interoperability of scientific knowledge in order to facilitate the Regional Energy Transition.

Leading the pilot are researchers from HES-SO School of Engineering in Switzerland.

Read the press release



Unlocking insights in Cancer Research through Knowledge Graphs


Case Study


By Stefania Amodeo

In a recent meeting with the EOSC4Cancer Cancer Landscape Partnering (CLP), SciLake took center stage as it introduced its vision and roadmap for unlocking insights in cancer research. Project coordinator Thanasis Vergoulis (Athena RC) and Leily Rabbani, a bioinformatician at the Department of Molecular Medicine and Surgery at Karolinska Institutet, discussed the ongoing work towards creating a Cancer Research Knowledge Graph. This innovative tool will provide context and connections for what is known about specific research questions, helping researchers as they design new experiments.

The SciLake Cancer Research pilot involves the Institute of Applied Biosciences (INAB/CERTH) in Greece and Karolinska Institutet in Sweden. Focused on the needs of researchers and clinicians, the project aims to harness the wealth of information available in public resources to address ongoing research questions.

The ultimate goal? To deepen our understanding of the molecular biology and immunopathology of Chronic Lymphocytic Leukemia (CLL) and study the potential effects of different mutations.

With the assistance of SciLake technical partners, members of the pilot project are utilising advanced algorithms to discover new insights from the knowledge graph. For example, one interesting question they are exploring is: "How might a specific genetic mutation forecast a patient's overall health status, and what insights might the related literature offer?"


Chronic Lymphocytic Leukemia (CLL)

  • Characterized by the accumulation of neoplastic B cells in the bone marrow, peripheral blood, and secondary lymphoid organs.

  • Patients can have a very diverse genetic landscape leading to heterogeneous clinical outcomes. This means progression rates and responses to drugs can vary greatly among patients.

  • The most common type of leukemia in adults.

  • Currently incurable.

Knowledge Graph: Benefits

The use of a scientific knowledge graph offers several benefits. It empowers research in precision medicine and diagnostics by facilitating the discovery of potential associations between identified biomarkers and other elements, such as genes, biological or functional pathways, and drugs. Furthermore, it is easily deployable and flexible, capable of integrating data from various sources, thereby offering a comprehensive view of the research landscape.

Challenges

Developing a knowledge graph comes with its own set of challenges. The objective is to provide tools for creating and enriching the graph, and the primary concern is extracting the latent knowledge needed to build it. Another significant challenge is establishing a common language among people with different expertise, such as clinicians and technical developers; this is crucial for effective communication and collaboration in the development and application of the knowledge graph. Finally, an important validation step involves manually curating the graph to assess hidden associations and existing connections and to ensure they are relevant to the specific biological experiment.

Where we stand

The development of the knowledge graph is progressing by leveraging several pre-existing state-of-the-art knowledge graphs. One is PrimeKG, which is used to query networks of genes or proteins connected to a specific disease. For example, the graph shows connections between CLL and TP53, a gene whose alterations significantly increase the risk of various forms of cancer. Other, larger state-of-the-art knowledge graphs, based on various biomedical databases, are also being integrated alongside PrimeKG. This strategy aims to capitalise on a broader set of databases and underlying connections, potentially uncovering missing links. An example is the revealed relation between CLL and the gene SOD1, known to be overexpressed in many human cancers.
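As a rough illustration of this kind of query, the following sketch scans a PrimeKG-style edge list for genes or proteins linked to CLL. The file name and column names follow PrimeKG's published CSV release but are assumptions to verify against the version you download; the pilot's actual tooling may differ.

```python
# A minimal sketch (not the pilot's actual tooling) of querying a
# PrimeKG-style edge list for genes/proteins linked to CLL.
# "kg.csv" and the column names follow PrimeKG's published CSV release,
# but verify them against the version you download.
import pandas as pd

edges = pd.read_csv("kg.csv", low_memory=False)

disease = "chronic lymphocytic leukemia"  # assumed node name

# Keep edges where one endpoint is the disease and the other a gene/protein.
mask = (
    (edges["x_name"].str.lower() == disease) & (edges["y_type"] == "gene/protein")
) | (
    (edges["y_name"].str.lower() == disease) & (edges["x_type"] == "gene/protein")
)
hits = edges[mask]

genes = sorted(
    set(hits.loc[hits["x_type"] == "gene/protein", "x_name"])
    | set(hits.loc[hits["y_type"] == "gene/protein", "y_name"])
)
print(f"{len(genes)} genes/proteins linked to {disease}")  # TP53 should be among them
```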

A variety of knowledge graphs exist, each drawing from a different biomedical source, and we can collect more information through their combination. In fact, many details are unique to a particular graph and there is minimal overlap between them.

Our data flow involves using a variety of knowledge graphs, including those previously mentioned, along with different ontologies and other data sources. We utilize tools provided by SciLake to establish connections among them and generate a comprehensive cancer knowledge graph.

Figure: Data flow towards a CLL knowledge graph.
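To make the combination step concrete, here is a toy sketch using networkx; the node and relation names are hypothetical, and the actual pipeline relies on SciLake's own tools rather than this ad hoc merge.

```python
# Toy illustration of merging two small knowledge graphs; all node and
# relation names are hypothetical.
import networkx as nx

primekg = nx.MultiDiGraph()
primekg.add_edge("CLL", "TP53", relation="disease_gene")

other_kg = nx.MultiDiGraph()
other_kg.add_edge("CLL", "SOD1", relation="disease_gene")
other_kg.add_edge("SOD1", "oxidative stress", relation="gene_pathway")

# nx.compose keeps the union of nodes and edges; shared nodes such as "CLL"
# become junction points that connect otherwise disjoint graphs.
merged = nx.compose(primekg, other_kg)

print(merged.number_of_nodes(), merged.number_of_edges())  # 4 3
print(list(merged.successors("CLL")))                      # ['TP53', 'SOD1']
```

In practice the hard part is entity resolution: the same concept often appears under different identifiers in different sources, so nodes must be reconciled through ontologies before a naive union like this becomes meaningful.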

Goals

The ultimate vision is to create a comprehensive network of interconnected nodes and relationships. Nodes can represent entities such as institutions, grants, patents, publications, software, anatomical structures, diseases, drugs, compounds, gene targets, and many more, while relationships can signify different types of connections, such as mentions or associations. By creating this extensive web of connections, the network can be navigated and queried in a semi-automated manner to answer specific research questions, as sketched below. Moreover, the impact and reproducibility analysis services offered by SciLake can be utilized to prioritize findings.
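As a toy sketch of such semi-automated navigation (hypothetical node names again, not the SciLake services themselves):

```python
# Toy query: starting from a mutation, collect everything within two hops,
# then follow incoming "targets" edges to find drugs acting on the gene.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("TP53 mutation", "TP53", relation="variant_of")
kg.add_edge("TP53", "CLL", relation="associated_with")
kg.add_edge("drug X", "TP53", relation="targets")
kg.add_edge("paper 123", "TP53 mutation", relation="mentions")

reachable = nx.single_source_shortest_path_length(kg, "TP53 mutation", cutoff=2)
drugs = [u for u, _, d in kg.in_edges("TP53", data=True)
         if d["relation"] == "targets"]

print(sorted(reachable))  # ['CLL', 'TP53', 'TP53 mutation']
print(drugs)              # ['drug X']
```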

This approach will enhance the understanding of the molecular biology and immunopathology of Chronic Lymphocytic Leukemia. It will also assist in studying the potential effects of different mutations. The advantage of this method lies in its ability to incorporate information from multiple sources simultaneously, offering a comprehensive and insightful analysis.


Open Scholarly Communication: Insights from the EOSC Winter School 2024


Workshop


By Stefania Amodeo

Open Scholarly Communication (OSC) is an important and evolving field that is gaining increasing attention in the scientific community. During the recent EOSC Winter School 2024, SciLake representatives and other leading experts gathered to discuss various aspects of OSC, from its definition to the challenges it faces, and its future trajectory. This blog post summarizes the key takeaways from this session.

Defining Open Scholarly Communication

The session started with a detailed exploration of the definition of open scholarly communication, examining the definitions provided by entities such as SPARC, DOAJ, OpenAIRE, and MIT Libraries. Collectively, these emphasize OSC as a process in which research outputs are shared and disseminated openly and freely, without barriers, enabling the democratisation of knowledge.

The discussion highlighted that the research lifecycle (researching, writing, publishing, assessing) rests on infrastructures. Among these, OpenAIRE is a non-profit partnership of 50 organizations whose mission is to establish a permanent open scholarly communication infrastructure supporting European research. Alongside its infrastructure and services, the ongoing CRAFT-OA project aims to consolidate the Diamond Open Access publishing landscape and integrate it with EOSC and other large-scale aggregators. OPERAS is another infrastructure, spanning 27 countries, primarily covering the EU but also involving international collaborations, with the goal of becoming an ERIC by 2027.

Challenges in Open Scholarly Communication

While OSC holds great promise for the future of knowledge dissemination, it comes with challenges: the interdisciplinary nature of research, differing research methods, a variety of publication formats, the need for multilingualism, the existence of numerous open access initiatives, and fragmentation among actors. The session also highlighted limits on the volume of publications that can be processed per month and the difficulty of achieving true interoperability. Despite technical standards being in place, real interoperability requires substantial manual work and collaboration between stakeholders, and resistance to change often hinders this process.

SciLake's Role

SciLake is a comprehensive project designed to assist scientific communities in constructing their own scientific knowledge graphs (SKGs). One of its primary challenges is enabling interoperability between domain-specific research and domain-agnostic databases such as the OpenAIRE Graph. The project is in line with the EOSC interoperability framework and actively collaborates with other scientific knowledge graphs to define core model entities through the Research Data Alliance (RDA) SKG Interoperability Framework.

The Future of Open Scholarly Communication

Looking ahead, the discussions pointed to the need for a clearer understanding of EOSC's scope. There was a suggestion to consider running the Diamond model as part of EOSC. This model, which combines open access, no article processing charges (APCs), and scholarly ownership, was seen as a potential route to facilitate open scholarly communication. The future of OSC also hinges on improved interoperability: even though technical solutions and platforms exist, human resources remain a challenge. The session concluded with a proposal to form a Task Force on Open Scholarly Communication, focusing on concrete actions that require resolution and cannot be addressed within a single project.

Conclusion

The EOSC Winter School 2024 provided a valuable platform for in-depth discussions on Open Scholarly Communication, fostering an understanding of its definition, evaluating the challenges it faces, and envisioning its future. It became clear that while OSC is a powerful tool for democratizing knowledge and enhancing research impact, it requires strategic interventions to overcome challenges, particularly in terms of interoperability. As the field continues to evolve, the scholarly community looks forward to continued dialogue and progress in OSC.

Read a full recap of the event here.


Domain-Specific Machine Translation for SciLake


SciLake Technical components


By Stefania Amodeo

In a recent webinar for SciLake partners, Sokratis Sofianopoulos and Dimitris Roussis from Athena RC presented their cutting-edge Machine Translation system, which will be integrated into the Scientific Lake Service.

The presentation emphasized the significance of gathering domain-specific parallel data to train machine translation systems, focusing on the French to English, Spanish to English, and Portuguese to English language pairs. A few of the models created so far were also demonstrated.

This blog post recaps the key points from the webinar.

Challenges

Machine translation is the automated translation of text or speech from one language to another. The current state of the art uses deep learning approaches such as sequence-to-sequence models and transformers. However, challenges arise when applying machine translation to specialized domains such as medical or scientific documents, which are rich in technical terms. The solution is to collect in-domain data and use it to fine-tune existing general-domain, state-of-the-art machine translation models for the specific language pair of interest.
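As a rough sketch of what such fine-tuning looks like with the Hugging Face Transformers library (the team's actual models and training setup were not detailed in the webinar; the base checkpoint and the single training pair below are assumptions for illustration):

```python
# Minimal fine-tuning sketch with Hugging Face Transformers.
# "Helsinki-NLP/opus-mt-fr-en" is an assumed general-domain base model,
# not necessarily the one used by the team.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

base = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# A stand-in for the collected in-domain parallel corpus.
pairs = Dataset.from_dict({
    "fr": ["La leucémie lymphoïde chronique est incurable."],
    "en": ["Chronic lymphocytic leukemia is currently incurable."],
})

def tokenize(batch):
    # Encode source sentences and, via text_target, the reference translations.
    return tokenizer(batch["fr"], text_target=batch["en"],
                     truncation=True, max_length=256)

train = pairs.map(tokenize, batched=True, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="opus-mt-fr-en-indomain",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=8),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```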

Data collection

The team has focused on collecting domain-specific parallel data from the four SciLake pilot domains: Cancer Research, Energy Research, Neuroscience, and Transportation Research. Additionally, they have gathered general-purpose scientific data to ensure comprehensive coverage.

The data collection process involved downloading approximately 9.3 million records from 62 repositories, including theses, dissertations, and various scientific documents. The team meticulously parsed and classified the records, extracting metadata such as authors' names, abstracts, and titles. The result is a vast collection of parallel and monolingual sentences in English, Spanish, French, and Portuguese.

To ensure the quality of the machine translation systems, the team created benchmark test sets for each domain. These sets consist of 1000 parallel sentences each, divided into development and test sets. Additionally, a larger test set of 3000 parallel sentences was created for the general academic domain. These test sets allow for the evaluation of the fine-tuned models, as sketched below.
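A minimal sketch of such a split, assuming a 50/50 division (the webinar specifies only that 1000 sentences were divided into development and test sets):

```python
# Reproducible split of a 1000-sentence benchmark into dev and test halves.
# The placeholder pairs and the 500/500 proportion are assumptions.
import random

pairs = [(f"source {i}", f"reference {i}") for i in range(1000)]

random.seed(13)          # fixed seed so the split can be reproduced
random.shuffle(pairs)
dev, test = pairs[:500], pairs[500:]

# dev guides fine-tuning decisions; test stays held out for final scoring.
print(len(dev), len(test))  # 500 500
```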

Translation Models

To enhance the performance of the machine translation models, the team utilized a combination of in-domain data and general academic data. Since in-domain data was limited, the team incorporated as much relevant data as possible to improve the performance of the general-purpose models. For language pairs such as French to English and Spanish to English, the team employed the Transformer-Big architecture, a deep learning model with approximately 213 million parameters. For Portuguese to English, the smaller Transformer-Base architecture (around 65 million parameters) was used.

The initial results show that the current models, whether fine-tuned only on in-domain data or on a combination of in-domain and general academic data, perform reasonably well. The evaluation scores are based on two well-established machine translation metrics, BLEU and COMET, which are computed by comparing the machine-generated translations with reference translations. Notably, the French to English system reported the lowest scores, likely due to the limited amount of data available for this specific language pair.

The results also show a significant improvement in scores after fine-tuning with both in-domain and academic corpus data. On average, mixing data resulted in a gain of roughly 2.5 points: the in-domain data contributed more than 1.5 points, while the general academic data added almost an additional point.
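For reference, this is roughly how the two metrics are computed with common open-source tooling (sacrebleu and Unbabel's comet package); the checkpoints and settings actually used by the team are not specified here, and wmt22-comet-da is just one widely used choice:

```python
# Scoring a system's translations with BLEU (sacrebleu) and COMET
# (unbabel-comet). Sentences are placeholders.
import sacrebleu
from comet import download_model, load_from_checkpoint

sources    = ["La traduction automatique est difficile."]  # source text
hypotheses = ["Machine translation is hard."]              # system output
references = ["Machine translation is difficult."]         # human references

# BLEU: n-gram overlap between hypotheses and references.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMET: a learned metric that also takes the source sentence into account.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
out = comet.predict(
    [{"src": s, "mt": h, "ref": r}
     for s, h, r in zip(sources, hypotheses, references)],
    batch_size=8, gpus=0,
)
print(f"COMET: {out.system_score:.3f}")
```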


How can we improve the results?

These findings are based on the first year of working with the fine-tuning process. The team has numerous ideas for further improving the results, including exploring multi-domain adaptation, finding more data sources, and using machine translation for back-translation to generate additional data. Additionally, they plan to integrate the machine translation models with the SciLake infrastructure and collaborate with project partners to maximize the potential benefits.

Demo

The team presented a web interface for translation requests and showcased the capabilities of the platform. Currently, this system is an internal prototype used for evaluating machine translation models and experimenting with various processes, such as generating text for future post-editing, and exploring new ideas. As an example, a Firefox plugin has been developed to allow users to request webpage translation using the Spanish model. This plugin is useful for translating Spanish articles while browsing.

Conclusions

Overall, the presentation offered valuable insights into the process of fine-tuning and evaluating machine translation models for specialized domains. With ongoing research and improvements, integrating domain-specific machine translation into the SciLake infrastructure holds great potential for enhancing scientific communication and collaboration.
