AffilGood: Enhancing Scientific Attribution


SCILAKE TECHNICAL COMPONENTS

By Stefania Amodeo

In the world of scientific research, accurate attribution is paramount. However, linking scientific works to research organizations has long been a challenge, primarily due to the scarcity of openly available annotated data describing institutional affiliations. This issue is particularly pronounced when dealing with complex and multilingual affiliation strings.

Our partners at SIRIS Academic have developed AffilGood, an innovative framework designed to enhance the identification and linking of institutions from raw, multilingual strings, significantly improving affiliation metadata in Scientific Knowledge Graphs.

What is AffilGood?

AffilGood is a multifaceted framework addressing the complexities of institution name disambiguation in scientific literature. It consists of two primary components:

  • A robust collection of datasets for extracting information from raw affiliation strings
  • An entity linking module that connects organisations mentioned in affiliations or research projects to ROR (Research Organization Registry) identifiers
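To make the idea of linking a raw affiliation string to a registry identifier concrete, here is a minimal sketch. It is not AffilGood's method: it uses naive token overlap against a tiny in-memory registry, and the identifiers, the `REGISTRY` table, and the `link` function are hypothetical placeholders rather than real ROR data or any AffilGood API.

```python
import re

# Hypothetical mini-registry: placeholder identifiers, NOT real ROR IDs.
REGISTRY = {
    "placeholder-id-001": "Universitat Pompeu Fabra",
    "placeholder-id-002": "Karolinska Institutet",
    "placeholder-id-003": "Eindhoven University of Technology",
}


def tokens(text):
    """Lower-case a string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def link(affiliation, threshold=0.5):
    """Return the registry ID whose name is best covered by the affiliation.

    The score is the fraction of the registry name's tokens that also
    appear in the raw affiliation string. Returns None when no candidate
    reaches the threshold.
    """
    aff_tokens = tokens(affiliation)
    best_id, best_score = None, 0.0
    for registry_id, name in REGISTRY.items():
        name_tokens = tokens(name)
        score = len(aff_tokens & name_tokens) / len(name_tokens)
        if score > best_score:
            best_id, best_score = registry_id, score
    return best_id if best_score >= threshold else None


print(link("Dept. of Oncology, Karolinska Institutet, Stockholm, Sweden"))
print(link("Acme Corp, Springfield"))
```

A token-overlap baseline like this breaks down on exactly the cases the framework targets (noisy input, translated or abbreviated names, nested institutions), which is why a dedicated extraction and entity linking pipeline is needed.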

Tackling Complex Challenges

AffilGood excels in handling various challenging scenarios such as:

  • Processing noisy or incomplete input data
  • Managing affiliations in languages other than English or with mixed languages
  • Navigating complex affiliations involving diverse institution types (e.g., companies, universities, hospitals, research centres) at different hierarchical levels

Recent Developments

Our partners, Nicolau Duran-Silva and Pablo Accuosto, recently showcased AffilGood at the Fourth Workshop on Scholarly Document Processing. Their paper, "AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis", offers in-depth insights into the framework's capabilities and its potential impact on scientific literature analysis.

For those interested in the technical details, the full paper is available in the workshop proceedings:

Duran-Silva, N., Accuosto, P., Przybyła, P. and Saggion, H., 2024, August. AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis. In Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024) (pp. 135-144). URL: https://aclanthology.org/2024.sdp-1.13.pdf

SciLake Integration

We're excited to announce the integration of AffilGood into SciLake's "Knowledge Graph creation assistant" as an institution disambiguation pipeline. This integration will significantly enhance our ability to accurately identify and link institutional affiliations across the research landscape.

By improving the quality of affiliation metadata in scientific knowledge graphs, we're taking a major step forward in scientific literature analysis. As we refine this technology, we aim to streamline institution name disambiguation, ultimately enabling more precise and efficient attribution of scientific works.


EOSC projects advancing the research assessment reform


Horizon Europe INFRAEOSC projects

By Stefania Amodeo, Lottie Provost

SciLake and GraspOS are Horizon Europe INFRAEOSC projects recognized as key contributors to the research assessment reform by the European Commission. This joint blog post outlines each project's aims and contributions to the reform.

Towards a reform of the research assessment system in Europe

The European Commission (EC) has taken significant strides towards reforming research assessment practices, aiming to enhance the diversity, transparency, and quality of research assessment across the European Union. 

Central to the reform efforts is the European Open Science Cloud (EOSC), which provides a federated infrastructure designed to facilitate access to data and the tracking of scientific information. This infrastructure connects various research organisations, enabling them to share and access high-quality data, metadata, services, and tools, thereby enhancing research assessment practices. The EC supports these efforts through dedicated funding for Horizon Europe INFRAEOSC projects, aimed at enabling an operational, open, and FAIR (Findable, Accessible, Interoperable, and Reusable) EOSC ecosystem.

In April 2024, following its signing of the Agreement on Reforming Research Assessment (ARRA), the EC published its Action Plan to implement the ten commitments outlined in the Agreement. The plan includes a series of measures that the EC intends to carry out to further support and advance the reform.

The EC Report on Research Assessment was published in the same month, and addresses the contributions of Horizon 2020 and Horizon Europe projects to the reform of research assessment practices, following on the ARRA principles and commitments.

In both EC documents, the Horizon Europe INFRAEOSC projects GraspOS and SciLake were jointly recognised as key contributors to the research reform. This collaborative blog post by SciLake and GraspOS provides an overview of the aims of each project and presents their specific contributions to the reform in the areas recognised by the EC.

Next Generation Research Assessment to Promote Open Science

Democratising and making sense out of heterogeneous scholarly content

GraspOS and SciLake emerge as key players in the research assessment reform

The EC Report on Research Assessment identifies four clusters of activities from Horizon 2020 and Horizon Europe projects, including direct and indirect relevant contributions, from both planned and completed activities. Below, we highlight the clusters and areas where GraspOS and SciLake support the reform as indicated by the EC. 

Cluster 1: strengthen the evidence base in which to ground systemic reform and organisational change

  • The GraspOS landscape review of approaches, tools and services for responsible research assessment that addresses open science.

Cluster 2: development of frameworks and models to inform organisational and system-level practices

Cluster 3: contributing to the infrastructure and dedicated tools and services for research assessment, including automated tools.

  • SciLake is developing data storage and AI-assisted analytic services built on customised takes on scientific merit, and AI-assisted services for automated assessment of reproducibility and replication and of scientific, societal or economic impact.
  • GraspOS is working towards an open and federated infrastructure for research assessment including tools and services, in support of Open Science-aware responsible assessment approaches. 

The EC Report also highlights promising practices identified across the projects.

Promising practices and the contribution of the EOSC projects:

  • Embedding and recognising open science practices: GraspOS, SciLake
  • Catalysing and inspiring organisational change: GraspOS (provision of guidance and good practice examples); GraspOS and SciLake (provision of tools and services for use and adaptation by institutions in their benchmarking, analytics and assessment)
  • Self-reflective approach to the development of recommendations and frameworks, for example through trialling or piloting: GraspOS, SciLake
  • Assessment of impact: SciLake
  • Foster integration and synergies among different projects and also between projects and existing policy, professional or scholarly frameworks: GraspOS (commitment to EOSC integration of its open-science-aware research assessment tools)

Current progress and good practices in relation to the agreement on reforming research assessment

The promising aspects and practices identified are also considered in relation to their level of alignment with the principles and commitments of the ARRA and, more broadly, with RRI principles. Specifically, the ongoing work of the GraspOS and SciLake projects is mentioned under four principles of the reform, with explicit reference to how the project outputs under development address each principle.

Independence and transparency

The ARRA emphasises the principles of independence and transparency in data, infrastructure, and criteria for research assessment, as well as in the analysis of research impact and transparency of indicators.
SciLake aligns with these principles by developing a transparent service for calculating indicators for a variety of research products. The project will also provide open-source code, ensuring transparency in indicator calculations.

Recognising impacts

The ARRA recognises the importance of contributions that advance knowledge and the potential impacts of research results at scientific, technological and/or societal level.
SciLake addresses this principle by developing a service to manage heterogeneous scientific content, with the aim of supporting the creation of new methodologies for multi-perspective impact assessment and providing customizable strategies. Specifically, the project will create an impact-based discovery service that calculates impact indicators for research outputs such as publications, datasets, and software, in order to analyse scientific, societal, and economic impact.

Recognition of the diversity of research activities and outputs

The recognition of a wide range of research activities and practices, including peer review, PhD supervision, leadership roles, science communication, and knowledge valorization, is a central principle in the ARRA. It also includes valuing a variety of outputs such as scientific publications, data, software, algorithms, and policy contributions, and includes rewarding open science practices.
The SciLake project aims to provide a technical solution that supports the assessment of a broader range of research outputs. The project will provide tools for assessing the impact and reproducibility of diverse outputs, including datasets, workflows, and protocols.
The GraspOS project is developing an open and federated infrastructure for Open Science-aware, responsible research assessment. Its landscape report identifies current open science assessment practices, barriers, and priorities, and highlights the need for holistic recognition of research, education, and other activities.

Criteria and processes that respect the variety of disciplines, research types and career stages

In the ARRA, two principles underline the need for assessment criteria and practices that reflect the diversity of research, including the recognition of interdisciplinarity and various research roles, both within and outside academia.
Preparatory work in the GraspOS project has indicated that best practices need to be contextualised to each assessment event and that identifying universal best practices is therefore not feasible. To address this, the project aims to develop multiple assessment portfolios, including tailorable dashboard services for funders, institutions, research teams and disciplines, along with templates for qualitative and quantitative indicators.

Conclusions

By leveraging the federated infrastructure of EOSC and supporting key projects such as GraspOS and SciLake, the EC is fostering an environment where data sharing and collaboration are rewarded, ultimately driving the advancement of Open Science and enhancing the impact and credibility of research across the European Union.



SciLake's partners from Athena RC present advancements in Machine Translation at the 25th Annual Conference of The European Association for Machine Translation

Machine Translation for the Scientific Domain


Workshop

By Stefania Amodeo, Sokratis Sofianopoulos

From June 24th to 27th, 2024, Sheffield, UK, hosted the 25th Annual Conference of The European Association for Machine Translation (EAMT). This event gathered leading researchers and companies in the field of Machine Translation to discuss the latest advancements, methodologies, and challenges. Representing SciLake were our partners from Athena Research Centre: Dimitris Roussis, Sokratis Sofianopoulos, and Stelios Piperidis.

The conference covered a wide array of topics, including linguistic resources, evaluation techniques, and multilingual technologies. The audience comprised researchers and industry professionals dedicated to pushing the boundaries of Machine Translation.

SciLake participated in a poster session highlighting innovative work on translation models. The presentation included a conference paper titled "Enhancing Scientific Discourse: Machine Translation for the Scientific Domain," which outlines the development of parallel and monolingual corpora for the scientific field, focusing on language pairs like Spanish-English, French-English, and Portuguese-English. These corpora include a comprehensive general scientific corpus and four specialised corpora targeting the project’s research areas: Cancer Research, Energy Research, Neuroscience, and Transportation Research. The paper details the corpus creation process and the fine-tuning strategies used, and concludes with a thorough evaluation of the results.

This work aims to bridge the language gap in scientific research by developing general-purpose neural machine translation (NMT) models that can accurately and fluently translate scientific text across various specialized domains, ensuring that critical research is accessible to a global audience.

The draft versions of the conference proceedings are available online at https://eamt2024.sheffield.ac.uk/programme/proceedings, with the final version expected to be published in the ACL Anthology.

FOLLOW THE DEVELOPMENTS OF THE SCILAKE MACHINE TRANSLATION SYSTEM


Daan de Graaf (TU/e) at GRADES-NDA ‘24

SciLake at GRADES-NDA ’24


Workshop

By Stefania Amodeo, Daan de Graaf

SciLake recently participated in the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), held on June 14, 2024, in Santiago, Chile. This event brings together researchers from academia, industry, and government worldwide to share the latest breakthroughs in large-scale graph data management and graph analytics systems. It also provides a platform to discuss novel methods and techniques for addressing domain-specific challenges in real-world graphs.

Our SciLake partner, Daan de Graaf, had the opportunity to present an accepted article on behalf of authors Wilco van Leeuwen, George Fletcher, and Nikolay Yakovets, all from Eindhoven University of Technology (TU/e). The team showcased "HomeRun", a tool specifically designed for comparing different cardinality estimation techniques in graph databases.

For those new to the topic, cardinality in a graph database refers to the number of elements in a set, such as the number of edges connected to a node, the total number of nodes in the database, or the number of results a query returns. Accurate cardinality estimation is crucial for optimising query performance, as it helps the database plan the most efficient way to retrieve data.
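As a small illustration of why estimation accuracy matters, the sketch below compares the exact number of two-hop paths in a toy directed graph with a simple estimate that assumes every node has the average out-degree. This is not how HomeRun or any particular database works; the function names and the toy graph are assumptions made purely for illustration.

```python
from collections import defaultdict


def true_two_hop_count(edges):
    """Exact number of directed two-hop paths a -> b -> c."""
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    # Each edge (a, b) can be extended by every outgoing edge of b.
    return sum(out_degree[dst] for _, dst in edges)


def estimated_two_hop_count(edges):
    """Estimate assuming every node has the average out-degree |E| / |V|."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) ** 2 / len(nodes)


# Toy graph with one hub: three edges lead into it and three lead out.
edges = [
    ("a", "hub"), ("b", "hub"), ("c", "hub"),
    ("hub", "x"), ("hub", "y"), ("hub", "z"),
]

print("true cardinality:", true_two_hop_count(edges))
print("uniform estimate:", round(estimated_two_hop_count(edges), 2))
```

On this skewed graph the uniform assumption underestimates the real result size (about 5.14 against 9), the kind of gap that can push a query planner towards a poor execution plan.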

One of HomeRun's key features is its ability to evaluate the performance of different cardinality estimation techniques in given usage scenarios. The tool generates visualisations automatically, helping users understand the trade-offs between various techniques. This tool is particularly useful for database developers when they face performance issues, like long-running queries, with specific query and dataset combinations.

In SciLake, HomeRun is being used to optimise the database system performance in the context of the WP2 Data Lake Search and Navigation.

For more information about HomeRun, you can refer to the paper:

  • Wilco van Leeuwen, George Fletcher, and Nikolay Yakovets. 2024. HomeRun: A Cardinality Estimation Advisor for Graph Databases. In Proceedings of the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) (GRADES-NDA '24). Association for Computing Machinery, New York, NY, USA, Article 6, 1–9. https://doi.org/10.1145/3661304.3661902 


Discover SciLake Cancer Pilot

SciLake Pilots

The SciLake Cancer Knowledge Graph

SciLake is in full swing with its pilot programs in the fields of neuroscience, cancer research, transportation, and energy. These initiatives aim to create or enrich domain-specific Scientific Knowledge Graphs that capture valuable knowledge from each scientific field.

The SciLake Cancer Pilot is developing a first-of-its-kind cancer knowledge graph, with the aim to make public resources in biology and cancer more accessible to the research community.

The case study in focus is Chronic Lymphocytic Leukemia (CLL), the most prevalent adult leukemia. The cancer knowledge graph will assist in discovering essential biomarkers for personalised treatment and care, a critical step towards achieving precision medicine.

Leading the pilot are researchers from the Centre for Research and Technology in Greece and the Karolinska Institutet in Sweden.
