
Juho Inkinen

juhoinkinen

AI & ML interests

NLP



Posts 3

Post


We (@osma, @MonaLehtinen & me, i.e. the Annif team at the National Library of Finland) recently took part in the LLMs4Subjects challenge at the SemEval-2025 workshop. The task was to use large language models (LLMs) to generate good-quality subject indexing for bibliographic records, i.e. titles and abstracts.

We are glad to report that our system performed well; it was ranked

🥇 1st in the category where the full vocabulary was used
🥈 2nd in the smaller vocabulary category
🏅 4th in the qualitative evaluations.

Fourteen participating teams developed their own solutions for generating subject headings, and the output of each system was assessed using both quantitative and qualitative evaluations. Research papers about most of the systems are going to be published around the time of the workshop in late July, and many preprints are already available.

We applied Annif together with several LLMs, which we used to preprocess the data sets: we translated the GND vocabulary terms into English, translated bibliographic records into English and German as required, and generated additional synthetic training data. After the preprocessing, we used the traditional machine learning algorithms in Annif as well as the experimental XTransformer algorithm, which is based on language models. We also combined the subject suggestions generated from English- and German-language records in a novel way.
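To give a rough idea of what combining suggestions across languages can mean, here is a minimal sketch. It is only an illustration and not the method described in the preprint: it assumes each language yields a plain mapping from subject identifier to score, and it merges them with a weighted mean, treating a subject missing from one language as scoring 0 there. The function name `merge_suggestions`, the `weights` parameter and the subject identifiers are all hypothetical.

```python
from collections import defaultdict


def merge_suggestions(per_language, weights=None, limit=5):
    """Merge per-language {subject_id: score} dicts with a weighted mean.

    Illustrative assumption only: a subject absent from one language
    contributes 0 for that language. Not the approach from the preprint.
    """
    if weights is None:
        # Default: every language counts equally.
        weights = {lang: 1.0 for lang in per_language}
    total_weight = sum(weights[lang] for lang in per_language)

    merged = defaultdict(float)
    for lang, suggestions in per_language.items():
        for subject, score in suggestions.items():
            merged[subject] += weights[lang] * score / total_weight

    # Highest score first; ties broken by subject id for stable output.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical scores for one record, suggested from its English and German versions.
english = {"subject-1": 0.8, "subject-2": 0.4}
german = {"subject-1": 0.6, "subject-3": 0.5}

for subject, score in merge_suggestions({"en": english, "de": german}):
    print(f"{subject}\t{score:.2f}")
```

A subject suggested in both languages ends up ranked above subjects suggested in only one, which is the basic intuition behind merging the two result lists.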

More information can be found in our system description preprint: Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs (https://huggingface.co/papers/2504.19675)

See also the task description preprint: SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog (https://huggingface.co/papers/2504.07199)

The Annif models trained for this task are available here: https://huggingface.co/NatLibFi/Annif-LLMs4Subjects-data

Post


Annif is a subject indexing toolkit developed by the National Library of Finland: https://github.com/NatLibFi/Annif

Last November we organized a survey for Annif users, and now the results have been published: https://www.doria.fi/bitstream/handle/10024/190930/Annif%20Users%20Survey.pdf

The report includes an overview of:
• The vocabularies and datasets that are used with Annif
• The workflows that Annif is integrated with
• The problems Annif users are facing

The report also shows the average ratings that users gave to various aspects and features of Annif. In short, on a scale from 1 to 5, the ratings are:
• Overall: 4.4
• Features and functions: 4.1
• Documentation: 4.5
• Smoothness of initial setup: 4.2
• Usability: 4.4
• Achieved quality of subject suggestions: 3.6

The survey also gathered users' views on improvements and new features, which are briefly discussed in the report.
