Resources

HABE-HiTZ C1 Dataset: A collection of essays, scores, and feedback in Basque.

The HABE-HiTZ C1 dataset is the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. It comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion-specific scores for correctness, richness, coherence, cohesion, and task alignment, along with detailed feedback and error examples.

dataset: https://huggingface.co/datasets/EkhiAzur/HABE-HiTZ_C1_Dataset
paper: Automatic Essay Scoring and Feedback Generation in Basque Language Learning
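
As a minimal sketch of how the dataset might be fetched from the link above, the snippet below derives the repository id from the dataset URL and passes it to `load_dataset` from the Hugging Face `datasets` library. The helper names `repo_id_from_url` and `load_habe_hitz_c1` are hypothetical, introduced only for this illustration. The sketch assumes the repository is publicly accessible and loads without an explicit configuration name; split and column names are not stated here, so none are assumed.

```python
from urllib.parse import urlparse

# Dataset URL as listed above.
DATASET_URL = "https://huggingface.co/datasets/EkhiAzur/HABE-HiTZ_C1_Dataset"


def repo_id_from_url(url: str) -> str:
    """Extract the '<owner>/<name>' repository id from a Hugging Face dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_habe_hitz_c1():
    """Download the dataset (hypothetical helper).

    Requires network access and `pip install datasets`. Assumes the repository
    can be loaded without a configuration name or an access token.
    """
    # Imported lazily so that repo_id_from_url works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(DATASET_URL))
```

Calling `print(load_habe_hitz_c1())` should list whatever splits and columns the repository actually provides, which is a safer starting point than guessing field names for the scores and feedback.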

MATE: Cross-modal Entity Alignment Dataset

The MATE dataset is designed as a benchmark for evaluating the cross-modal entity-linking capabilities of current vision-language models (VLMs). It shows that, while human performance remains consistently high, VLM performance degrades significantly as scene complexity increases, highlighting the challenges that cross-modal entity linking poses for current models.

github: https://github.com/hitz-zentroa/MATE
paper: Vision-Language Models Struggle to Align Entities across Modalities

vSTS: Visual Semantic Textual Similarity

The vSTS dataset aims to become a standard benchmark for testing how much visual information contributes when evaluating sentence similarity and the quality of multimodal representations. It makes it possible to test whether visual and textual information complement each other for improved language understanding.

website: https://oierldl.github.io/vsts
github: https://github.com/oierldl/vsts
paper: Evaluating Multimodal Representations on Visual Semantic Textual Similarity

Older resources

Sensecorpus, a corpus of examples from the web for all nouns in WordNet 1.6. The senses can be easily mapped to other WordNet versions here. (A smaller subset, used in our EMNLP 2004 paper, is available here.)
Topic signatures for all nominal senses in WordNet
Sense Clustering data for WN 1.6 (RANLP 2003 paper)

Oier Lopez de Lacalle
