MediaEval 2026 Benchmarking Initiative
Centre for Language Studies, Radboud University Nijmegen
Martial Pastor — martial.pastor [at] ru.nl · Nelleke Oostdijk — nelleke.oostdijk [at] ru.nl
The goal of this task is to develop AI models capable of detecting and reconstructing implicit arguments — also called enthymemes — in political tweets. The dataset contains tweets annotated for the presence of implicit premises or conclusions, along with full argument reconstructions provided by multiple annotators.
The first data sample will be released on 1 March 2025. Participants are invited to complete two tasks. Task 1 may be completed on its own, but completion of Task 2 requires prior completion of Task 1.
Given a tweet, determine whether it contains an implicit premise, an implicit conclusion, or neither. This is a three-class classification task.
| Input | Output |
|---|---|
| The raw text of a tweet. | One label: implicit_premise, implicit_conclusion, or none. |
An implicit premise is a supporting assumption left unstated that the argument relies on. An implicit conclusion is a claim that follows from the stated premises but is never explicitly made. When neither component is missing, the label is none.
Tweets in the train and dev sets are each annotated by five independent annotators; those in the test set by three. Individual annotator labels — prior to any majority vote — are provided alongside the data, making it possible to treat disagreement as signal rather than noise.
In the first setting, the label must be predicted from the tweet text alone; no external data or additional annotation information is permitted.
In the second setting, participants may use the raw labels provided by three independent annotators in addition to the tweet text. The goal is to investigate whether modelling annotator disagreement improves performance, especially on borderline cases. The output is the same three-class prediction.
| Additional Input | Goal |
|---|---|
| Three individual annotator labels per tweet (before majority vote). | Leverage disagreement as a signal to improve the three-class prediction. |
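For instance, the annotator labels can be turned into a hard majority-vote target or a soft label distribution. A minimal sketch (the label order and the tie-breaking rule are our own assumptions, not part of the task definition):

```python
from collections import Counter

LABELS = ["implicit_premise", "implicit_conclusion", "none"]

def majority_label(votes):
    """Hard target: the most frequent label among the annotator votes.
    Ties are broken by the (assumed) order in LABELS."""
    counts = Counter(votes)
    return max(LABELS, key=lambda lab: (counts[lab], -LABELS.index(lab)))

def soft_label(votes):
    """Soft target: the empirical label distribution over the votes,
    usable e.g. as the target of a KL-divergence training loss."""
    counts = Counter(votes)
    return [counts[lab] / len(votes) for lab in LABELS]
```

A model trained on soft labels sees borderline tweets as genuinely mixed targets rather than forced single classes, which is one way to treat disagreement as signal.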
Any external data sources, pre-trained models, or additional resources may be used. Participants must document all external resources in their working-notes paper.
Performance is measured with macro F1-score. Evaluation is conducted in two modes:

- Three-class mode: implicit_premise, implicit_conclusion, and none are scored as separate classes.
- Binary mode: implicit_premise and implicit_conclusion are collapsed into a single implicit label, reducing the task to a binary distinction between tweets that contain any implicit argument component and those that do not. This mode rewards the ability to detect the presence of an enthymeme regardless of its structural role.
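Both evaluation modes can be reproduced in a few lines. A sketch in pure Python, assuming gold and predicted labels come as parallel lists (the official scorer may differ in edge-case conventions):

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def collapse(label):
    """Map both implicit classes onto the single 'implicit' label."""
    return "none" if label == "none" else "implicit"

THREE = ["implicit_premise", "implicit_conclusion", "none"]

gold = ["implicit_premise", "implicit_conclusion", "none"]
pred = ["implicit_conclusion", "implicit_conclusion", "none"]

three_class = macro_f1(gold, pred, THREE)
binary = macro_f1([collapse(g) for g in gold],
                  [collapse(p) for p in pred],
                  ["implicit", "none"])
```

Note how a system that confuses the two implicit classes is penalised in three-class mode but can still score perfectly in binary mode.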
For each tweet classified as containing an implicit argument, generate the text of the missing proposition. Task 2 requires prior completion of Task 1, as the predicted label is part of the input.
| Input | Output |
|---|---|
| Tweet text + Task 1 label (implicit_premise or implicit_conclusion). | A single natural-language sentence expressing the missing proposition. |
The generated sentence should be concise and declarative — it should make the unstated assumption or conclusion fully explicit, as if completing the argument. See the example below.
In this example, the system should output: "Controlled immigration is desirable."
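One natural Task 2 baseline is to prompt a text-generation model. The sketch below only shows how such an input might be assembled; build_prompt and the prompt wording are hypothetical, not part of the task:

```python
def build_prompt(tweet, task1_label):
    """Assemble an instruction for a hypothetical generation model."""
    missing = "premise" if task1_label == "implicit_premise" else "conclusion"
    return (
        f"The following tweet makes an argument with an unstated {missing}.\n"
        f"Tweet: {tweet}\n"
        f"State the missing {missing} as one concise, declarative sentence."
    )
```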
Task 2 systems are evaluated specifically on their ability to reconstruct the
implicit premise — the unstated supporting assumption the argument relies on.
The gold-standard reference for each tweet is the implicit premise provided by the annotators
in the dataset. In the annotation files, the implicit premise is identifiable as the
propositional content marked with (implicit), which annotators write as the
text of the missing proposition followed by that tag (e.g. "Vaccines cause harm to the
body (implicit).").
Generated propositions are evaluated in two ways. First, BERTScore is used to compare system outputs against the annotator-provided implicit premises. Second, a sampled subset of the test set is manually assessed by experienced annotators, who judge whether the generated proposition correctly captures the implicit assumption underlying the argument.
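The official automatic metric is BERTScore (available via the bert-score package), which compares contextual embeddings. For quick local sanity checks a much cruder lexical proxy can be useful; the token-overlap F1 below is our own stand-in, not the official metric:

```python
from collections import Counter

def token_f1(candidate, reference):
    """Token-overlap F1 between a generated proposition and the gold one."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    prec = overlap / len(cand)
    rec = overlap / len(ref)
    return 2 * prec * rec / (prec + rec)
```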
Enthymemes — arguments with missing components — represent a fundamental challenge in understanding persuasive discourse. These implicit arguments are particularly prevalent in social media contexts, where they serve as powerful means of persuasion. By leaving key premises or conclusions unstated, enthymemes lead readers to perceive the implicit content as their own reasoning, making them especially effective rhetorical devices.
The significance of detecting and reconstructing enthymemes extends beyond theoretical interest in argumentation theory. Enthymemes facilitate deceptive argumentation and manipulation, and help in spreading disinformation. Understanding how implicit premises operate in controversial political discourse is therefore crucial for developing tools to combat misinformation and promote critical thinking.
The task is inherently interpretative, involving natural language inference and semantic interpretation where high human disagreement is common. Our approach explicitly acknowledges this by employing multiple independent annotators per instance, enabling us to treat human label variation as signal rather than noise.
The dataset consists of tweets annotated for the presence of enthymemes. For each enthymeme, annotators also propose a reconstruction of the implicit and explicit propositional content and argument structure. The tweets are a subset of the tropes dataset by Flaccavento et al. (2025), selected to balance two topics: immigration in the UK and the COVID-19 vaccine.
Train and dev sets are annotated by five annotators each; the test set by three. The data is released in three stages as outlined in the timeline above.
This task is relevant to anyone working in text analysis, including researchers in natural language processing, argument mining, computational linguistics, misinformation detection, and social media analysis, in both academic and industrial settings.
We especially welcome interdisciplinary teams from argumentation theory, philosophy, rhetoric, communication studies, political science, and computational social science. Explicit structural modelling, linguistic feature-based approaches, and rule-based systems of all sorts are encouraged.
In addition to conventional benchmarking papers, participants are invited to submit "Quest for Insight" papers addressing a research question aimed at gaining deeper understanding of implicit argumentation. Full task results are not required for these papers. Example questions include:
- How do annotators disagree when choosing between implicit_premise, implicit_conclusion, and none?
- What linguistic patterns characterise the propositional content marked (implicit) in the data?