BELMA
Mexican Legal AI Evaluation Benchmark
Open call · 2026 · v0.1

An open, reproducible and public standard for evaluating AI systems applied to Mexican law.


Abstract

BELMA is an open initiative to build the first Mexican legal dataset annotated by the legal community. It convenes professors, researchers, practicing lawyers, in-house legal teams and students to define methodology, curate the corpus and annotate the tasks. The dataset, methodology, evaluation code and results will be public.

i.
Proposal

What is BELMA?

Four elements that operate together.
01

Open dataset

Mexican legal tasks annotated by legal professionals. Public access for research and evaluation.

02

Reproducible methodology

Rubrics and criteria documented so any team can run the benchmark and verify results independently.

03

Legal diversity

Multiple areas of Mexican law and a mix of tasks that prevents the benchmark from favoring any specific system.

04

Independent governance

Decisions are made by a Technical Committee drawn from academia, the bar and legal practice. Temis holds neither majority nor veto.

ii.
Motivation

Motivation

The problem of measuring AI in Mexican law.

Each provider presents its own metrics. Each firm tests in its own way. Buyers lack a standard to distinguish tools that work from those that simply communicate well.

“A public dataset, annotated by Mexican legal professionals, against which any system can be evaluated transparently and reproducibly.”

Academia lacks a common basis for studying language-model behavior on legal tasks in Mexican legal Spanish.

BELMA fills that gap: a public dataset, annotated by Mexican legal professionals, with methodology validated by an independent committee.

iii.
Trajectory

Project phases

Working trajectory.
I

Call and committee

Open registration. Formation of the Technical Advisory Committee with plural representation.

In progress
II

Methodology

Task taxonomy, annotation schema, evaluation rubrics and dataset access policy. Public comment period before close.

Upcoming
III

Construction & annotation

Corpus construction from public sources. Distributed annotation with double review and adjudication; inter-annotator agreement reported.

Pending
IV

Validation & release

Reference-model evaluation. Release of dataset, technical paper and open leaderboard.

Pending
iv.
Principles

Principles

Three commitments that govern the initiative.
i.

Neutrality.

Methodological decisions, task selection and dataset curation are the Committee's responsibility. Temis holds neither majority nor veto.

ii.

Transparency.

Dataset, methodology, code and results are public. Temis publishes its own results regardless of leaderboard position.

iii.

Methodological rigor.

Double review, adjudication and inter-annotator agreement reporting. External review before the 1.0 release.
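To make the agreement reporting concrete: one common statistic for double-reviewed items is Cohen's kappa, which measures how much two annotators agree beyond what chance alone would produce. This is an illustrative sketch only; the metric and thresholds BELMA actually adopts are for the Committee to define.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected is the chance agreement implied by each annotator's
    label distribution.
    """
    assert len(a) == len(b) and a, "need two equal-length, non-empty label lists"
    n = len(a)
    # Proportion of items where both annotators assigned the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:  # degenerate case: a single shared label
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical double-review labels for four annotated items.
ann1 = ["valid", "invalid", "valid", "valid"]
ann2 = ["valid", "invalid", "invalid", "valid"]
print(cohen_kappa(ann1, ann2))  # → 0.5
```

Under a double-review scheme like the one described above, items whose agreement falls below an agreed threshold would be the ones routed to adjudication.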

v.
Call

Who we're looking for

Four complementary profiles.
01

Academia & research

Professors, researchers and graduate students in law or computer science interested in evaluation methodology.

02

Firms & litigators

Lawyers with active practice in any area of Mexican law, contributing professional judgment on task difficulty and relevance.

03

In-house teams

Corporate legal departments that evaluate or use legal AI tools, bringing perspective on real-world utility criteria.

04

Students

Late-stage law students interested in research, legal tech or evaluation methodology.

vi.
Commitment

What you get

  • 01 Authorship on the technical paper accompanying the benchmark release, for lead annotators and committee members.
  • 02 Public record of your participation on the project site and in derivative publications, with institution and role.
  • 03 Verifiable participation certificate.
  • 04 No specific time commitment; the committee defines scope per phase.

Join the call.

Free registration, open to individuals and organizations at belma.org.mx.


Conflict-of-interest disclosure

BELMA is driven by Temis AI, Inc., which provides infrastructure and technical secretariat services during the bootstrap phase. Methodological decisions, task selection and dataset curation are the sole responsibility of the Technical Committee, in which Temis holds neither majority nor veto. The dataset, methodology, evaluation code and results, including Temis's, are public.