Description
The success of clinical trials depends on recruiting patients who match strict inclusion criteria. Developing effective patient-to-clinical-trial matching systems in turn depends on benchmark datasets that support systematic evaluation. Apart from the English-language resources created by the TREC Clinical Trials tracks (2021-23), very few corpora exist for other languages and cross-lingual settings, even though the need for automatic support of clinical trial recruitment is global. To address this gap, we combine machine translation with medical expert annotation to construct CTcl (Clinical Trials Cross Lingual retrieval), a cross-lingual evaluation benchmark for patient-clinical trial retrieval in seven languages. We benchmark the cross-lingual retrieval task using 14 large language (embedding) models, and we showcase how our dataset can be used to evaluate the models' cross-lingual capability for languages with varying resource availability.
This repository contains topic files translated into the target languages. For evaluation resources (human judgments), code examples, and a quickstart guide, please refer to our GitHub repository. CTcl re-uses the English-language document corpus (of clinical trials) from TREC CT 2021. It can be downloaded directly from the TREC CT 2021 web page (https://www.trec-cds.org/2021.html), or accessed via the ir_datasets Python package.
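For readers accessing the corpus via ir_datasets, the sketch below shows one way to iterate over the clinical-trial documents. The dataset identifier `clinicaltrials/2021/trec-ct-2021` and the `doc_id`/`title` fields are assumptions based on the ir_datasets catalogue; verify them against the package documentation before use.

```python
# Sketch: reading the TREC CT 2021 corpus with ir_datasets.
# Assumed dataset identifier (check https://ir-datasets.com/):
DATASET_ID = "clinicaltrials/2021/trec-ct-2021"


def first_trials(n=3):
    """Yield (doc_id, title) for the first n clinical-trial documents.

    ir_datasets downloads and caches the corpus on first access,
    so this requires network connectivity the first time it runs.
    """
    import ir_datasets  # third-party: pip install ir_datasets

    dataset = ir_datasets.load(DATASET_ID)
    for i, doc in enumerate(dataset.docs_iter()):
        if i >= n:
            break
        # Field names assumed from the ir_datasets clinical-trials doc type.
        yield doc.doc_id, doc.title


if __name__ == "__main__":
    for doc_id, title in first_trials():
        print(doc_id, title[:60])
```

The same `dataset` object also exposes `queries_iter()` and `qrels_iter()` for the English TREC CT 2021 topics and judgments; the CTcl translated topic files in this repository replace the queries for cross-lingual runs.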
| Date made available | 6 Feb 2026 |
|---|---|
| Publisher | Zenodo |