ir

1. Datasets

Table 1: BEIR datasets

Dataset	Website	BEIR-Name	Queries	Corpus	Rel D/Q	Download
MSMARCO	Homepage	`msmarco`	6,980	8.84M	1.1	Link
TREC-COVID	Homepage	`trec-covid`	50	171K	493.5	Link
NFCorpus	Homepage	`nfcorpus`	323	3.6K	38.2	Link
NQ	Homepage	`nq`	3,452	2.68M	1.2	Link
HotpotQA	Homepage	`hotpotqa`	7,405	5.23M	2.0	Link
FiQA-2018	Homepage	`fiqa`	648	57K	2.6	Link
ArguAna	Homepage	`arguana`	1,406	8.67K	1.0	Link
Touche-2020	Homepage	`webis-touche2020`	49	382K	19.0	Link
CQADupstack	Homepage	`cqadupstack`	13,145	457K	1.4	Link
Quora	Homepage	`quora`	10,000	523K	1.6	Link
DBPedia	Homepage	`dbpedia-entity`	400	4.63M	38.2	Link
SCIDOCS	Homepage	`scidocs`	1,000	25K	4.9	Link
FEVER	Homepage	`fever`	6,666	5.42M	1.2	Link
Climate-FEVER	Homepage	`climate-fever`	1,535	5.42M	3.0	Link
SciFact	Homepage	`scifact`	300	5K	1.1	Link

Dataset directory layout:

BEIR-Name/ 
├── qrels/ 
│ └── test.tsv  #  query-id    corpus-id   score
├── corpus.jsonl  # {"_id": , "title": ,"text": , "metadata": } 
└── queries.jsonl  # {"_id": , "text": , "metadata": }
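The layout above can be read with the standard library alone. The following is a minimal sketch, not the loader that contriever or BEIR actually uses: the function names `load_jsonl`, `load_qrels` and `load_beir_dir` are hypothetical, and the code assumes that `test.tsv` starts with a header row and stores integer scores.

```python
import csv
import json
from pathlib import Path


def load_jsonl(path):
    """Read a .jsonl file into a dict keyed by each record's "_id"."""
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            records[record["_id"]] = record
    return records


def load_qrels(path):
    """Read qrels as {query_id: {corpus_id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if len(row) < 3:
                continue  # skip malformed rows
            query_id, corpus_id, score = row[0], row[1], int(row[2])
            qrels.setdefault(query_id, {})[corpus_id] = score
    return qrels


def load_beir_dir(root):
    """Load corpus, queries and test qrels from a BEIR-Name/ directory."""
    root = Path(root)
    corpus = load_jsonl(root / "corpus.jsonl")
    queries = load_jsonl(root / "queries.jsonl")
    qrels = load_qrels(root / "qrels" / "test.tsv")
    return corpus, queries, qrels
```

Calling `load_beir_dir("scifact")` on an extracted dataset would return the three dictionaries; the `qrels` dictionary has the same `{query_id: {corpus_id: score}}` shape that the metric code in this post expects.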

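The datasets can be downloaded from the links above. If a dataset has not been downloaded yet, the command below downloads and extracts it automatically. In that case --dataset must be followed by the bare BEIR-Name, not data/BEIR-Name, otherwise an error is raised.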

python eval_beir.py \
    --model_name_or_path facebook/contriever \
    --dataset BEIR-Name  \
    --per_gpu_batch_size 768
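To run the same command over several datasets, the argument list can be built programmatically. This is a small sketch: `build_eval_command` is a hypothetical helper, the flags are exactly the three shown above, and the dataset names come from the BEIR-Name column of Table 1.

```python
import shlex


def build_eval_command(dataset, model="facebook/contriever", batch_size=768):
    """Return the eval_beir.py invocation for one dataset as an argument list."""
    return [
        "python", "eval_beir.py",
        "--model_name_or_path", model,
        "--dataset", dataset,  # bare BEIR-Name, so that auto-download works
        "--per_gpu_batch_size", str(batch_size),
    ]


# Print one command line per dataset; nothing is executed here.
for name in ["nfcorpus", "scifact", "arguana"]:
    print(shlex.join(build_eval_command(name)))
```

To actually launch a run, the returned list could be passed to `subprocess.run(cmd, check=True)` from the directory that contains eval_beir.py.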

Contriever provides several pretrained models, but only the inference results of facebook/contriever are needed here.

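2. Evaluation metrics

Evaluation metrics for information retrieval include Recall, Accuracy, Precision, MAP, MRR, and NDCG. Contriever focuses on Recall and NDCG, computed over the top-k results, i.e. Recall@k and NDCG@k.

Given a set of queries \mathcal{Q}=\{q_1, q_2, \ldots, q_n\} and a set of documents \mathcal{D}=\{d_1, d_2, \ldots, d_m\}, the model computes, for each query q, a relevance score against every document in \mathcal{D}: s(q,\mathcal{D})=\{ s(q, d_1), s(q, d_2), \ldots, s(q, d_m) \}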
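As an illustration of s(q,\mathcal{D}) and of picking the top k documents from it, here is a minimal sketch. It assumes that the score is a dot product between query and document embeddings, which this post does not state; the vectors are made up, and `score_documents` and `top_k` are hypothetical helper names.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_documents(query_vec, doc_vecs):
    """Return {doc_id: s(q, d)} using dot-product similarity (an assumption)."""
    return {doc_id: dot(query_vec, vec) for doc_id, vec in doc_vecs.items()}


def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy 2-dimensional embeddings, made up for illustration only.
query_vec = [1.0, 0.5]
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.2, 0.8], "d3": [-0.5, 0.3]}

scores = score_documents(query_vec, doc_vecs)
print(top_k(scores, k=2))
```

With these toy vectors d1 scores highest and d2 second, so the top-2 result contains d1 followed by d2.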

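Recall@k:

Definition: for a given query q, the k highest-scoring documents in s(q,\mathcal{D}) are taken as the retrieval result. \text{Recall@}k is the ratio of the number of retrieved documents that are relevant to q to the number of documents in \mathcal{D} that are relevant to q.
Let \# \text{ rel}_{q} be the number of documents in \mathcal{D} that are relevant to q, and \# \text{ retr}_{q,k} the number of documents in the retrieval result that are actually relevant to q. Then: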

\text{Recall@}k=\frac{\# \text{ retr}_{q,k}}{\# \text{ rel}_{q}}

Since a whole set of queries \mathcal{Q} is evaluated, the per-query \text{Recall@}k values are summed and averaged, which gives the final formula:

\text{Recall@}k=\frac{1}{|\mathcal{Q}|}\sum_{q=1}^{|\mathcal{Q}|}\frac{\# \text{ retr}_{q,k}}{\# \text{ rel}_{q}}
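
As a worked example with made-up numbers, suppose \mathcal{Q} holds two queries and k=10. Query q_1 has 4 relevant documents in \mathcal{D}, 2 of which appear in its top 10; query q_2 has 1 relevant document, and it appears in its top 10. Then:

\text{Recall@}10=\frac{1}{2}\left(\frac{2}{4}+\frac{1}{1}\right)=0.75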

Code example:

def recall_k(qrels, results, k=100):
    """Compute Recall@k: the fraction of all relevant documents found in the top-k retrieved results.

    Args:
        qrels (dict): Query-document relevance labels.
                      Format: {query_id: {corpus_id: score}, ...}
                      In contriever, loaded by GenericDataLoader from qrels/test.tsv.

        results (dict): Query-retrieval results.
                        Format: {qid: {corpus_id: score}, ...}
                        In contriever, returned by beir.retrieval.evaluation.EvaluateRetrieval.retrieve(corpus, queries).

        k (int, optional): Number of top retrieved results to consider. Defaults to 100.

    Returns:
        float: Mean Recall@k over all queries.
    """
    total_recall, num_queries = 0.0, 0

    for qid, rels in qrels.items():
        if qid not in results:
            continue

        true_relevant_docs = set(
            docid for docid, score in rels.items() if score > 0
        )
        if not true_relevant_docs:
            continue

        pred_docs_at_k = sorted(
            results[qid].items(), 
            key=lambda item: item[1],  # sort by the query-document score, descending
            reverse=True
        )[:k]
        pred_docs_set = set(docid for docid, _ in pred_docs_at_k)

        hit = pred_docs_set.intersection(true_relevant_docs)
        recall = len(hit) / len(true_relevant_docs)

        total_recall += recall
        num_queries += 1

    return round(total_recall / num_queries, 5) if num_queries > 0 else 0.0

NDCG@k: short for Normalized Discounted Cumulative Gain, defined as:

\text{NDCG@}k=\frac{1}{|\mathcal{Q}|}\sum_{q=1}^{|\mathcal{Q}|}\frac{\text{DCG}_q\text{@}k}{\text{IDCG}_q\text{@}k}

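Gain: given a query q and a document set \mathcal{D}, each document d_i in \mathcal{D} has a ground-truth relevance score with respect to q, called its gain. This set of relevance scores is usually written rel=[gain_1, gain_2, \ldots, gain_m].
CG (Cumulative Gain): the sum of the relevance scores of the retrieved documents, ignoring their order. When k is given, the k highest-scoring documents in s(q,\mathcal{D}) are taken as the retrieval result and only their relevance scores are summed. Here rel_i denotes the relevance score between query q and the i-th retrieved document.

\text{CG@}k=\sum_{i=1}^{k}{rel_{i}}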

DCG (Discounted Cumulative Gain): a refinement of CG that accounts for the ranking order by introducing a position discount \frac{1}{\log_2(i+1)}. For a given k:

\text{DCG@} k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i+1)}

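IDCG (Ideal Discounted Cumulative Gain): the DCG of the ideal retrieval result, i.e. the ordering obtained by sorting rel in descending order. For a given k, the first k gains of the sorted rel are used, with the same formula as DCG.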
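The definitions of DCG, IDCG and their ratio can be checked on a single query with a few lines of arithmetic. This is a small sketch with hypothetical graded gains, not taken from any BEIR dataset.

```python
import math


def dcg(gains):
    """DCG of a ranked list of gains; 1-based rank i is discounted by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


# Hypothetical gains of the top-4 retrieved documents, in ranked order.
retrieved_gains = [3, 2, 0, 1]
# All known gains for this query (here the same four documents).
all_gains = [3, 2, 0, 1]
k = 4

dcg_value = dcg(retrieved_gains[:k])
idcg_value = dcg(sorted(all_gains, reverse=True)[:k])  # ideal order: [3, 2, 1, 0]
ndcg_value = dcg_value / idcg_value if idcg_value > 0 else 0.0

print(f"DCG@{k}={dcg_value:.4f}  IDCG@{k}={idcg_value:.4f}  NDCG@{k}={ndcg_value:.4f}")
```

The retrieved list differs from the ideal one only in the order of the last two documents, so the NDCG stays close to 1 (about 0.985).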

Code example:

import math

def dcg_at_k(scores):
    # Discount the gain at 1-based rank i by log2(i + 1), then sum.
    return sum(sc / math.log2(i + 1) for i, sc in enumerate(scores, start=1))

def ndcg_k(qrels, results, k=10):
    total_ndcg, num_queries = 0.0, 0

    for qid, rels in qrels.items():
        if qid not in results:
            continue

        pred_docs_at_k = sorted(
            results[qid].items(),
            key=lambda x: (x[1], x[0]),  # key point: also sort on x[0] (the doc_id) to break score ties, otherwise results are inconsistent
            reverse=True
        )[:k]
        pred_scores = [rels.get(docid, 0) for docid, _ in pred_docs_at_k]

        ideal_scores = sorted(rels.values(), reverse=True)[:k]
  
        dcg = dcg_at_k(pred_scores)
        idcg = dcg_at_k(ideal_scores)

        ndcg = dcg / idcg if idcg > 0 else 0
        total_ndcg += ndcg
        num_queries += 1

    return round(total_ndcg / num_queries, 5) if num_queries > 0 else 0.0

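Note: s(q,\mathcal{D}) may contain tied scores whose documents carry different relevance labels, so several rankings are possible and the DCG can differ between them. The NDCG@k implementation therefore uses a small trick: doc_id is included in the descending sort key so that tied documents always come out in the same order. Only then does the result match the official one.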
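The effect of ties can be reproduced on a toy case. The sketch below uses hypothetical document ids and scores: two documents share the same retrieval score, and only one of them is labelled relevant.

```python
import math


def dcg(gains):
    """DCG of a ranked list of gains; 1-based rank i is discounted by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


# Hypothetical data: identical retrieval scores, different relevance labels.
rels = {"doc_a": 0, "doc_b": 1}
results = {"doc_a": 0.8, "doc_b": 0.8}

# Two equally valid rankings if only the score is used as the sort key.
order_1 = ["doc_a", "doc_b"]
order_2 = ["doc_b", "doc_a"]
print(dcg([rels[d] for d in order_1]))  # relevant document at rank 2
print(dcg([rels[d] for d in order_2]))  # relevant document at rank 1

# Sorting on (score, doc_id) removes the ambiguity: ties are broken by doc_id.
ranked = sorted(results.items(), key=lambda x: (x[1], x[0]), reverse=True)
print([doc_id for doc_id, _ in ranked])
```

The two orderings give different DCG values, while the (score, doc_id) key always puts doc_b first, so repeated runs yield the same number.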

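3. Experimental results

Contriever provides several pretrained models; the weights used here are facebook/contriever, obtained by unsupervised pretraining on CCNet and English Wikipedia.

Table 2: Experimental results (RTX 3090 24 GB, batch_size=768)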

Dataset information					Results
Task	Domain	Dataset	Queries	Corpus	nDCG@10	Recall@100	Time
Bio-Medical Information Retrieval (IR)	Bio-Medical	TREC-COVID	50	171K	27.4	3.7	~12m47s
Bio-Medical Information Retrieval (IR)	Bio-Medical	NFCorpus	323	3.6K	31.73	29.41	~23s
Question Answering (QA)	Wikipedia	NQ	3,452	2.68M	25.4	77.1	~1h46m
Question Answering (QA)	Finance	FiQA-2018	648	57K	24.50	56.19	~3m54s
Argument Retrieval	Misc.	ArguAna	1,406	8.67K	37.90	90.11	~40s
Duplicate-Question Retrieval	Quora	Quora	10,000	523K	83.49	98.71	~6m6s
Citation-Prediction	Scientific	SCIDOCS	1,000	25K	14.91	35.99	~2m25s
Fact Checking	Scientific	SciFact	300	5K	64.92	92.60	~30s

https://blog.748511.xyz/archives/ir

Author

zlv007

Published

2025-10-26

Updated

2025-10-26

License

CC BY 4.0