DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base. It is meant for evaluating retrieval systems that return a ranked list of entities (DBpedia URIs) in response to a free-text user query.

The first version of the collection (DBpedia-Entity v1) was released in 2013, based on DBpedia v3.7 [1]. It was created by assembling search queries from a number of entity-oriented benchmarking campaigns and mapping relevant results to DBpedia. An updated version, DBpedia-Entity v2, was released in 2017 as a collaborative effort between the IAI group of the University of Stavanger, the Norwegian University of Science and Technology, Wayne State University, and Carnegie Mellon University [2]. It was published at the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17), where it received a Best Short Paper Honorable Mention Award. See the paper and poster.

Knowledge base

The test collection is based on DBpedia version 2015-10, specifically on the English subset. We require entities to have both a title and an abstract (i.e., rdfs:label and rdfs:comment predicates); this effectively filters out category, redirect, and disambiguation pages. Note that list pages, on the other hand, are retained. The resulting collection contains 4.6 million entities, each uniquely identified by its URI. We use a simplified prefixed format: http://dbpedia.org/resource/Albert_Einstein => <dbpedia:Albert_Einstein>.
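The prefixed format can be produced mechanically from a full URI. The following is a minimal sketch of that mapping; the function names `to_prefixed` and `to_full_uri` are hypothetical helpers introduced for illustration and are not part of the collection's tooling.

```python
# Namespace taken from the example URI in the text above.
DBPEDIA_NS = "http://dbpedia.org/resource/"
PREFIX_OPEN = "<dbpedia:"


def to_prefixed(uri: str) -> str:
    """Convert a full DBpedia resource URI to the <dbpedia:...> form."""
    if not uri.startswith(DBPEDIA_NS):
        raise ValueError(f"not a DBpedia resource URI: {uri!r}")
    return f"{PREFIX_OPEN}{uri[len(DBPEDIA_NS):]}>"


def to_full_uri(prefixed: str) -> str:
    """Convert the <dbpedia:...> form back to a full resource URI."""
    if not (prefixed.startswith(PREFIX_OPEN) and prefixed.endswith(">")):
        raise ValueError(f"not a prefixed DBpedia identifier: {prefixed!r}")
    return DBPEDIA_NS + prefixed[len(PREFIX_OPEN):-1]


print(to_prefixed("http://dbpedia.org/resource/Albert_Einstein"))
```

The reverse helper is included because downstream tools may expect full URIs rather than the shortened form.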

Queries

The collection consists of a set of heterogeneous entity-bearing queries, assembled from various benchmarking campaigns (see the paper for details). Queries are categorized into four groups:

| Category | Description | Examples |
|---|---|---|
| SemSearch_ES | Named entity queries | “brooklyn bridge”, “08 toyota tundra” |
| INEX-LD | IR-style keyword queries | “electronic music genres” |
| QALD2 | Natural language questions | “Who is the mayor of Berlin?” |
| ListSearch | Queries that seek a particular list of entities | “Professional sports teams in Philadelphia” |

All query IDs are prefixed with the name of the originating benchmark. SemSearch_ES, INEX-LD, and QALD2 each correspond to a separate category; all remaining queries belong to the ListSearch category.
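The category rule above can be sketched as a simple prefix check. This is an illustration only: it assumes the benchmark name appears at the very start of each query identifier, and it treats hyphens and underscores as interchangeable because the exact spelling of the prefixes in the query files is not stated here. The identifiers in the usage lines are hypothetical examples, and `query_category` is a name introduced for this sketch.

```python
# Benchmarks that map one-to-one to a category, as named in the table above.
NAMED_CATEGORIES = ("SemSearch_ES", "INEX-LD", "QALD2")


def _normalize(text: str) -> str:
    """Lowercase and unify separators so 'INEX-LD' and 'INEX_LD' compare equal."""
    return text.replace("-", "_").lower()


def query_category(query_id: str) -> str:
    """Return the category for a query ID; anything unmatched is ListSearch."""
    normalized = _normalize(query_id)
    for name in NAMED_CATEGORIES:
        if normalized.startswith(_normalize(name)):
            return name
    return "ListSearch"


# Hypothetical identifiers, used only to exercise the rule.
for qid in ("SemSearch_ES-1", "INEX_LD-2009022", "QALD2_te-1", "INEX_XER-100"):
    print(qid, "->", query_category(qid))
```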

Relevance judgments

Relevance judgments were collected using crowdsourcing. To ensure high quality, we obtained further expert annotations for cases with substantial disagreement. In total, over 49K query-entity pairs were labeled on a three-point scale (0: irrelevant, 1: relevant, and 2: highly relevant).
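Graded labels like these are typically consumed by rank-based metrics that reward highly relevant entities more than merely relevant ones. As a rough illustration, the sketch below computes NDCG at a cutoff from a ranking and a dictionary of 0/1/2 labels. The choice of metric variant (linear gain with a log2 discount) is an assumption made for this example and may differ from the official evaluation setup; `ndcg_at_k` and the toy entity names are hypothetical.

```python
import math
from typing import Dict, List


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranking: List[str], judgments: Dict[str, int], k: int) -> float:
    """NDCG@k for one query; unjudged entities count as irrelevant (0)."""
    gains = [judgments.get(entity, 0) for entity in ranking[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy labels on the three-point scale described above.
labels = {"<dbpedia:A>": 2, "<dbpedia:B>": 1, "<dbpedia:C>": 0}
print(ndcg_at_k(["<dbpedia:B>", "<dbpedia:A>"], labels, k=10))
```

Treating unjudged entities as irrelevant is a common convention, but it is a design choice; a stricter evaluation might instead drop unjudged results before scoring.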

Files

The DBpedia-Entity v2 collection can be found under collection/v2 and is organized as follows:

This repository also contains the DBpedia-Entity v1 collection, which was built on DBpedia version 3.7. The collection can be found under collection/v1 and is organized similarly to the v2 version. There are, however, three qrels files for DBpedia-Entity v1:

Baseline rankings

The runs folder contains a set of baseline rankings (“runs”) in TREC format (the details of the indices used for generating these runs are described here).

| Model | SemSearch ES @10 | SemSearch ES @100 | INEX-LD @10 | INEX-LD @100 | ListSearch @10 | ListSearch @100 | QALD-2 @10 | QALD-2 @100 | Total @10 | Total @100 |
|---|---|---|---|---|---|---|---|---|---|---|
| BM25 | 0.2497 | 0.4110 | 0.0277 | 0.3612 | 0.2199 | 0.3302 | 0.2751 | 0.3366 | 0.2558 | 0.3582 |
| PRMS | 0.5340 | 0.6108 | 0.3590 | 0.4295 | 0.3684 | 0.4436 | 0.3151 | 0.4026 | 0.3905 | 0.4688 |
| MLM-all | 0.5528 | 0.6247 | 0.3752 | 0.4493 | 0.3712 | 0.4577 | 0.3249 | 0.4208 | 0.4021 | 0.4852 |
| LM | 0.5555 | 0.6475 | 0.3999 | 0.4745 | 0.3925 | 0.4723 | 0.3412 | 0.4338 | 0.4182 | 0.5036 |
| SDM | 0.5535 | 0.6672 | 0.4030 | 0.4911 | 0.3961 | 0.4900 | 0.3390 | 0.4274 | 0.4185 | 0.5143 |
| LM+ELR | 0.5554 | 0.6469 | 0.4040 | 0.4816 | 0.3992 | 0.4845 | 0.3491 | 0.4383 | 0.4230 | 0.5093 |
| SDM+ELR | 0.5548 | 0.6680 | 0.4104 | 0.4988 | 0.4123 | 0.4992 | 0.3446 | 0.4363 | 0.4261 | 0.5211 |
| MLM-CA | 0.6247 | 0.6854 | 0.4029 | 0.4796 | 0.4021 | 0.4786 | 0.3365 | 0.4301 | 0.4365 | 0.5143 |
| BM25-CA | 0.5858 | 0.6883 | 0.4120 | 0.5050 | 0.4220 | 0.5142 | 0.3566 | 0.4426 | 0.4399 | 0.5329 |
| FSDM | 0.6521 | 0.7220 | 0.4214 | 0.5043 | 0.4196 | 0.4952 | 0.3401 | 0.4358 | 0.4524 | 0.5342 |
| BM25F-CA | 0.6281 | 0.7200 | 0.4394 | 0.5296 | 0.4252 | 0.5106 | 0.3689 | 0.4614 | 0.4605 | 0.5505 |
| FSDM+ELR | 0.6563 | 0.7257 | 0.4354 | 0.5134 | 0.4220 | 0.4985 | 0.3468 | 0.4456 | 0.4590 | 0.5408 |
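For readers who want to load the baseline runs programmatically, the sketch below parses the standard TREC run layout, in which each line holds six whitespace-separated fields: query ID, the literal `Q0`, document (here, entity) ID, rank, score, and run tag. The function name `parse_trec_run` and the sample lines are hypothetical; the sample scores are made up for illustration and are not taken from the actual run files.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_trec_run(lines: Iterable[str]) -> Dict[str, List[Tuple[str, float]]]:
    """Map each query ID to its (entity, score) pairs, best score first."""
    run: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        query_id, _q0, entity, _rank, score, _tag = fields
        run[query_id].append((entity, float(score)))
    for pairs in run.values():
        # Re-sort by score so the result does not rely on file order.
        pairs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(run)


# Made-up sample lines in the six-column layout.
sample = [
    "q1 Q0 <dbpedia:Brooklyn_Bridge> 2 11.5 demo",
    "q1 Q0 <dbpedia:Manhattan_Bridge> 1 12.0 demo",
]
print(parse_trec_run(sample))
```

Because any iterable of lines is accepted, an open file handle can be passed directly, for example `with open(path) as handle: run = parse_trec_run(handle)`, where `path` points to one of the files in the runs folder.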

Citation

If you are using this collection, please cite the following paper:

@inproceedings{Hasibi:2017:DVT,
 author =    {Hasibi, Faegheh and Nikolaev, Fedor and Xiong, Chenyan and Balog, Krisztian and Bratsberg, Svein Erik and Kotov, Alexander and Callan, Jamie},
 title =     {DBpedia-Entity V2: A Test Collection for Entity Search},
 booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series =    {SIGIR '17},
 year =      {2017},
 pages =     {1265--1268},
 doi =       {10.1145/3077136.3080751},
 publisher = {ACM}
}

If possible, please also include the http://tiny.cc/dbpedia-entity URL in your paper.

Acknowledgments

This research was partially supported by the Norwegian Research Council, National Science Foundation (NSF) grant IIS-1422676, a Google Faculty Research Award, and an Allen Institute for Artificial Intelligence Student Fellowship. We thank Saeid Balaneshin, Jan R. Benetka, Heng Ding, Dario Garigliotti, Mehedi Hasan, Indira Kurmantayeva, and Shuo Zhang for their help with creating the relevance judgments.

Contact

In case of questions, feel free to contact f.hasibi@cs.ru.nl or krisztian.balog@uis.no.


[1] Krisztian Balog and Robert Neumayer. 2013. “A Test Collection for Entity Search in DBpedia”. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’13). 737-740.

[2] Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. 2017. “DBpedia-Entity v2: A Test Collection for Entity Search”. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). 1265-1268.