IRC-Bench: Recognizing Entities from Contextual Cues
in First-Person Reminiscences

Alexander Apartsin1, Eden Moran2, Yehudit Aperstein2
1School of Computer Science, Faculty of Sciences, HIT-Holon Institute of Technology, Holon 58102, Israel
2Intelligent Systems, Afeka Academic College of Engineering, Tel Aviv 69988, Israel

Abstract

People sharing personal reminiscences routinely reference entities through contextual cues alone, without explicit naming. We extend implicit entity recognition from short social-media text [17, 18] to long-form reminiscence narratives, where entity cues are distributed across multiple clauses. We release IRC-Bench, a benchmark of 25,136 samples derived from 12,337 Wikidata-linked entities across 1,994 oral history transcripts spanning 11 thematic domains. We formalize the non-locality property of implicit references: recognition cues are distributed across non-contiguous clauses, distinguishing this task from NER, entity linking, and coreference resolution. We evaluate 19 configurations spanning open-world LLM generation, closed-world dense retrieval, hybrid RAG, and fine-tuning. QLoRA-adapted Llama 3.1 8B achieves 38.94% exact match (51.59% Jaccard), while fine-tuned DPR reaches 35.38% Hit@1 and 71.49% Hit@10. Chain-of-thought prompting consistently degrades performance, and RAG underperforms direct LLM inference.

Keywords: implicit entity recognition, IRC-Bench, reminiscence narratives, non-locality, large language models, dense passage retrieval, QLoRA

1. Introduction

Reminiscence, the act of recalling and sharing personal memories, plays a central role in human social life. In clinical settings, reminiscence therapy reduces depression in older adults [6], while in archival contexts, recorded reminiscences preserve cultural knowledge [7]. A defining characteristic of reminiscence narratives is that speakers reference people, places, and events through contextual cues rather than explicit naming, trusting the listener to fill in the gaps. This implicit referencing creates a fundamental challenge for automated systems that seek to index, search, or analyze these narratives.

Consider the following passage from a Japanese American reminiscence:

Entity-Grounded Narrative (EGN)

"The attack on Pearl Harbor was the event that changed everything for Japanese Americans like me. After December 7, 1941, suspicion and hatred grew, and we were treated as enemy aliens despite being American citizens. It was because of Pearl Harbor that the government issued Executive Order 9066 and started the forced relocation."

Entity-Elided Narrative (EEN)

"The surprise attack on a naval base in Hawaii was the event that changed everything for Japanese Americans like me. After December 7, 1941, suspicion and hatred grew, and we were treated as enemy aliens despite being American citizens. It was because of that attack that the government issued an order and started the forced relocation."

Gold entity: Attack on Pearl Harbor (Q52418)  |  Cues: December 7, 1941; naval base in Hawaii; Executive Order 9066; forced relocation

A human reader identifies the Attack on Pearl Harbor from the constellation of cues: the date, the Hawaiian naval base, the executive order, the internment. No single phrase names the entity; recognition depends on integrating knowledge distributed across the entire passage. This phenomenon falls between existing NLP tasks. Named Entity Recognition (NER) identifies explicit spans [1, 2]. Entity Linking (EL) resolves those spans to knowledge base entries [3, 4]. Coreference resolution requires at least one explicit mention as antecedent [5]. In implicit entity references, the entity is never named; there is no span to extract, no mention to link, no antecedent to resolve.

Our contributions are: (1) We extend implicit entity recognition from tweets [17, 18] to long-form reminiscence narratives and formalize the non-locality property of implicit references. (2) We release IRC-Bench, a benchmark of 25,136 samples from 12,337 Wikidata-linked entities across 1,994 transcripts, with entity-level train/dev/test splits ensuring zero entity overlap. (3) We systematically compare 19 configurations spanning four paradigms (generative LLM, dense retrieval, RAG, fine-tuning), revealing that fine-tuning doubles performance, chain-of-thought reasoning degrades accuracy, and model scale is the dominant factor in open-world performance.

2. Related Work

Limited prior work has addressed entities that are referenced but never named. Hosseini [17] introduced implicit entity recognition in tweets, constructing a dataset of 3,119 tweets. Hosseini and Bagheri [18] developed learning-to-rank methods for this Twitter dataset. Perera et al. [19] explored implicit entity recognition in clinical documents. The coreference resolution community has studied "zero anaphora" and bridging references [20, 21], where entities are referenced indirectly. Computational approaches to reminiscence have focused on therapy systems using conversational agents [8, 9] and oral history processing for transcription and search [10, 11]; none address the challenge of identifying implicitly referenced entities.

Our work differs from prior formulations in several ways. First, reminiscence narratives are extended first-person accounts (50 to 200 words per sample) with rich, diffuse contextual cues, unlike short tweets or templated clinical notes. Second, we formalize and empirically demonstrate the non-locality property: recognition requires integrating multiple non-contiguous cues. Third, IRC-Bench contains 25,136 samples spanning 12,337 entities across 11 domains, compared to 3,119 tweet samples in Hosseini [17]. Fourth, we introduce entity-level splitting with zero overlap, ensuring models must generalize to unseen entities. Fifth, we evaluate 19 configurations spanning four paradigms, whereas prior work evaluated at most two to three approaches.

3. Dataset Construction

IRC-Bench is constructed through a four-stage automated pipeline that transforms oral history transcripts into implicit entity recognition samples. The pipeline leverages GPT-4.1-mini for entity extraction, summary generation, and implicit rewriting, producing 25,136 benchmark samples spanning 12,337 unique entities from 1,994 transcripts across 11 thematic domains (Table 1).

Stage 1: Transcript Cleaning. Raw oral history transcripts are cleaned and converted to structured JSON, preserving the first-person narrative voice.

Stage 2: NER. GPT-4.1-mini identifies named entities of seven types (Place, Organization, Person, Event, Work, Military Unit, Other), linking each to Wikidata. This produces 31,284 mentions across 1,752 files.

Stage 3: Explicit Summary. For each (transcript, entity) pair, GPT-4.1-mini generates a first-person narrative preserving contextual cues.

Stage 4: Implicit Rewriting. Each summary is rewritten to remove all direct entity mentions while preserving contextual cues, producing the final 25,136 implicit samples.
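The four stages can be sketched as a single driver function. This is a minimal illustration, assuming an `llm` callable that stands in for the GPT-4.1-mini API; the task tags, prompts, and sample schema are our assumptions, not the authors' exact implementation.

```python
# Sketch of the four-stage IRC-Bench pipeline. `llm(task, payload)` is a
# placeholder for a GPT-4.1-mini call; schema details are illustrative.
def run_pipeline(raw_transcript, llm):
    # Stage 1: clean the transcript (first-person voice preserved).
    cleaned = llm("clean", raw_transcript)
    # Stage 2: NER with Wikidata linking -> list of (name, qid) pairs.
    entities = llm("ner", cleaned)
    samples = []
    for name, qid in entities:
        # Stage 3: entity-grounded narrative (EGN) keeping contextual cues.
        egn = llm("summarize", (cleaned, name))
        # Stage 4: entity elision -> entity-elided narrative (EEN).
        een = llm("elide", (egn, name))
        samples.append({"text": een, "entity": name, "qid": qid})
    return samples
```

Because each stage is just a call against the `llm` interface, the pipeline can be unit-tested with a stubbed model before spending API budget.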

Table 1: Source collections for IRC-Bench.

Collection | Transcripts | Description
Veterans | 517 | Military service narratives (Library of Congress VHP, others)
Immigration | 402 | Immigration experiences (Univ. Minnesota, Densho)
Regional | 314 | Regional histories (UNR, Kentucky Oral History)
Depression Era | 213 | Great Depression (Federal Writers' Project)
Japanese American | 156 | Internment and post-war (Densho)
Academic | 153 | Academic histories (Columbia, Smithsonian)
September 11 | 72 | 9/11 experiences (NPS Memorial)
Civil Rights | 68 | Civil rights movement (Library of Congress)
COVID-19 / Labor / Refugee | 99 | Pandemic, labor, and refugee experiences
Total | 1,994 | 11 thematic domains, 25+ institutional archives

The dataset is split at the entity level (70/10/20 ratio, seed=42), ensuring all samples for a given entity appear in exactly one partition (Table 2). Places dominate (47.3%), followed by Organizations (21.3%) and Persons (13.7%).
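The entity-level split described above can be implemented in a few lines. This is a sketch of the stated protocol (70/10/20, seed 42); the `"qid"` key and the exact shuffling procedure are our assumptions.

```python
import random

def entity_level_split(samples, seed=42, ratios=(0.7, 0.1, 0.2)):
    """Entity-level split: every sample of a given entity lands in exactly
    one partition, so test entities are never seen in training.
    `samples` is a list of dicts with a "qid" key (illustrative schema)."""
    entities = sorted({s["qid"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(entities)
    n_train = int(ratios[0] * len(entities))
    n_dev = int(ratios[1] * len(entities))
    part = {}
    for i, e in enumerate(entities):
        part[e] = "train" if i < n_train else ("dev" if i < n_train + n_dev else "test")
    splits = {"train": [], "dev": [], "test": []}
    for s in samples:
        splits[part[s["qid"]]].append(s)
    return splits
```

Splitting over entity IDs rather than samples is what guarantees the zero-overlap property: a popular entity with many samples cannot leak cue patterns from train into test.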

Table 2: IRC-Bench partition statistics.

Partition | Samples | Entities
Train | 17,971 | 8,635
Dev | 2,532 | 1,234
Test | 4,633 | 2,468
Total | 25,136 | 12,337

Figure 1: IRC-Bench construction pipeline. Raw oral history transcripts undergo cleaning, NER with Wikidata linking, entity-grounded narrative generation, and entity elision to produce implicit entity recognition samples.

4. Methodology

4.1 Task Formulation

Given a first-person narrative \(T\) that implicitly references a named entity \(e^*\) without ever mentioning \(e^*\) by name, the task is to identify \(e^*\). We evaluate under two formulations. In the open-world setting, the model generates the entity name as free-form text without access to a candidate set. In the closed-world setting, the model ranks all 12,337 entities in the knowledge base by relevance to the query.

4.2 The Non-Locality Property

We define a key structural property distinguishing implicit entity recognition from span-based tasks. Let \(C(T, e^*) = \{c_1, c_2, \ldots, c_n\}\) denote the set of textual cues in \(T\) that collectively identify \(e^*\). In NER and EL, the entity is localized in a contiguous span. In implicit entity recognition, the entity is non-local: no single contiguous substring suffices, but the distributed cues collectively determine the entity. We validate this empirically: GPT-4o zero-shot accuracy on full implicit texts reaches 33.5%, while single-sentence accuracy drops to 12.9% (n=200), confirming that recognition requires integrating cues distributed across the entire passage.
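The contrast with span-based tasks can be stated compactly. The following is our paraphrase in the notation above, treating \(P(e^* \mid \cdot)\) as an idealized recognizer's posterior (an assumption of this sketch, not notation from the paper):

```latex
% Locality (NER / EL): some contiguous span s of T identifies the entity.
\exists\, \text{contiguous } s \subseteq T:\quad
    P(e^* \mid s) \approx P(e^* \mid T)

% Non-locality (implicit reference): no contiguous span suffices,
% but the distributed cue set C(T, e^*) collectively does.
\forall\, \text{contiguous } s \subseteq T:\quad
    P(e^* \mid s) \ll P(e^* \mid T)
\qquad\text{and}\qquad
    P(e^* \mid c_1, \dots, c_n) \approx P(e^* \mid T)
```

The single-sentence ablation (33.5% on full texts vs. 12.9% on single sentences) is the empirical counterpart of the first inequality.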

4.3 Open-World Methods

We evaluate LLMs in a generative setting with zero-shot (ZS), few-shot (FS, 5 demonstrations), and chain-of-thought (CoT) prompting. Models include GPT-4o [23], GPT-4.1-mini, and Llama 3.1 8B Instruct [22]. Direct prompting uses temperature 0.0; CoT uses temperature 0.7 with 300 output tokens to accommodate the reasoning trace. We additionally fine-tune Llama 3.1 8B using QLoRA [13, 14] with 4-bit NF4 quantization, LoRA rank 16, alpha 32, and 2 epochs on the train split. The entity-level split guarantees the model cannot memorize entity-specific patterns.
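The stated QLoRA hyperparameters map directly onto `transformers`/`peft` config objects. This is a config fragment reflecting the reported values; the target modules and compute dtype are our assumptions, as the paper does not state them.

```python
# QLoRA configuration for Llama 3.1 8B, using values reported in the paper
# (4-bit NF4, LoRA rank 16, alpha 32). target_modules and compute dtype
# are assumptions, not specified in the text.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat (paper)
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption
)

lora_config = LoraConfig(
    r=16,                  # LoRA rank (paper)
    lora_alpha=32,         # LoRA scaling (paper)
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
# Training: 2 epochs on the train split; note the max_seq_len=192 limit
# discussed in the Limitations section.
```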

4.4 Closed-World Methods

We encode both implicit query text and entity representations into a shared embedding space, ranking by cosine similarity. We explore three entity representations: Name (entity name alone), Description (name + LLM-generated description), and Wiki (first Wikipedia sentence). We evaluate BAAI/bge-base-en-v1.5 [15] off-the-shelf and fine-tuned via Dense Passage Retrieval (DPR) [16] with Multiple Negatives Ranking Loss (3 epochs, batch size 48, learning rate 2e-5).
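The closed-world scoring and metrics reduce to cosine ranking over precomputed embeddings. A minimal sketch, assuming embeddings are already produced by the BGE or DPR encoder (the toy vectors below are illustrative only):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_entities(query_vec, entity_vecs):
    """Return entity ids sorted by descending cosine similarity.
    `entity_vecs` maps entity id -> embedding of its representation
    (Name, Description, or Wiki variant)."""
    return sorted(entity_vecs,
                  key=lambda e: cosine(query_vec, entity_vecs[e]),
                  reverse=True)

def hit_at_k(ranking, gold, k):
    """Hit@K: is the gold entity in the top-K of the ranking?"""
    return gold in ranking[:k]

def mrr(rankings_and_golds):
    """Mean reciprocal rank over (ranking, gold) pairs."""
    total = 0.0
    for ranking, gold in rankings_and_golds:
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(rankings_and_golds)
```

In practice the 12,337 entity embeddings would be stacked into a matrix so ranking is one matrix-vector product, but the metric definitions are unchanged.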

4.5 RAG Baseline

A hybrid pipeline retrieves top-5 candidates via BGE-base, then GPT-4.1-mini reranks and selects the most likely entity. This tests whether LLM reranking can improve over pure embedding retrieval.

4.6 Evaluation Protocol

We employ a four-tier matching hierarchy: (1) exact string match, (2) alias match via Wikidata, (3) containment match (substring), and (4) Jaccard similarity ≥ 0.5. The primary open-world metric is alias-aware accuracy (Tiers 1+2). Closed-world experiments report Hit@K, MRR, and alias-aware Hit@1. All key comparisons use McNemar's test for significance.

5. Results

5.1 Open-World Performance

Table 3 presents the open-world results. QLoRA-adapted Llama 3.1 8B (O10) achieves the highest exact match at 38.94%, substantially outperforming all non-fine-tuned methods. Among direct prompting, GPT-4o few-shot (O2) is strongest at 31.62% exact match. Model scale is the dominant zero-shot factor: Llama 3.1 8B (13.92%) to GPT-4.1-mini (25.71%) to GPT-4o (27.02%). Few-shot prompting consistently improves all models (p < 0.001).

QLoRA fine-tuning nearly triples the base Llama's zero-shot performance (13.92% to 38.94%) and exceeds GPT-4o few-shot by 7.32 pp, despite the entity-level split ensuring zero overlap with training entities. At the Jaccard level, O10 reaches 51.59%, meaning more than half of predictions are at least partially correct.

Chain-of-thought prompting degrades all models: GPT-4o drops from 33.30% (ZS alias) to 23.89%, GPT-4.1-mini from 25.71% (ZS exact) to 18.93%, and Llama 3.1 8B from 13.92% to 6.22%. Temperature control experiments confirm CoT structurally harms smaller models, while for GPT-4o, controlling temperature recovers parity with zero-shot but does not exceed it. The RAG approach (19.71% exact) underperforms even GPT-4.1-mini zero-shot (25.71%).

Table 3: Open-world results on IRC-Bench test set (n=4,633). Best result in each column is highlighted.

ID | Model | Mode | Exact (%) | Alias (%) | Contain (%) | Jaccard (%)
O1 | GPT-4o | Zero-shot | 27.02 | 33.30 | 33.30 | 35.05
O2 | GPT-4o | Few-shot | 31.62 | 38.94 | 38.94 | 41.10
O3 | GPT-4.1-mini | Zero-shot | 25.71 | 27.09 | 33.50 | 35.94
O4 | GPT-4.1-mini | Few-shot | 28.66 | 36.89 | 36.89 | 39.48
O5 | Llama 3.1 8B | Zero-shot | 13.92 | 14.81 | 19.47 | 20.18
O6 | Llama 3.1 8B | Few-shot | 17.83 | 18.80 | 24.61 | 25.66
O10 | Llama 3.1 8B (QLoRA) | Fine-tuned | 38.94 | 41.42 | 47.90 | 51.59
O11 | GPT-4.1-mini | CoT (t=0.7) | 18.93 | 20.27 | 26.48 | 27.69
O12 | GPT-4o | CoT (t=0.7) | 22.51 | 23.89 | 30.91 | 32.33
O13 | Llama 3.1 8B | CoT (t=0.7) | 6.22 | 6.69 | 11.72 | 12.24
RAG1 | BGE + GPT-4.1-mini | RAG | 19.71 | 20.53 | 28.75 | 29.55

Figure 2: Comparison of open-world and closed-world methods on IRC-Bench. Open-world methods measured by exact match and alias-aware accuracy; closed-world methods by Hit@1 and Hit@10.

5.2 Closed-World Performance

Table 4 shows closed-world retrieval results. Fine-tuned DPR with descriptions (C5) achieves 35.38% Hit@1, 71.49% Hit@10, and 0.4751 MRR. With alias-aware evaluation, C5 reaches 42.80% Hit@1. DPR fine-tuning more than doubles Hit@1 for all entity representations: Name (16.51% to 30.00%), Description (16.64% to 35.38%), Wiki (14.38% to 27.95%). The largest absolute gain occurs for descriptions (+18.74 pp), indicating that fine-tuning is especially effective at aligning narrative cue structure with rich attribute content. Entity descriptions consistently outperform name-only and Wikipedia representations across both retrieval architectures.

Table 4: Closed-world retrieval results. Candidate set: 12,337 entities. Best in each column is highlighted.

ID | Retriever | Entity Repr. | Hit@1 (%) | Hit@3 (%) | Hit@5 (%) | Hit@10 (%) | MRR | Alias H@1 (%)
C1 | BGE (off-the-shelf) | Name | 16.51 | 26.38 | 30.97 | 36.76 | 0.2362 | 22.08
C2 | BGE (off-the-shelf) | Description | 16.64 | 27.78 | 33.41 | 40.60 | 0.2480 | 21.78
C3 | BGE (off-the-shelf) | Wiki | 14.38 | 25.10 | 29.92 | 37.32 | 0.2211 | 19.32
C4 | DPR (fine-tuned) | Name | 30.00 | 46.36 | 53.66 | 63.31 | 0.4131 | 37.10
C5 | DPR (fine-tuned) | Description | 35.38 | 53.51 | 61.82 | 71.49 | 0.4751 | 42.80
C6 | DPR (fine-tuned) | Wiki | 27.95 | 44.98 | 51.82 | 59.55 | 0.3851 | 34.38

5.3 Key Findings

Finding 1: Fine-tuning is the most impactful intervention. QLoRA raises exact match from 13.92% to 38.94% (2.80x). DPR fine-tuning raises Hit@1 from 16.64% to 35.38% (2.13x). Both achieve these gains despite zero entity overlap between train and test sets.

Finding 2: Chain-of-thought degrades all models. CoT reduces GPT-4o from 33.30% to 23.89% (alias), GPT-4.1-mini from 25.71% to 18.93% (exact), and Llama 8B from 13.92% to 6.22%. Temperature control experiments confirm this is structural for smaller models; for GPT-4o, controlling temperature recovers parity with zero-shot but does not exceed it.

Finding 3: RAG underperforms direct LLM inference. RAG1 (19.71%) is 8.95 pp below GPT-4.1-mini few-shot (28.66%). The retrieval bottleneck (gold entity in top-5 only 33% of the time) is the limiting factor.

Finding 4: Model scale dominates zero-shot performance. GPT-4o ZS (27.02%) outperforms Llama 8B ZS (13.92%) by 13.10 pp (p < 0.001). The retriever's Hit@10 of 71.49% reveals strong latent signal for future DPR+LLM reranking approaches.

6. Discussion

The failure of chain-of-thought prompting is the most counterintuitive finding. CoT improves mathematical reasoning and multi-hop QA by decomposing complex problems [12], yet it degrades implicit entity recognition. The explanation lies in the gestalt nature of the task: identifying an implicit entity requires simultaneously attending to a constellation of distributed cues and matching this constellation against parametric knowledge. When forced to reason step by step, models fixate on individual cues in isolation, arriving at locally plausible but globally incorrect entities.

The RAG pipeline underperforms direct generation because dense retrievers encode the EEN as a single vector, losing fine-grained cue information. Retrieved candidates are topically related but often incorrect, and when presented as context they can override the model's own correct intuition.

The success of QLoRA despite zero entity overlap indicates that fine-tuning teaches transferable skills: the task format (extracting a single canonical name), cue integration patterns (which temporal, spatial, and relational combinations are diagnostic), and entity type priors (calibrating expectations to reduce wrong-type errors). The model learns "how to solve implicit entity puzzles" rather than memorizing specific answers.

Benchmark calibration validates the dataset's difficulty: an automated quality assessment (n=500, GPT-4o judge) found 42% of samples recoverable by informed judgment, closely matching the best system's alias accuracy (41.4%), suggesting top models approach the practical ceiling imposed by available cues. EEN naturalness averaged 4.87/5.

Limitations include the English-only, American-focused scope; LLM-generated entity elision (though validated); alias-aware evaluation that still penalizes unregistered surface forms; and QLoRA's max_seq=192 tokens truncating approximately 6% of prompts. Extended analyses including per-entity-type breakdowns, error classification, and significance tests are available in the supplementary materials.

7. Conclusion

We have extended implicit entity recognition to long-form reminiscence narratives, formalizing the non-locality property and releasing IRC-Bench with 25,136 samples from 12,337 Wikidata-linked entities across 1,994 oral history transcripts. Our evaluation of 19 configurations reveals that fine-tuning is the most impactful intervention (QLoRA achieving 38.94% exact match, nearly tripling the base model), chain-of-thought prompting consistently degrades performance, and RAG underperforms direct inference due to the non-locality of implicit cues. Future work should explore combining DPR shortlists with LLM reranking (leveraging C5's 71.49% Hit@10), cross-lingual benchmarks, and architectures that explicitly model distributed cue structure.

References

[1] Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3-26.

[2] Li, J., Sun, A., Han, J., and Li, C. (2022). A survey on deep learning for named entity recognition. IEEE TKDE, 34(1):50-70.

[3] Ganea, O.-E. and Hofmann, T. (2017). Deep joint entity disambiguation with local neural attention. In Proc. EMNLP, pages 2619-2629.

[4] Kolitsas, N., Ganea, O.-E., and Hofmann, T. (2018). End-to-end neural entity linking. In Proc. CoNLL, pages 519-529.

[5] Lee, K., He, L., Lewis, M., and Zettlemoyer, L. (2017). End-to-end neural coreference resolution. In Proc. EMNLP, pages 188-197.

[6] Boyd, D. A. (2012). Achieving the promise of oral history in a digital age. In Ritchie, D. A., editor, The Oxford Handbook of Oral History. Oxford University Press.

[7] Lazar, A., Demiris, G., and Thompson, H. (2016). Evaluation of a multifunctional technology system in a memory care unit. Informatics for Health and Social Care, 41(4):373-389.

[8] Subramaniam, P. and Woods, B. (2012). The impact of individual reminiscence therapy for people with dementia. Expert Review of Neurotherapeutics, 12(5):545-555.

[9] Nikitina, S., Callaioli, S., and Baez, M. (2018). Smart conversational agents for reminiscence. In Proc. 1st Intl. Workshop on Software Engineering for Cognitive Services, pages 52-57.

[10] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. In Proc. NAACL, pages 260-270.

[11] Pessanha, F. and Akdag Salah, A. (2022). A computational look at oral history archives. ACM JOCCH, 15(1):6:1-6:16.

[12] Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Proc. NeurIPS.

[13] Hu, E. J., Shen, Y., Wallis, P., et al. (2022). LoRA: Low-rank adaptation of large language models. In Proc. ICLR.

[14] Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized language models. In Proc. NeurIPS.

[15] Xiao, S., Liu, Z., Zhang, P., et al. (2023). C-Pack: Packaged resources to advance general Chinese embedding. arXiv:2309.07597.

[16] Karpukhin, V., Oguz, B., Min, S., et al. (2020). Dense passage retrieval for open-domain question answering. In Proc. EMNLP, pages 6769-6781.

[17] Hosseini, H. (2022). Implicit entity recognition and linking in tweets. PhD thesis, Toronto Metropolitan University.

[18] Hosseini, H. and Bagheri, E. (2021). Learning to rank implicit entities on Twitter. Information Processing & Management, 58(3):102503.

[19] Perera, N., Dehmer, M., and Emmert-Streib, F. (2020). Named entity recognition and relation detection for biomedical information extraction. Frontiers in Cell and Dev. Biology, 8:673.

[20] Hou, Y. (2020). Bridging anaphora resolution as question answering. In Proc. ACL, pages 1428-1438.

[21] Poesio, M., Stuckardt, R., and Versley, Y. (2016). Anaphora Resolution: Algorithms, Resources, and Applications. Springer.

[22] Dubey, A., et al. (2024). The Llama 3 herd of models. arXiv:2407.21783.

[23] OpenAI (2024). GPT-4o system card. Technical Report.

[24] Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. NeurIPS.

[25] Wu, L., Petroni, F., Josifoski, M., Riedel, S., and Zettlemoyer, L. (2020). Scalable zero-shot entity linking with dense entity retrieval. In Proc. EMNLP, pages 6397-6407.