IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences

Abstract

When people share personal reminiscences, they routinely reference people, places, and events through contextual cues alone, assuming their audience can identify what is meant without explicit naming. This phenomenon is especially prevalent in reminiscence narratives: first-person accounts of lived experience used in therapeutic, archival, and social contexts. Building on prior work that established implicit entity recognition in short social-media text [21, 22], we extend this task to the reminiscence domain, where entity cues are distributed across multiple clauses rather than concentrated in a single short message. We release IRC-Bench (Implicit Reminiscence Context Benchmark), a benchmark of 25,136 samples derived from 12,337 unique Wikidata-linked entities across 1,994 reminiscence transcripts spanning 11 thematic domains. Each sample pairs an Entity-Grounded Narrative (EGN) containing the entity name with an Entity-Elided Narrative (EEN) from which all explicit mentions have been removed; systems must recover the correct entity given only the EEN. Drawing on the bridging-anaphora and zero-mention coreference literature, we operationalize the non-locality property of implicit references: recognition cues are distributed across multiple non-contiguous clauses, and no contiguous span in the passage suffices to identify the entity. This property distinguishes the task from standard named entity recognition, entity linking, and coreference resolution. We evaluate 19 experimental configurations spanning open-world LLM generation, closed-world dense retrieval, hybrid RAG, and fine-tuning approaches. QLoRA-adapted Llama 3.1 8B achieves the highest open-world exact match at 38.94% (51.59% Jaccard), while fine-tuned DPR with entity descriptions reaches 35.38% Hit@1 (42.80% alias-aware) and 71.49% Hit@10 in the closed-world setting. Chain-of-thought prompting consistently degrades performance across all models, and retrieval-augmented generation underperforms direct LLM inference. All data, code, and evaluation tools are publicly released.

Keywords: implicit entity recognition, IRC-Bench, reminiscence narratives, coreference resolution, non-locality, large language models, dense passage retrieval, QLoRA, benchmark, entity linking, Wikidata

1. Introduction

Reminiscence, the act of recalling and sharing personal memories, plays a central role in human social life. In clinical settings, reminiscence therapy has been shown to reduce depression and improve well-being in older adults [6, 7], while in archival contexts, recorded reminiscences preserve cultural and historical knowledge that would otherwise be lost [8]. A defining characteristic of reminiscence narratives is that speakers assume shared context with their audience: they reference people, places, and events through contextual cues rather than explicit naming, trusting the listener to fill in the gaps. This implicit referencing is natural in conversation but creates a fundamental challenge for automated systems that seek to index, search, or analyze these narratives.

Consider the following passage from a Japanese American reminiscence:

Entity-Grounded Narrative (EGN)

"The attack on Pearl Harbor was the event that changed everything for Japanese Americans like me. After December 7, 1941, suspicion and hatred grew, and we were treated as enemy aliens despite being American citizens. It was because of Pearl Harbor that the government issued Executive Order 9066 and started the forced relocation."

Entity-Elided Narrative (EEN)

"The surprise attack on a naval base in Hawaii was the event that changed everything for Japanese Americans like me. After December 7, 1941, suspicion and hatred grew, and we were treated as enemy aliens despite being American citizens. It was because of that attack that the government issued an order and started the forced relocation."

Gold entity: Attack on Pearl Harbor (Q52418) | Type: Event | Cues: December 7, 1941; naval base in Hawaii; Executive Order 9066; forced relocation

A human reader readily identifies the Attack on Pearl Harbor from the constellation of cues: the date, the Hawaiian naval base, the executive order, the internment of Japanese Americans. No single phrase names the entity; instead, recognition depends on integrating cultural, temporal, and historical knowledge distributed across the entire passage. This pattern of implicit entity reference is pervasive in reminiscence narratives, where speakers routinely allude to well-known people, places, and events without naming them, relying on shared background knowledge with their listener.

This phenomenon falls between existing NLP tasks without being addressed by any of them. Named Entity Recognition (NER) identifies explicitly mentioned entity spans in text [1, 2]. Entity Linking (EL) resolves those spans to knowledge base entries [3, 4]. Coreference resolution connects multiple references to the same entity but requires at least one explicit mention as an antecedent [5]. In implicit entity references, the entity is never named anywhere in the text; there is no span to extract, no mention to link, no antecedent to resolve. The task can be viewed as a form of zero-mention coreference: resolving a reference to an entity that has no surface realization in the text, only a distributed constellation of contextual cues.

While implicit entity recognition was first explored in short social-media text [21, 22], we extend it to a fundamentally different setting: long-form reminiscence narratives where entity cues are non-local, distributed across multiple clauses. We release IRC-Bench (Implicit Reminiscence Context Benchmark), a large-scale evaluation resource constructed from real reminiscence transcripts. This task addresses practical needs across multiple domains. Archives of personal reminiscences, including oral history collections containing millions of hours of recorded testimony, remain largely inaccessible to structured search because the entities discussed are rarely stated by name [9]. In healthcare, reminiscence therapy is a widely used intervention for older adults with dementia and depression [6, 7, 10]; automated systems that support these therapeutic conversations must identify the people and events being discussed even when the speaker does not name them. Social robotics and conversational AI for elderly companionship similarly require understanding implicit references to engage meaningfully with users' personal histories [11, 12]. More broadly, information retrieval over personal narratives requires understanding not just what is said, but what is meant.

Our contributions are as follows:

(1) Operationalizing non-locality at scale. Building on bridging-anaphora and zero-mention coreference literature [42, 43], we operationalize the non-locality property of implicit references: a formal definition (no contiguous span in the passage suffices to identify the entity, yet the union of non-contiguous cues does), a per-sample sentence-ablation diagnostic (full-text 33.5% vs. single-sentence 12.9% accuracy under GPT-4o zero-shot, n=200), and benchmark-scale evidence (25,136 samples) that this regime causes systematic failures in standard NER, EL, RAG, and CoT pipelines.
(2) IRC-Bench. We release IRC-Bench as a community evaluation resource: 25,136 implicit entity recognition samples derived from 12,337 unique Wikidata-linked entities sourced from 1,994 reminiscence transcripts across 11 thematic domains, with entity-level train/dev/test splits ensuring zero entity overlap between partitions. Each sample includes both an EGN and an EEN, along with entity metadata (QID, aliases, Wikipedia description). The release follows the NeurIPS Datasets and Benchmarks track convention; contributions (1) and (3) stand independently of the dataset release.
(3) Comprehensive evaluation. We systematically compare 19 experimental configurations spanning open-world LLM inference (zero-shot, few-shot, chain-of-thought, QLoRA fine-tuning), closed-world dense retrieval (off-the-shelf and DPR fine-tuned), and hybrid RAG, revealing that fine-tuning doubles performance in both paradigms, chain-of-thought reasoning degrades performance on this task, and model scale is the dominant factor in open-world accuracy.

2. Related Work

2.1 Named Entity Recognition

Named Entity Recognition identifies and classifies explicit entity mentions in text. Classical approaches relied on handcrafted features and conditional random fields [1], while modern systems employ deep learning architectures including BiLSTM-CRF [9], transformer-based sequence labeling [10], and large language model prompting [11, 12]. Benchmarks such as CoNLL-2003 [13] and MultiCoNER [14] have driven progress across entity types and languages. The W2NER framework [15] unified flat, nested, and discontinuous NER as word-word relation classification, and UniversalNER [16] demonstrated targeted distillation from LLMs for open-domain entity extraction. Despite these advances, all NER formulations assume that the target entity appears as an explicit surface form in the input text, an assumption that does not hold for implicit references.

2.2 Entity Linking

Entity Linking resolves textual mentions to entries in a knowledge base. Neural approaches include local attention models [3], bi-encoder architectures such as BLINK [17], autoregressive generation via GENRE [18], and efficient zero-shot systems like ReFinED [19]. Botha et al. [20] extended entity linking to over 100 languages. These systems take an identified mention span as input and rank candidate entities; they cannot operate when no mention span exists. Implicit entity recognition requires generating entity candidates from distributed contextual cues rather than resolving a given span.

2.3 Reminiscence Analysis and NLP

Reminiscence, the structured recall of autobiographical memories, has been studied extensively in psychology and gerontology. Butler [38] first proposed life review as a therapeutic process, and subsequent work established reminiscence therapy as an evidence-based intervention for depression and cognitive decline in older adults [6, 7]. Webster [39] developed the Reminiscence Functions Scale, identifying eight distinct functions of autobiographical memory sharing. Computational approaches to reminiscence have focused primarily on two areas: reminiscence therapy systems and oral history processing. Therapy-oriented systems use conversational agents or social robots to elicit and respond to personal memories [11, 12, 40], while oral history processing addresses transcription, topic segmentation, and search [9, 41]. However, none of these systems address the fundamental challenge of identifying the entities that speakers reference implicitly. Our work bridges this gap by extending implicit entity recognition, previously studied only in short social-media text [21, 22], to the reminiscence domain and providing the first benchmark derived from real reminiscence narratives.

2.4 Implicit and Zero-Mention Entities

Limited prior work has addressed entities that are referenced but not named. Hosseini [21] introduced implicit entity recognition in tweets, constructing a dataset of 3,119 tweets with implicit entity mentions. Hosseini and Bagheri [22] developed learning-to-rank methods for this Twitter dataset. Perera et al. [23] explored implicit entity recognition in clinical documents. The coreference resolution community has studied "zero anaphora" and bridging references [42, 43], where an entity is referenced indirectly through related concepts.

Our work differs from these efforts in five fundamental ways. First, domain and text structure. Tweets are short (under 280 characters), formulaic, and heavily context-dependent on trending topics; clinical notes follow rigid templates. Reminiscence narratives are extended first-person accounts (typically 50 to 200 words per sample) with rich, diffuse contextual cues spanning dates, locations, personal relationships, sensory details, and historical events. Second, non-locality. In tweets, the implicit entity is typically inferable from a single cue or hashtag context. In reminiscence narratives, building on the bridging-anaphora and zero-mention coreference literature [42, 43] that has long noted indirect reference, we operationalize the non-locality property at scale: a formal definition (no contiguous span suffices to identify the entity, yet the union of non-contiguous cues does), a per-sample sentence-ablation diagnostic, and benchmark-scale evidence that this regime causes systematic failures in standard pipelines. Third, scale and diversity. IRC-Bench contains 25,136 samples spanning 12,337 unique Wikidata-linked entities across 11 thematic domains, compared to 3,119 tweet samples in Hosseini [21] covering primarily entertainment and sports entities. Fourth, entity-level evaluation. We introduce entity-level train/test splitting with zero entity overlap, ensuring that models must generalize to entirely unseen entities rather than memorizing entity-specific patterns. Prior benchmarks used random sample-level splits where the same entity could appear in both training and test data. Fifth, comprehensive method comparison. We systematically evaluate 19 configurations spanning four paradigms (generative LLM, dense retrieval, RAG, fine-tuning), whereas prior work evaluated at most two to three approaches within a single paradigm.

2.5 Oral History NLP

Computational analysis of oral histories has received growing attention. Technology-assisted reminiscence systems have been developed for dementia care [7, 8], and AI-driven conversational agents have been explored as companions for elderly users [24, 25]. Digital storytelling platforms combining AI with augmented reality enable communities to preserve personal narratives [26]. However, these systems primarily facilitate memory recall and do not attempt to recover the implicit entities that speakers reference without naming.

2.6 Knowledge-Grounded Question Answering

The closest existing task to implicit entity recognition is knowledge-grounded question answering, where a system must reason over both a text passage and an external knowledge base to produce an answer [27, 28]. Retrieval-augmented generation (RAG) approaches retrieve relevant knowledge base passages and condition generation on them [29, 30]. While implicit entity recognition shares the requirement for external knowledge, it differs in that the "question" is an entire narrative rather than a targeted query, and the answer is always a single entity rather than a free-form text span. Furthermore, implicit entity recognition exhibits the non-locality property: the relevant cues are distributed throughout the passage rather than concentrated near a question token. This structural difference, as we show empirically, causes standard RAG pipelines to underperform direct LLM inference.

2.7 Recent Developments (2022 to 2025)

Since the initial study of implicit entity recognition on social media [21, 22], four related threads have advanced the surrounding state of the art without directly addressing the long-form implicit-mention regime. Entity linking with LLMs. Direct prompting frameworks such as ChatEL [44] and LLM-based context augmentation for long-tail entities [45] reformulate entity linking around LLM reasoning, but both presume an explicit surface mention to resolve. Long-context entity tracking. Recent benchmarks stress-test entity and fact recall across thousands of tokens, including BABILong [46] for reasoning-in-a-haystack and RULER [47] for synthetic needle-in-a-haystack variants. The most directly relevant is NoLiMa [48], which shows that long-context retrieval performance collapses when lexical overlap with the query is removed; this is precisely the regime that IRC-Bench targets at the dataset level, with naturally occurring rather than synthetically removed lexical anchors. Coreference and bridging. The CRAC 2023 shared task [49] codifies bridging-adjacent annotation across multilingual corpora but stops short of zero-mention recognition where no antecedent span exists. Oral history NLP. Speech-technology pipelines for archival audio [50] and LLM-based topical and sentiment annotation of oral-history corpora [51] motivate the downstream entity-grounding task we address, but none of these efforts attempts entity recovery under implicit reference. IRC-Bench occupies the intersection on which these threads converge: long-context, lexically decoupled, entity-grounded recognition over reminiscence narratives.

3. Dataset Construction

3.1 Overview

IRC-Bench is constructed through a four-stage automated pipeline that transforms oral history transcripts into implicit entity recognition samples. Each sample consists of a first-person narrative that references a named entity through contextual cues alone, without ever naming it. All pipeline stages use off-the-shelf GPT-4.1-mini via the OpenAI Batch API with low-temperature decoding (temperature 0.0 for NER, 0.3 for summarization and rewriting); no fine-tuning is performed for corpus construction. The pipeline produces 25,136 benchmark samples spanning 12,337 unique entities. Verbatim prompts for all stages are released in Appendix A.

3.2 Source Collections

The raw data comprises 1,994 cleaned oral history transcripts drawn from 11 thematic collections. These collections provide broad topical diversity, covering military conflicts, social movements, immigration, public health crises, labor history, and academic life. The first-person narrative style of oral histories naturally provides rich contextual cues (dates, locations, relationships, roles, events) that make implicit entity references solvable for knowledgeable readers, while remaining challenging for automated systems.

Table 1: IRC-Bench source collections by thematic domain.

Collection	Transcripts	Sources	Description
Veterans	517	Library of Congress VHP, Nevada WWII, Niles Library, Wisconsin Veterans Museum	Military service narratives
Immigration	402	University of Minnesota, Densho Digital Archive	Immigration and assimilation experiences
Regional	314	University of Nevada Reno, Kentucky Oral History Commission	Regional and community histories
Depression Era	213	Federal Writers' Project (Library of Congress)	Great Depression oral histories
Japanese American	156	Densho Digital Archive	Japanese American internment and post-war
Academic	153	Columbia University Oral History, Smithsonian Archives of American Art	Academic and university histories
September 11	72	National Park Service 9/11 Memorial	9/11 experiences and aftermath
Civil Rights	68	Civil Rights History Project (Library of Congress)	Civil rights movement narratives
COVID-19	42	Various oral history projects	Pandemic experiences
Labor	30	Labor Archives and Research Center	Labor movement histories
Refugee	27	Voices of Conscience, UNHCR collections	Refugee experiences
Total	1,994	25+ institutional archives	11 thematic domains

3.3 Pipeline Stages

The benchmark construction proceeds in four stages, illustrated in Figure 1.

Stage 1: Transcript Cleaning. Raw oral history transcripts are cleaned and converted to structured JSON format, preserving the first-person narrative voice while removing interviewer questions and metadata artifacts.

Stage 2: NER and entity linking. GPT-4.1-mini extracts entities of seven types (Place, Organization, Person, Event, Work, Military Unit, Other) and, for each, emits the canonical English Wikipedia title (full prompt: Appendix A.6). Each title is resolved to a Wikidata QID using the Wikipedia pageprops.wikibase_item endpoint, with a wbsearchentities fallback on Wikidata for unmatched titles. The resolved QID supplies up to ten aliases (aliases.en) and the Wikidata short description; the first sentence of the matching Wikipedia article is also fetched as description_wiki. Of 12,337 unique entities in the final KB, 84.6% retain a verified Wikipedia URL, 100% retain a Wikidata QID, and 51.2% have at least one alias. This stage produces 31,284 entity mentions across 1,752 transcript files (87.9% coverage).
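The title-to-QID resolution described in Stage 2 can be sketched as follows. This is a minimal illustration, not the released pipeline: the function names and the sample payloads are hypothetical, and the code only builds request parameters and parses responses (no network call is made). The parameter names follow the public MediaWiki `query`/`pageprops` and Wikidata `wbsearchentities` modules.

```python
from typing import Optional

# Public API endpoints (the pipeline's exact client code is not shown in the paper).
WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def pageprops_params(title: str) -> dict:
    """Parameters asking Wikipedia for the Wikidata item linked to a page title."""
    return {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,
        "titles": title,
        "format": "json",
    }


def search_params(title: str) -> dict:
    """Parameters for the fallback label search on Wikidata."""
    return {
        "action": "wbsearchentities",
        "search": title,
        "language": "en",
        "format": "json",
    }


def qid_from_pageprops(payload: dict) -> Optional[str]:
    """Extract the QID from a pageprops response, or None if the page is missing."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            return qid
    return None


def qid_from_search(payload: dict) -> Optional[str]:
    """Take the top-ranked search hit (an assumption; other tie-breaks are possible)."""
    hits = payload.get("search", [])
    return hits[0]["id"] if hits else None


def resolve_qid(pageprops_payload: dict, search_payload: dict) -> Optional[str]:
    """Primary lookup first, Wikidata search only for unmatched titles."""
    return qid_from_pageprops(pageprops_payload) or qid_from_search(search_payload)
```

In practice the two payloads would come from HTTP GET requests to `WIKIPEDIA_API` and `WIKIDATA_API` with the parameter dictionaries above; the fallback request need only be issued when the first lookup returns nothing.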

Stage 3: Explicit Summary Generation. For each (transcript, entity) pair, GPT-4.1-mini generates a first-person narrative summary focused on that entity, preserving the contextual cues surrounding the entity's mention (full prompt: Appendix A.7). This produces 25,161 explicit summaries from 1,601 transcript files (80.3% coverage).

Stage 4: Implicit Rewriting. For each summary from Stage 3, GPT-4.1-mini produces an entity-elided rewrite under a fixed prompt (Appendix A.8) with five hard rules: (i) remove the entity name and all obvious aliases; (ii) preserve first-person voice; (iii) preserve every contextual cue (dates, places, co-occurring events, roles, people); (iv) add no new information; (v) keep length to 3 to 5 sentences. Entity references are replaced with sentence-local descriptions inferred from the surrounding context (e.g., "Attack on Pearl Harbor" becomes "the attack on the naval base in Hawaii"). Replacement strings are never drawn from the KB. A subsequent string-match leakage check using each entity's Wikidata alias list filters samples where the name still appears verbatim, removing 25 of 25,161 candidates and yielding the final 25,136 implicit rewrites (80.2% coverage), which form the benchmark's implicit_text field.
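The alias-based leakage check at the end of Stage 4 can be illustrated with a short sketch. The normalization (lowercasing, collapsing punctuation) and the whole-word matching are assumptions made here for illustration; the paper specifies only a verbatim string match against the Wikidata alias list. The field names `entity` and `aliases` are hypothetical, while `implicit_text` is the field named above.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def leaks_entity(een: str, names: list[str]) -> bool:
    """True if the entity name or any alias occurs as a whole-word phrase in the EEN."""
    haystack = f" {normalize(een)} "
    for name in names:
        needle = normalize(name)
        if needle and f" {needle} " in haystack:
            return True
    return False


def filter_leaky(samples: list[dict]) -> list[dict]:
    """Keep only samples whose implicit_text contains neither the name nor an alias."""
    kept = []
    for sample in samples:
        names = [sample["entity"], *sample.get("aliases", [])]
        if not leaks_entity(sample["implicit_text"], names):
            kept.append(sample)
    return kept
```

A filter of this kind catches only surface-level leakage; paraphrased or partial names would pass through it.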

3.4 Entity Knowledge Base

The entity knowledge base contains 12,337 unique entities with the following metadata coverage: 84.6% have associated Wikipedia pages, 100% have a Wikidata QID, and 51.2% have alternative names sourced from Wikidata. Entity short descriptions used as retrieval targets in the closed-world experiments are sourced from Wikidata (descriptions.en) for 84.6% of entities and from the matching Wikipedia first sentence for an additional 13.7%; for the remaining 1.7% of QID-holding entities with neither, GPT-4.1-mini emits a one-line description from the in-transcript context. These KB descriptions feed the closed-world retrieval baselines only; they never enter an EEN. Entity representations serve three roles: retrieval targets for closed-world experiments, alias sources for evaluation matching, and description inputs for embedding-based approaches.

3.5 Entity-Level Train/Dev/Test Splitting

To ensure rigorous evaluation, the dataset is split at the entity level rather than the sample level. All samples for a given entity appear in exactly one partition, preventing information leakage where a model might learn entity-specific patterns from training examples and exploit them at test time. The split uses a 70/10/20 ratio (seed=42).
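An entity-level split of this kind can be sketched as below. This is a hedged illustration under stated assumptions: the sample field `qid`, the sorting step, and the rounding rule are choices made here, so the sketch is not guaranteed to reproduce the exact partition sizes of the released benchmark even with the same seed.

```python
import random
from collections import defaultdict


def entity_level_split(samples, seed=42, ratios=(0.7, 0.1, 0.2)):
    """Assign whole entities (not individual samples) to train/dev/test."""
    by_entity = defaultdict(list)
    for sample in samples:
        by_entity[sample["qid"]].append(sample)

    # Sort before shuffling so the result depends only on the seed.
    qids = sorted(by_entity)
    random.Random(seed).shuffle(qids)

    n_train = int(ratios[0] * len(qids))
    n_dev = int(ratios[1] * len(qids))
    partitions = {
        "train": qids[:n_train],
        "dev": qids[n_train:n_train + n_dev],
        "test": qids[n_train + n_dev:],  # remainder goes to test
    }
    # Every sample follows its entity into exactly one partition.
    return {
        name: [s for qid in part for s in by_entity[qid]]
        for name, part in partitions.items()
    }
```

Because the split is taken over entities, the sample-level proportions need not match 70/10/20 exactly: entities with many samples shift the balance toward whichever partition they land in.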

Table 2: IRC-Bench partition statistics. Entity-level splits ensure zero overlap between train, dev, and test entities.

Partition	Samples	Entities
Train	17,971	8,635
Dev	2,532	1,234
Test	4,633	2,468
Total	25,136	12,337

3.6 Entity Type Distribution

The dataset exhibits a natural long-tail distribution over entity types. Places dominate (47.3%), reflecting oral histories' emphasis on geographic locations. Organizations (21.3%) and Persons (13.7%) follow, while specialized types such as Events, Works, and Military Units are less frequent but still well represented. Table 3 reports the full distribution.

Table 3: Distribution of IRC-Bench samples by entity type.

Entity Type	Samples	% of Total	Unique Entities
Place	11,893	47.3%	4,821
Organization	5,366	21.3%	2,894
Person	3,450	13.7%	2,207
Event	2,162	8.6%	1,102
Work	1,195	4.8%	743
Military Unit	537	2.1%	312
Other	533	2.1%	258
Total	25,136	100%	12,337

3.7 Example Samples

Figure 3 presents three EGN/EEN pairs illustrating the diversity of implicit references in IRC-Bench.

Example 1: Person

EGN

"Rosa Parks was arrested on December 5, 1955, in Montgomery, Alabama, for refusing to give up her bus seat, an act that sparked the Montgomery bus boycott. E. D. Nixon called me late that night to inform me of her arrest and to urge action."

EEN

"A woman was arrested on December 5, 1955, in Montgomery, Alabama, for refusing to give up her bus seat, an act that sparked the Montgomery bus boycott. A local leader called me late that night to inform me of her arrest and to urge action."

Gold: Rosa Parks (Q41921) | Cues: December 5 1955, Montgomery Alabama, bus seat refusal, bus boycott

Example 2: Event

EGN

"I headed the relief committee during the disastrous Berkeley Fire of 1923, helping to coordinate aid and recovery efforts for the community. This was a challenging time for Berkeley, California, and I took an active role in organizing support to help residents rebuild."

EEN

"I headed the relief committee during the disastrous fire of 1923 in a California city, helping to coordinate aid and recovery efforts for the community. This was a challenging time for the city, and I took an active role in organizing support to help residents rebuild."

Gold: Berkeley Fire of 1923 (Q4561337)

Example 3: Organization (5 cues)

EGN

"After leaving the Navy in 1966, I worked in the warehouse at Montgomery Ward in Redwood City. It was a non-union job and pretty low-key, just me and an older lady doing pricing and warehouse work."

EEN

"After leaving the Navy in 1966, I worked in the warehouse at a national department store in Redwood City. It was a non-union job and pretty low-key, just me and an older lady doing pricing and warehouse work."

Gold: Montgomery Ward (Q3046) | Cues: Navy 1966, warehouse, national department store, Redwood City, non-union

Figure 3: Three EGN/EEN pairs from IRC-Bench. Blue highlights mark explicit entity mentions in EGNs; red highlights show the elided descriptions in EENs. Examples span Person, Event, and Organization types, demonstrating how distributed cues (dates, locations, roles, institutions) jointly identify the entity.

3.8 Pipeline Validation and Difficulty Calibration

Before release, Stage-4 outputs are validated on a random sample of 500 test-partition EENs (10.8% of the 4,633-sample test set; 95% binomial CIs of ±4.0 pp at the mean). Each sample is scored by GPT-4o as a structured judge that receives both the entity-grounded narrative (EGN) and the entity-elided narrative (EEN), under a fixed four-dimensional rubric (full prompt: Appendix B.3):

Naturalness on a 1 to 5 Likert scale: 1 = very awkward or robotic; 3 = passable but noticeably stilted; 5 = completely natural first-person speech (intermediate values interpolated).
Leakage (binary, alias-aware): does the entity name or an obvious alias still appear verbatim in the EEN?
Cue sufficiency on 1 to 5: are there enough contextual cues in the EEN for a knowledgeable human to identify the entity?
Recoverability on yes / probably / unlikely / no: could the judge identify the entity from the EEN alone?

EEN naturalness averages 4.87 out of 5 (433 fives, 67 fours, none below four), confirming that Stage 4 produces fluent first-person text. Leakage was detected on 6.8% of samples (34/500); these are removed by the alias-string-match filter described in Stage 4. Cue sufficiency averages 3.0 out of 5, indicating moderate overall difficulty with substantial variance.

Recoverability and the cue-imposed ceiling. 42.0% of samples are judged recoverable (5.8% "yes," 36.2% "probably"), 7.2% "possible with expertise," and 50.8% "unlikely" or "no." The 42% rate lies within 0.8 pp of the alias-aware accuracy of the two best systems (O10 QLoRA: 41.4%; C5 DPR: 42.8% alias Hit@1), suggesting that top systems approach the practical ceiling imposed by the available cues rather than being bottlenecked by model capacity. The judge prompt and all 500 raw judgments are released for community re-validation.

4. Methodology

4.1 Task Formulation

Implicit entity recognition, the task of identifying entities that are contextually referenced but never explicitly named, was first studied by Hosseini [21] in the context of tweets. We adopt the same core objective and extend it to long-form reminiscence narratives: given a first-person narrative text $T$ that implicitly references a named entity $e^*$ without ever mentioning $e^*$ by name, the task is to identify $e^*$. The text $T$ contains contextual cues (dates, locations, events, people, roles, descriptions) that jointly constrain the identity of $e^*$, but the model must synthesize these cues and draw on world knowledge to produce the correct entity name.

We evaluate implicit entity recognition under two formulations:

Open-world formulation. The model generates the entity name as free-form text, without access to a candidate set. This tests the model's ability to recall entities from its parametric knowledge. The open-world setting is more realistic, as it does not assume a closed inventory of possible entities.
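Scoring free-form output requires comparing a generated string against the gold name. The sketch below shows one plausible way to compute exact match, alias-aware match, and a token-level Jaccard score. The normalization and the token-set definition of Jaccard are assumptions made for illustration; the paper's precise metric definitions are not given in this section.

```python
import re


def normalize(name: str) -> str:
    """Lowercase and replace punctuation runs with single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def exact_match(pred: str, gold: str) -> bool:
    """Strict match on the normalized canonical name only."""
    return normalize(pred) == normalize(gold)


def alias_match(pred: str, gold: str, aliases: list[str]) -> bool:
    """Match against the canonical name or any listed alias."""
    p = normalize(pred)
    return any(p == normalize(n) for n in [gold, *aliases])


def token_jaccard(pred: str, gold: str) -> float:
    """Overlap of word sets: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(normalize(pred).split()), set(normalize(gold).split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Under these definitions a prediction such as "Pearl Harbor" for the gold name "Attack on Pearl Harbor" fails exact match but still receives partial Jaccard credit, which is one reason the two scores can diverge.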

Closed-world formulation. The model ranks all 12,337 entities in the knowledge base by relevance to the query text, selecting the highest-ranked candidate. This tests the model's ability to match implicit descriptions to entity representations via embedding similarity. The closed-world setting provides Hit@K metrics and is analogous to entity linking with a fixed knowledge base.
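For reference, Hit@K over a ranked candidate list can be computed as in the following minimal sketch. The function names are hypothetical, and the optional alias handling of the alias-aware variant is omitted.

```python
def hit_at_k(ranked_ids: list[str], gold_id: str, k: int) -> bool:
    """True if the gold entity appears among the top-k ranked candidates."""
    return gold_id in ranked_ids[:k]


def mean_hit_at_k(rankings: list[list[str]], golds: list[str], k: int) -> float:
    """Fraction of queries whose gold entity is ranked within the top k."""
    hits = sum(hit_at_k(r, g, k) for r, g in zip(rankings, golds))
    return hits / len(golds)
```

Hit@1 corresponds to selecting the single highest-ranked candidate, while larger K measures whether the gold entity is at least retrievable for a downstream reranker.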

4.2 The Non-Locality Property

Bridging anaphora [42] and zero-mention coreference [43] have long noted that entities can be referenced indirectly through related concepts; we strengthen this observation by formalizing the case where no contiguous identifying span exists in the entire passage. Let $C(T, e^*) = \{c_1, c_2, \ldots, c_n\}$ denote the set of textual cues in $T$ that collectively identify $e^*$. In standard NER and EL, the entity is localized: there exists a contiguous span $m$ that is sufficient to identify $e^*$. In implicit entity recognition, the entity is non-local:

$$\nexists \; m \subset T \;\text{s.t.}\; m \text{ is contiguous} \wedge m \Rightarrow e^*$$ $$\text{but}\; C(T,e^*) \Rightarrow e^*,\; c_i \text{ non-contiguous}$$

That is, no single contiguous substring of $T$ is sufficient to identify $e^*$, but the set of distributed cues collectively determines it. This non-locality has direct implications for method design: approaches that rely on local span matching (NER, EL) or single-vector passage encoding (dense retrieval) are structurally disadvantaged relative to approaches that can integrate information across the full text (LLMs with sufficient context windows).

We empirically validate non-locality by comparing GPT-4o zero-shot accuracy on full implicit texts versus individual sentences in isolation (n=200). Full-text accuracy reaches 33.5%, while single-sentence accuracy drops to 12.9%, a gap of 20.6 percentage points. This confirms that entity recognition requires integrating cues distributed across the entire passage; no single sentence carries sufficient information in the majority of cases.
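The sentence-ablation diagnostic can be sketched as follows. The regex sentence splitter, the per-sentence (micro-averaged) aggregation, and the `predict` callable are all assumptions made for illustration; in the experiment above the predictor is GPT-4o zero-shot, which is replaced here by any function mapping text to an entity name.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def ablation_accuracy(samples: list[dict], predict) -> tuple[float, float]:
    """Return (full-text accuracy, single-sentence accuracy).

    Single-sentence accuracy is averaged over all isolated sentences,
    each scored against the gold entity of its source passage.
    """
    full_hits = 0
    sent_hits = 0
    sent_total = 0
    for sample in samples:
        gold = sample["entity"]
        full_hits += predict(sample["implicit_text"]) == gold
        for sentence in split_sentences(sample["implicit_text"]):
            sent_hits += predict(sentence) == gold
            sent_total += 1
    return full_hits / len(samples), sent_hits / sent_total
```

A large gap between the two returned values indicates that the identifying cues are spread across sentences rather than concentrated in one.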

4.3 Open-World Methods

4.3.1 LLM Generative Approach

We evaluate LLMs in a generative setting where each model receives the implicit text and must produce the entity name. All direct-prompting models use temperature 0.0 (greedy decoding) and a maximum of 100 output tokens. We test zero-shot (ZS) and few-shot (FS, 5 fixed demonstrations) prompting strategies. Few-shot exemplars are selected to cover diverse entity types and are held constant across all test samples. Complete prompt templates appear in Appendix A.

4.3.2 Models

We evaluate the following LLMs in the open-world setting: GPT-4o [37] and GPT-4.1-mini (via the OpenAI Batch API), and Llama 3.1 8B Instruct [35, 36] (via the OpenRouter API). For GPT-4o, GPT-4.1-mini, and Llama 3.1 8B, we additionally evaluate chain-of-thought (CoT) prompting, which instructs the model to reason step-by-step before producing the final answer. CoT experiments use temperature 0.7 and a maximum of 300 output tokens to accommodate the reasoning trace.

4.3.3 QLoRA Fine-tuning (O10)

We fine-tune Llama 3.1 8B Instruct using QLoRA (Quantized Low-Rank Adaptation) [33, 34] for implicit entity recognition. The model is trained to generate the entity name given the implicit text, using the standard causal language modeling objective. The entity-level splitting guarantees zero overlap between training and test entities, so the fine-tuned model cannot memorize entity-specific patterns; it must learn to generalize the implicit-to-entity mapping to entirely unseen entities. Key training parameters include 4-bit NF4 quantization, LoRA rank 16, alpha 32, learning rate 2e-4, and 2 epochs of training on the full train split (17,971 samples). Full hyperparameters are reported in Appendix B.

4.4 Closed-World Methods

In the closed-world setting, we encode both the implicit query text and all 12,337 entity representations into a shared embedding space, then rank entities by cosine similarity. We explore three entity representation strategies: Name (the entity name alone), Description (the entity name concatenated with its LLM-generated description), and Wiki (the first sentence from the entity's Wikipedia article). For entities lacking a description or Wikipedia text, we fall back to the next available representation.

4.4.1 BGE-base Baseline (C1, C2, C3)

We use BAAI/bge-base-en-v1.5 [31] as our baseline embedding model. This 110M-parameter model produces 768-dimensional embeddings and ranks among the top general-purpose bi-encoders on the MTEB benchmark. Embeddings are L2-normalized before computing cosine similarity.
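The ranking step can be illustrated with a minimal sketch. It assumes that embeddings are already available as plain lists of floats; the toy three-dimensional vectors and the helper names l2_normalize and rank_entities are hypothetical stand-ins for the 768-dimensional BGE embeddings and are not taken from the released code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_entities(query_vec, entity_vecs):
    """Return entity ids sorted by descending cosine similarity to the query."""
    q = l2_normalize(query_vec)
    scores = {}
    for entity_id, vec in entity_vecs.items():
        e = l2_normalize(vec)
        # After L2 normalization the dot product equals cosine similarity.
        scores[entity_id] = sum(a * b for a, b in zip(q, e))
    return sorted(scores, key=scores.get, reverse=True)

# Toy vectors (assumption): stand-ins for real entity embeddings.
toy_entities = {
    "Q1": [1.0, 0.0, 0.0],
    "Q2": [0.6, 0.8, 0.0],
    "Q3": [0.0, 0.0, 1.0],
}
ranking = rank_entities([0.5, 0.9, 0.1], toy_entities)
```

With the full 12,337-entity candidate set, the same computation would normally be a single matrix product over pre-normalized embeddings rather than a Python loop.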

4.4.2 DPR Fine-tuning (C4, C5, C6)

We fine-tune BGE-base using a Dense Passage Retrieval (DPR) approach [30] with Multiple Negatives Ranking Loss (MNRL). Each training pair consists of an implicit text (query) and its gold entity representation (positive passage). MNRL uses in-batch negatives: for a batch of $B$ query-positive pairs, each positive for one query serves as a negative for all other queries, providing $B-1$ negatives per sample without explicit hard negative mining. We train for 3 epochs with batch size 48 and learning rate 2e-5. Three separate models are trained, one for each entity representation strategy.
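To make the in-batch-negatives objective concrete, the sketch below computes the loss for a $B \times B$ similarity matrix whose diagonal holds the gold query-passage pairs. It is a plain-Python illustration rather than the training code: the function name mnrl_loss is hypothetical, and the scale factor of 20 is an assumption, since the paper does not report it.

```python
import math

def mnrl_loss(sim, scale=20.0):
    """In-batch-negatives ranking loss for a B x B similarity matrix.

    sim[i][j] is the similarity between query i and passage j. The diagonal
    holds the gold pairs, so the other B-1 entries of row i act as negatives.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]  # cross-entropy with target i
    return total / len(sim)

# Toy batches with B = 3 (assumption): one well aligned, one confused.
aligned = [[0.9, 0.1, 0.2], [0.0, 0.8, 0.1], [0.2, 0.1, 0.7]]
confused = [[0.1, 0.9, 0.2], [0.8, 0.0, 0.1], [0.2, 0.7, 0.1]]
loss_aligned = mnrl_loss(aligned)
loss_confused = mnrl_loss(confused)
```

The loss is small when each query is most similar to its own positive and grows when another passage in the batch scores higher, which is why a batch size of 48 supplies 47 negatives per query without separate mining.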

4.5 RAG Baseline (RAG1)

We implement a Retrieval-Augmented Generation (RAG) baseline that combines embedding retrieval with LLM reranking. The pipeline operates in two stages. First, BGE-base with entity descriptions (C2 configuration) retrieves the top-5 candidate entities for each implicit query. Second, GPT-4.1-mini receives the implicit text along with the 5 candidates (with their descriptions) and selects the most likely entity or suggests a better one. This approach tests whether an LLM can effectively rerank retrieved candidates to improve over pure embedding retrieval.
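The two-stage structure can be sketched as follows. The retriever and the LLM are passed in as callables, so the example runs without any external service; the prompt wording and the names build_rerank_prompt and rag_predict are hypothetical and do not reproduce the templates in Appendix A.

```python
def build_rerank_prompt(een_text, candidates):
    """Format the elided narrative and the retrieved candidates for the LLM."""
    lines = ["Narrative:", een_text, "", "Candidate entities:"]
    for i, (name, description) in enumerate(candidates, start=1):
        lines.append(f"{i}. {name}: {description}")
    lines.append("")
    lines.append(
        "Answer with the most likely entity name, "
        "or a better one if none of the candidates fits."
    )
    return "\n".join(lines)

def rag_predict(een_text, retrieve, llm, k=5):
    """Stage 1 keeps the top-k retrieved candidates; stage 2 lets the LLM choose.

    retrieve(text) must return (name, description) pairs ordered by score,
    and llm(prompt) must return the model's text completion.
    """
    candidates = retrieve(een_text)[:k]
    return llm(build_rerank_prompt(een_text, candidates)).strip()
```

Because the LLM only sees the k shortlisted candidates, any query whose gold entity falls outside the shortlist can be answered correctly only if the model overrides the candidates from its own knowledge.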

4.6 Leakage Prevention

Information leakage between training and evaluation is controlled at two levels. Dataset level. The benchmark uses an entity-level train/dev/test split (70/10/20, seed=42), verified by set-intersection of entity_list_train.txt, entity_list_dev.txt, and entity_list_test.txt: every pairwise intersection is exactly zero, and every sample for a given entity is assigned to a single partition. Model level. For prompted models (O1 to O8, O11 to O13), the presence of test entities in pretraining is not a leak: entity knowledge is the task itself; only IRC-Bench-specific surface patterns are protected by the split. For QLoRA (O10) and fine-tuned DPR (C4 to C6), the entity-level split guarantees that the gold entity for any test query never appears as a training target or positive. For off-the-shelf retrievers (C1 to C3, RAG1 retrieval), there is no IRC-Bench-specific training. At corpus-construction time, a Stage-4 string-match leakage check using each entity's Wikidata alias list filtered 25 of 25,161 candidate EENs (6.8% detection rate, 100% removed) before release. All splits, alias lists, and the leakage-check script are released with the benchmark.
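The two dataset-level controls can be sketched in a few lines. The example assumes each sample is a dict with a hypothetical entity_id field; the function names, the rounding of partition sizes, and the case-insensitive substring rule are illustrative assumptions and may differ from the released leakage-check script.

```python
import random

def entity_level_split(samples, seed=42, ratios=(0.7, 0.1, 0.2)):
    """Assign every sample of an entity to exactly one partition."""
    entities = sorted({s["entity_id"] for s in samples})
    random.Random(seed).shuffle(entities)
    n_train = round(ratios[0] * len(entities))
    n_dev = round(ratios[1] * len(entities))
    groups = {
        "train": set(entities[:n_train]),
        "dev": set(entities[n_train:n_train + n_dev]),
        "test": set(entities[n_train + n_dev:]),  # remainder goes to test
    }
    split = {
        name: [s for s in samples if s["entity_id"] in ids]
        for name, ids in groups.items()
    }
    return split, groups

def leaks_alias(een_text, aliases):
    """Case-insensitive substring check of an elided narrative against aliases."""
    text = een_text.lower()
    return any(alias.lower() in text for alias in aliases if alias)

# Toy corpus (assumption): 30 samples spread over 10 entities.
samples = [{"entity_id": f"Q{i % 10}", "een": f"narrative {i}"} for i in range(30)]
split, groups = entity_level_split(samples)
```

Splitting on entity identifiers rather than on samples is what makes every pairwise intersection of the entity lists empty by construction.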

5. Evaluation Protocol

5.1 Matching Hierarchy

Entity names can be expressed in multiple valid forms (e.g., "United States Marine Corps" vs. "USMC" vs. "Marines"). To account for this variation, we employ a four-tier matching hierarchy, applied in order of decreasing strictness:

Tier 1 (Exact match): The prediction and gold entity are identical after lowercasing and whitespace trimming.

Tier 2 (Alias match): The prediction matches one of the gold entity's known aliases from Wikidata. For example, predicting "NYC" for gold entity "New York City" is an alias match.

Tier 3 (Containment match): The prediction is a substring of the gold entity, or vice versa. For example, predicting "Pearl Harbor" for "Attack on Pearl Harbor" qualifies as a containment match.

Tier 4 (Jaccard match): The token-level Jaccard similarity between the prediction and gold entity is at least 0.5. This captures partial overlaps where the prediction includes most of the relevant tokens.

A prediction is considered correct at a given tier if it matches at that tier or any stricter tier. When reporting alias-aware accuracy (the primary metric for open-world experiments), we count any prediction that achieves Tier 1 or Tier 2 as correct.
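The hierarchy can be expressed as a single function that returns the strictest tier a prediction satisfies. This is a minimal sketch under stated assumptions: tokens are obtained by whitespace splitting, containment is a character-level substring test, and the names normalize, match_tier, and accuracy_at are hypothetical rather than taken from the evaluation scripts.

```python
def normalize(text):
    """Lowercase and trim surrounding whitespace."""
    return text.lower().strip()

def jaccard(a, b):
    """Token-level Jaccard similarity between two normalized strings."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def match_tier(prediction, gold, aliases=()):
    """Return the strictest tier (1 to 4) the prediction satisfies, else None."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred:
        return None  # an empty answer would otherwise pass the substring test
    if pred == ref:
        return 1
    if pred in {normalize(a) for a in aliases}:
        return 2
    if pred in ref or ref in pred:
        return 3
    if jaccard(pred, ref) >= 0.5:
        return 4
    return None

def accuracy_at(tiers, max_tier):
    """Cumulative accuracy: share of predictions matched at max_tier or stricter."""
    return sum(1 for t in tiers if t is not None and t <= max_tier) / len(tiers)
```

For example, match_tier("Pearl Harbor", "Attack on Pearl Harbor") falls through the exact and alias checks and is accepted at Tier 3, while accuracy_at(tiers, 2) gives the alias-aware accuracy over a list of per-sample tiers.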

5.2 Metrics

Open-world experiments report exact match (Tier 1), alias match (Tiers 1+2), containment match (Tiers 1+2+3), and Jaccard match (all four tiers). Closed-world experiments report Hit@K (K = 1, 3, 5, 10), Mean Reciprocal Rank (MRR), and alias-aware Hit@K (where a hit counts if any alias of the gold entity appears in the top-K).
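The closed-world metrics reduce to the rank of the first acceptable item in each ranked list. In the sketch below, each query is paired with a set of accepted identifiers: a singleton set gives the strict metric, and adding aliases gives an alias-aware variant. The function names and the toy rankings are illustrative assumptions.

```python
def first_hit_rank(ranked, accepted):
    """1-based rank of the first accepted item, or None if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in accepted:
            return rank
    return None

def hit_at_k(rankings, accepted_sets, k):
    """Share of queries whose first accepted item is ranked within the top k."""
    ranks = [first_hit_rank(r, a) for r, a in zip(rankings, accepted_sets)]
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(rankings, accepted_sets):
    """Mean of 1/rank of the first accepted item (0 when it is not retrieved)."""
    ranks = [first_hit_rank(r, a) for r, a in zip(rankings, accepted_sets)]
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Toy rankings for three queries (assumption).
rankings = [["Q1", "Q2", "Q3"], ["Q5", "Q4", "Q6"], ["Q7", "Q8", "Q9"]]
strict = [{"Q1"}, {"Q4"}, {"Q0"}]
alias_aware = [{"Q1"}, {"Q4", "Q5"}, {"Q0"}]
```

On these toy rankings the strict Hit@1 is one in three, while the alias-aware Hit@1 is two in three because the second query's top result is an accepted alias.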

5.3 Statistical Significance

To assess whether performance differences between methods are statistically significant, we use McNemar's test (with continuity correction) on the paired per-sample outcomes from each pair of compared systems. Additionally, we compute bootstrap confidence intervals (1,000 resamples, seed=42) at the 95% level.
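Both procedures are short enough to sketch directly. The McNemar statistic below uses the continuity-corrected form $(|b - c| - 1)^2 / (b + c)$ over the two discordant counts, and the bootstrap uses the percentile method. The percentile method and the function names are assumptions, since the paper does not specify how the intervals were constructed.

```python
import random

def mcnemar_chi2(a_only, b_only):
    """Continuity-corrected McNemar statistic from the two discordant counts."""
    n = a_only + b_only
    return (abs(a_only - b_only) - 1) ** 2 / n if n else 0.0

def bootstrap_ci(outcomes, n_resamples=1000, seed=42, level=0.95):
    """Percentile bootstrap interval for the mean of per-sample 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper

# Discordant counts reported for GPT-4o zero-shot vs. few-shot (120 vs. 400).
chi2_zs_vs_fs = mcnemar_chi2(120, 400)

# Toy outcome vector (assumption): 35 correct predictions out of 100.
ci_low, ci_high = bootstrap_ci([1] * 35 + [0] * 65)
```

Only the discordant pairs enter the McNemar statistic, so two systems with similar accuracy can still differ significantly if they succeed on different samples.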

6. Results and Analysis

6.1 Open-World Performance

Table 4 presents the open-world results across all experimental configurations. The QLoRA-adapted Llama 3.1 8B (O10) achieves the highest exact match accuracy at 38.94%, substantially outperforming all other open-world methods. Among non-fine-tuned models, GPT-4o with few-shot prompting (O2) is the strongest at 31.62% exact match, rising to 41.10% under the full four-tier Jaccard evaluation.

Model scale is the dominant factor for zero-shot performance: moving from Llama 3.1 8B (13.92%) to GPT-4.1-mini (25.71%) to GPT-4o (27.02%) yields consistent gains. Few-shot prompting consistently improves performance across all model sizes (p < 0.001 by McNemar's test). The improvement ranges from +2.95 percentage points for GPT-4.1-mini to +4.60 points for GPT-4o. The few-shot examples appear to calibrate the model's output format and entity granularity, reducing cases where models produce entity types instead of specific entity names.

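Table 4: Open-world results. Matching tiers are cumulative (Section 5.1). Rows O11/b and O12/b report values as t=0.7 / t=0.0.

ID	Model	Mode	Exact (%)	Alias (%)	Contain (%)	Jaccard (%)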
O1	GPT-4o	Zero-shot	27.02	33.30	33.30	35.05
O2	GPT-4o	Few-shot	31.62	38.94	38.94	41.10
O3	GPT-4.1-mini	Zero-shot	25.71	27.09	33.50	35.94
O4	GPT-4.1-mini	Few-shot	28.66	36.89	36.89	39.48
O5	Llama 3.1 8B	Zero-shot	13.92	14.81	19.47	20.18
O6	Llama 3.1 8B	Few-shot	17.83	18.80	24.61	25.66
O10	Llama 3.1 8B (QLoRA)	Fine-tuned	38.94	41.42	47.90	51.59
O11/b	GPT-4.1-mini CoT	t=0.7 / t=0.0	18.93 / 19.44	20.27 / 20.76	26.48 / 26.87	27.69 / 28.10
O12/b	GPT-4o CoT	t=0.7 / t=0.0	22.51 / 25.57	23.89 / 33.54	30.91 / 37.21	32.33 / 38.92
O13	Llama 3.1 8B CoT	t=0.7	6.22	6.69	11.72	12.24
RAG1	BGE + GPT-4.1-mini	RAG	19.71	20.53	28.75	29.55

The most striking open-world result is the effect of QLoRA fine-tuning. O10 (QLoRA Llama 3.1 8B) achieves 38.94% exact match, nearly tripling the base model's zero-shot performance (13.92%) and exceeding GPT-4o few-shot (31.62%) by 7.32 percentage points. At the Jaccard level, O10 reaches 51.59%, meaning more than half of all test predictions are at least partially correct. This is particularly notable given the entity-level split: O10 has never seen any of the 2,468 test entities during training, demonstrating genuine generalization of the implicit-to-entity mapping.

The failure of chain-of-thought prompting is equally striking. At t=0.7, CoT reduces GPT-4o alias accuracy from 33.30% (zero-shot) to 23.89%, GPT-4.1-mini exact match from 25.71% (zero-shot) to 18.93%, and Llama 3.1 8B exact match from 13.92% (zero-shot) to 6.22%. Finding 3 in Section 6.6 separates the effect of sampling temperature from that of the reasoning trace, and we analyze the reasons for the failure in Section 7.

The hybrid RAG approach (19.71% exact match) underperforms even GPT-4.1-mini zero-shot (25.71%). When the gold entity does not appear among the top-5 candidates (which occurs in roughly 67% of cases with BGE-base, given C2's Hit@5 of 33.41%), the LLM reranker cannot recover it.

6.2 Closed-World Performance

Table 5 shows the closed-world retrieval results. Fine-tuned DPR with description representations (C5) achieves the best performance: 35.38% Hit@1, 71.49% Hit@10, and 0.4751 MRR. With alias-aware evaluation, C5 reaches 42.80% Hit@1 and 74.47% Hit@10.

Table 5: Closed-world retrieval results over the 12,337-entity candidate set.

ID	Retriever	Entity Repr.	Hit@1 (%)	Hit@3 (%)	Hit@5 (%)	Hit@10 (%)	MRR	Alias H@1 (%)
C1	BGE (off-the-shelf)	Name	16.51	26.38	30.97	36.76	0.2362	22.08
C2	BGE (off-the-shelf)	Description	16.64	27.78	33.41	40.60	0.2480	21.78
C3	BGE (off-the-shelf)	Wiki	14.38	25.10	29.92	37.32	0.2211	19.32
C4	DPR (fine-tuned)	Name	30.00	46.36	53.66	63.31	0.4131	37.10
C5	DPR (fine-tuned)	Description	35.38	53.51	61.82	71.49	0.4751	42.80
C6	DPR (fine-tuned)	Wiki	27.95	44.98	51.82	59.55	0.3851	34.38

The comparison between off-the-shelf BGE and fine-tuned DPR reveals the magnitude of domain adaptation benefits. DPR fine-tuning raises Hit@1 by a factor of 1.8 to 2.1 for every entity representation type: Name (16.51% to 30.00%, +13.49 pp), Description (16.64% to 35.38%, +18.74 pp), and Wiki (14.38% to 27.95%, +13.57 pp). Only the Description representation more than doubles. It also shows the largest absolute gain, indicating that fine-tuning is especially effective at learning to align the narrative cue structure with the rich attribute content in entity descriptions.

Across both retrieval architectures, entity description representations consistently outperform name-only and Wikipedia representations. Descriptions provide a concise, attribute-rich summary that aligns well with the contextual cues present in elided narratives. Wikipedia lead sentences, despite containing more information, introduce noise from tangential content.

6.3 Cross-Paradigm Comparison

Table 6 ranks the top-performing systems across both paradigms, using the most lenient accuracy measure available in each.

Table 6: Cross-paradigm ranking. Open-world scores are four-tier (Jaccard-level) accuracy from Table 4; closed-world scores are alias-aware Hit@1 from Table 5.

Rank	System	Paradigm	Score (%)
1	O10 (QLoRA Llama 8B)	Open	51.59
2	C5 (DPR + Description)	Closed	42.80
3	O2 (GPT-4o FS)	Open	41.10
4	O4 (GPT-4.1-mini FS)	Open	39.48
5	C4 (DPR + Name)	Closed	37.10
6	O3 (GPT-4.1-mini ZS)	Open	35.94
7	O1 (GPT-4o ZS)	Open	35.05
8	C6 (DPR + Wiki)	Closed	34.38

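The fine-tuned QLoRA model (O10) leads by a substantial margin, achieving 51.59% Jaccard accuracy. The fine-tuned DPR retriever (C5) ranks second at 42.80% alias-aware Hit@1, ahead of GPT-4o few-shot (41.10% Jaccard). Because the two paradigms are scored with different matching rules, the ranking is indicative rather than strictly like-for-like. The result is still notable: C5 uses only a 110M-parameter embedding model, whereas GPT-4o, whose parameter count is undisclosed, is generally assumed to be far larger.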

6.4 Per-Entity-Type Analysis

Performance varies substantially by entity type. Table 7 reports the alias-aware Hit@1 (all tiers) for selected methods.

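Table 7: Per-entity-type results (%) for selected methods; n is the number of test samples of each type.

Entity Type	n	O1 GPT-4o ZS	O2 GPT-4o FS	O5 Llama 8B ZS	C1 BGE Name	C2 BGE Desc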
Place	2,076	38.15	43.88	18.16	14.88	15.99
Organization	1,152	38.28	45.31	27.34	29.17	27.17
Person	698	23.82	24.07	14.90	18.34	18.62
Event	273	34.43	50.18	27.11	48.35	47.25
Work	215	32.09	39.53	14.42	39.07	36.74
Military Unit	121	26.45	37.19	10.74	23.97	31.40
Other	98	30.61	36.73	21.43	33.67	25.51

Persons are the hardest type for open-world methods. GPT-4o FS achieves only 24.07% on Person entities, compared to 45.31% on Organizations and 43.88% on Places. Person entities often have less distinctive contextual cues and are more likely to be obscure individuals not well represented in model training data.

Events are notably strong for closed-world methods. Off-the-shelf BGE achieves 48.35% on Events with name representations (C1) and 47.25% with descriptions (C2), higher than on any other type in either column, suggesting that events carry distinctive semantic signatures that align well with implicit event narratives.

Few-shot examples disproportionately help Events. GPT-4o jumps from 34.43% (ZS) to 50.18% (FS) on Events (+15.75 pp), the largest per-type improvement, likely because the few-shot examples include two Event instances (Attack on Pearl Harbor).

6.5 Error Analysis

We performed automated error classification on 200 randomly sampled incorrect predictions from each of O1 through O6, using GPT-4.1-mini to categorize errors. Table 8 reports the distribution.

Table 8: Error-type distribution (% of 200 sampled incorrect predictions per system).

Error Type	O1 4o ZS	O2 4o FS	O3 mini ZS	O4 mini FS	O5 Llama ZS	O6 Llama FS
Same-type, unrelated	43.0	42.0	43.5	45.0	52.0	46.0
Wrong type	28.5	27.5	29.5	22.5	31.0	35.0
Same-type, related	24.5	25.5	22.5	24.0	13.5	17.0
Partial match	3.5	4.0	3.0	6.0	2.5	1.5
Empty / hallucination	0.5	1.0	1.5	2.5	1.0	0.0

The dominant error mode across all models is same-type, unrelated (42% to 52%), where the model predicts an entity of the correct type but one that is semantically unrelated to the gold entity (e.g., predicting "Jack Johnson" when the gold is "Lou Ambers," both boxers). The second most common error is wrong type (22.5% to 35.0%), where the model predicts an entity of an entirely different category. Same-type, related errors (13.5% to 25.5%) represent near-misses where the prediction is semantically close to the gold (e.g., predicting "Okinawa" for "Iwo Jima"). Hallucinations and empty responses are rare (at most 2.5%), indicating that models reliably produce plausible entity names even when incorrect.

Llama 3.1 8B (O5, O6) shows a higher proportion of same-type, unrelated errors (52.0% and 46.0%) and a lower proportion of same-type, related errors (13.5% and 17.0%) compared to GPT models (O1, O2: 24.5% and 25.5%). This suggests that smaller models have weaker ability to narrow down candidates within a type using fine-grained contextual cues.

6.6 Key Findings Summary

We summarize the principal findings as a numbered list, with each claim supported by specific experimental comparisons:

Finding 1: Fine-tuning is the most impactful intervention. QLoRA fine-tuning of Llama 3.1 8B raises exact match from 13.92% (O5, zero-shot) to 38.94% (O10), a 2.80x improvement. DPR fine-tuning of BGE raises Hit@1 from 16.64% (C2) to 35.38% (C5), a 2.13x improvement. Both gains are achieved despite zero entity overlap between training and test sets.

Finding 2: QLoRA fine-tuning yields the overall best performance. O10 achieves 38.94% exact match (51.59% Jaccard), surpassing GPT-4o few-shot (31.62% exact, 41.10% Jaccard) by 7.32 pp on exact match and 10.49 pp on Jaccard. This result is achieved with only 6.5M trainable parameters on top of an 8B-parameter base.

Finding 3: Chain-of-thought never helps and degrades smaller models. At t=0.7, CoT reduces GPT-4o from 33.30% (ZS alias) to 23.89% (a 28.3% relative drop), GPT-4.1-mini from 25.71% (ZS exact) to 18.93% (a 26.4% relative drop), and Llama 3.1 8B from 13.92% (ZS exact) to 6.22% (a 55.3% relative drop). To rule out temperature as a confounding factor (CoT experiments used t=0.7 vs. t=0.0 for direct prompting), we repeated O11 and O12 at t=0.0. For GPT-4.1-mini, the effect of temperature is negligible (+0.5 pp alias), indicating that CoT structurally degrades performance on this task. For GPT-4o, lowering temperature recovers 9.6 pp (alias rising from 23.9% to 33.5%), reaching parity with zero-shot (33.3%) without a meaningful gain. For GPT-4o, then, the temperature difference accounts for the majority of the observed CoT penalty, while the reasoning structure itself neither helps nor hurts. For GPT-4.1-mini, CoT is harmful regardless of temperature. The Llama 3.1 8B CoT run (O13) was not repeated at t=0.0, so its penalty is not temperature-controlled.

Finding 4: Few-shot prompting consistently helps. Adding 5 demonstrations improves GPT-4o from 27.02% to 31.62% (+4.60 pp), GPT-4.1-mini from 25.71% to 28.66% (+2.95 pp), and Llama 3.1 8B from 13.92% to 17.83% (+3.91 pp). All differences are significant (p < 0.001).

Finding 5: Entity descriptions are the best retrieval representation. C5 (DPR+Desc) outperforms C4 (DPR+Name) by 5.38 pp on Hit@1 (35.38% vs. 30.00%) and C6 (DPR+Wiki) by 7.43 pp (35.38% vs. 27.95%). The pattern holds for off-the-shelf BGE as well.

Finding 6: RAG underperforms direct LLM inference. RAG1 (19.71% exact match) is 6.00 pp below GPT-4.1-mini zero-shot (25.71%) and 8.95 pp below GPT-4.1-mini few-shot (28.66%). The retrieval bottleneck is the limiting factor.

Finding 7: Model scale matters substantially in the zero-shot regime. GPT-4o ZS (27.02%) outperforms Llama 3.1 8B ZS (13.92%) by 13.10 pp on exact match. The difference is significant on the paired four-tier outcomes reported in Table 9 (McNemar chi-squared = 432.28, p < 0.001, with 892 vs. 203 discordant pairs).

Finding 8: The retriever's Hit@10 reveals strong latent signal. C5 achieves 71.49% Hit@10, meaning the gold entity is in the top-10 for more than 70% of queries. Combining DPR shortlists with LLM reranking is a promising direction.

6.7 Statistical Significance

All key comparisons are statistically significant at p < 0.001 (McNemar's test with continuity correction). Table 9 reports the detailed results.

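Table 9: McNemar's test on paired per-sample outcomes. A-only and B-only count the samples solved by only one of the two systems.

Comparison	Acc A (%)	Acc B (%)	McNemar χ²	A-only	B-only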
O1 vs O2 (ZS vs FS, GPT-4o)	35.06	41.11	149.69	120	400
O3 vs O4 (ZS vs FS, mini)	35.16	38.72	36.18	150	275
O1 vs O5 (GPT-4o vs Llama 8B)	35.06	20.19	432.28	892	203
O1 vs C2 (Open vs Closed)	35.06	22.58	181.93	1,204	626

The discordant pair counts are informative: for the GPT-4o vs. Llama 8B comparison, 892 samples are solved only by GPT-4o while only 203 are solved only by Llama 8B, demonstrating a strong directional advantage. For the ZS vs. FS comparison on GPT-4o (O1 vs. O2), 400 samples are gained while only 120 are lost, confirming that few-shot examples provide a net benefit with limited trade-offs. The 95% bootstrap confidence intervals confirm non-overlapping ranges for all reported comparisons.

7. Discussion

7.1 Implications for system design

Three actionable consequences follow from our results. (i) RAG bottleneck. The RAG pipeline underperforms direct generation (19.71% vs. 28.66% exact match for GPT-4.1-mini FS). Dense retrievers encode the EEN as a single vector and lose fine-grained cue information, so retrieved candidates are topically related but often incorrect, and when presented as context they can override the model's own correct intuition. With BGE-base, the gold entity appears in the top-5 only 33.41% of the time, which severely limits the reranker. Systems indexing oral-history archives over Wikidata-aligned KBs should expect this ~9 pp penalty unless they fine-tune the retriever. (ii) Disable CoT. The failure of chain-of-thought is the most counterintuitive finding. CoT improves mathematical reasoning and multi-hop QA by decomposing problems [32], yet it never improves implicit entity recognition in our experiments and degrades it for the smaller models. Our interpretation is that identifying an implicit entity requires attending simultaneously to a constellation of distributed cues; when forced to reason step by step, models fixate on individual cues in isolation and arrive at locally plausible but globally incorrect entities. Temperature control experiments (O11b, O12b) indicate that the penalty is structural for GPT-4.1-mini (+0.5 pp alias at t=0.0, still 6.3 pp below ZS), whereas for GPT-4o controlling temperature restores parity with zero-shot (33.5% vs. 33.3% alias) without a meaningful gain. CoT therefore neither helps nor hurts GPT-4o once temperature is controlled, but it harms GPT-4.1-mini regardless of temperature. (iii) Cheap QLoRA wins. A short QLoRA pass on a domain corpus adds 25 pp of exact match to an 8B open model (13.92% to 38.94%) and overtakes a frontier closed model (GPT-4o few-shot, 31.62%) at under 20 USD of GPU cost, materially changing the cost curve for building entity-grounding components for archives.

7.2 Generalization and boundary conditions

The non-locality property is language- and domain-agnostic: it follows from how speakers refer, not from English-specific phenomena. Our empirical claims are bounded by language (English), narrator perspective (first-person reminiscence), entity universe (Wikidata-linkable), and domain (US-centric oral histories). We expect the pipeline to transfer to non-English memoirs and to clinical case histories, but cue density on celebrity entities (over-represented in pre-training) and on hyper-local entities (under-represented) sets the realistic accuracy band; the per-entity-type spread for GPT-4o few-shot (Persons 24%, Events 50%) is a within-domain reflection of this, driven by cue specificity and KB-neighborhood density. The success of QLoRA (O10: 38.94% exact, up from 13.92% base) despite zero entity overlap suggests that fine-tuning teaches three transferable skills rather than memorized answers: the task format (extracting a single canonical name), cue integration patterns (which temporal, spatial, and relational combinations are diagnostic), and entity type priors (calibrating expectations to reduce wrong-type errors). In short, the model learns "how to solve implicit entity puzzles." Comparing the best open-world (O10: 38.94% exact, 41.42% alias) and closed-world (C5: 35.38% Hit@1, 42.80% alias) results, both paradigms reach comparable alias-level performance, while C5's 71.49% Hit@10 suggests that combining fine-tuned retrieval shortlists with fine-tuned LLM reranking is a promising future direction.

7.3 Application areas

IRC-Bench supports four downstream application areas. (i) Digital humanities: automatic entity-grounded indexing of the 10⁵-order oral history corpora curated by national archives. (ii) Reminiscence therapy systems: grounding implicit references in dialogue to retrieve relevant photos or documents [40]. (iii) Privacy-preserving de-identification: the Stage-4 EENs are a controlled testbed for whether an "anonymized" narrative remains re-identifiable from contextual cues alone, with policy implications for archive release. (iv) LLM knowledge evaluation: IRC-Bench measures knowledge of marginalized, regional, and historical entities under-represented in QA benchmarks, exposing gaps that MMLU-style evaluations do not.

7.4 What IRC-Bench measures that prior tasks do not

NER measures span localization; entity linking measures name-to-KB resolution; coreference resolution measures intra-document anaphora given at least one explicit mention. IRC-Bench measures knowledge-grounded cue integration in the absence of any antecedent, complementing rather than substituting for any of the three. The non-locality ablation (33.5% full-text vs. 12.9% single-sentence, §4.2) is the empirical operationalization of this distinction; we recommend it as a per-task diagnostic for new datasets that claim implicit-reference status.

7.5 Threats to validity

LLM-family overlap. The pipeline uses GPT-4.1-mini (Stage 4 rewriter), GPT-4o (judge), and a mix of OpenAI and Llama models as evaluation targets. The strongest evidence against systematic family bias is that the best system on IRC-Bench is O10, a QLoRA-adapted Llama 3.1 8B at 38.94% exact match, exceeding GPT-4o few-shot (31.62%) by 7.32 pp; if our pipeline favored OpenAI models, GPT-4o would top the leaderboard, which it does not. The judge (GPT-4o) differs in training mix and class from the generator (GPT-4.1-mini), and the leakage check is an objective alias-string match independent of judge subjectivity. The retrieval line (BGE-base, a Hugging Face encoder) agrees with the strongest prompted LLM that Events are the easiest type (48.35% for C1 vs. 50.18% for O2 in Table 7), although its hardest type is Places rather than Persons. The shared ranking of Events is a partial cross-check that the difficulty signal is not GPT-specific.

AI-only validation. We acknowledge the absence of human ratings on the calibration set as a limitation, and release the GPT-4o judge prompt and 500 raw judgments for community re-validation. The strongest argument against LLM-judge bias affecting the headline finding is that the GPT-4o-judged recoverability rate (42.0%) matches the best Llama-based system's alias accuracy (41.4%) within 0.6 pp; the two estimates are produced by architecturally independent pipelines.

Other limitations. The benchmark covers English-language oral histories focused primarily on American experiences. The LLM-generated entity elision, while validated (naturalness 4.87/5, leakage 6.8% detected and 100% filtered, recoverability 42.0% matching the best system within 0.6 pp), is not a substitute for human-written implicit references. Alias-aware evaluation still penalizes semantically correct predictions using unregistered surface forms. Temperature controls were not run for Llama 3.1 8B CoT. QLoRA training used max_seq=192 tokens, which truncates approximately 6% of test prompts.

8. Conclusion

We have extended implicit entity recognition, previously studied in short social-media text [21, 22], to the domain of long-form reminiscence narratives, formalizing the non-locality property that distinguishes this setting and empirically validating it through a sentence-level ablation showing a 20.6 pp accuracy gap between full-text and single-sentence inference. We release IRC-Bench, a benchmark of 25,136 samples spanning 12,337 Wikidata-linked entities from 1,994 oral history transcripts across 11 thematic domains. Our systematic evaluation across 19 experimental configurations yields eight key findings (Section 6.6), the most consequential of which we recap here.

First, fine-tuning is the single most impactful intervention. QLoRA-adapted Llama 3.1 8B achieves 38.94% exact match (51.59% Jaccard), nearly tripling the base model's zero-shot performance and surpassing GPT-4o few-shot by 7.32 percentage points, despite the entity-level split ensuring zero overlap with training entities. In the closed-world setting, DPR fine-tuning of BGE-base more than doubles Hit@1 from 16.64% to 35.38%, with the gold entity appearing in the top-10 for 71.49% of queries.

Second, chain-of-thought prompting at t=0.7 lowers exact match for every model (by 4.51 to 7.70 pp). Temperature-controlled runs show that for GPT-4o the penalty is largely attributable to the higher sampling temperature rather than the reasoning structure itself, whereas for GPT-4.1-mini it persists at t=0.0. In no case does CoT meaningfully improve on zero-shot exact or alias accuracy, which is consistent with implicit entity recognition relying on holistic pattern matching rather than sequential reasoning. Third, retrieval-augmented generation underperforms direct LLM inference due to the non-locality of implicit cues. Fourth, model scale is the dominant factor in zero-shot open-world accuracy, with performance spanning from 13.92% (Llama 3.1 8B) to 27.02% (GPT-4o) in exact match. Fifth, entity descriptions are consistently the best representation for dense retrieval, outperforming both entity names and Wikipedia lead sentences.

Future work should explore several promising directions: multi-modal implicit entity recognition incorporating audio features from the original recordings, cross-lingual benchmarks constructed from oral history archives in other languages, hybrid approaches that combine fine-tuned DPR shortlists with fine-tuned LLM reranking (leveraging C5's 71.49% Hit@10), and the development of specialized architectures that explicitly model the non-locality property of implicit entity cues through structured attention over distributed text spans.

References

[1] Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3-26.

[2] Li, J., Sun, A., Han, J., and Li, C. (2022). A survey on deep learning for named entity recognition. IEEE TKDE, 34(1):50-70.

[3] Ganea, O.-E. and Hofmann, T. (2017). Deep joint entity disambiguation with local neural attention. In Proc. EMNLP, pages 2619-2629.

[4] Kolitsas, N., Ganea, O.-E., and Hofmann, T. (2018). End-to-end neural entity linking. In Proc. CoNLL, pages 519-529.

[5] Lee, K., He, L., Lewis, M., and Zettlemoyer, L. (2017). End-to-end neural coreference resolution. In Proc. EMNLP, pages 188-197.

[6] Boyd, D. A. (2012). Achieving the promise of oral history in a digital age. In Ritchie, D. A., editor, The Oxford Handbook of Oral History. Oxford University Press.

[7] Lazar, A., Demiris, G., and Thompson, H. (2016). Evaluation of a multifunctional technology system in a memory care unit: Opportunities for innovation in dementia care. Informatics for Health and Social Care, 41(4):373-389.

[8] Subramaniam, P. and Woods, B. (2012). The impact of individual reminiscence therapy for people with dementia. Expert Review of Neurotherapeutics, 12(5):545-555.

[9] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. In Proc. NAACL, pages 260-270.

[10] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL, pages 4171-4186.

[11] Xie, T., Li, Q., Zhang, J., Zhang, Y., Liu, Z., and Wang, H. (2023). Empirical study of zero-shot NER with ChatGPT. In Proc. EMNLP, pages 7935-7956.

[12] Ashok, D. and Lipton, Z. C. (2023). PromptNER: Prompting for named entity recognition. arXiv preprint arXiv:2305.15444.

[13] Sang, E. T. K. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task. In Proc. CoNLL, pages 142-147.

[14] Malmasi, S., et al. (2022). MultiCoNER: A large-scale multilingual dataset for complex named entity recognition. In Proc. COLING.

[15] Li, J., Fei, H., Liu, J., Wu, S., Zhang, M., Teng, C., Ji, D., and Li, F. (2022). Unified named entity recognition as word-word relation classification. In Proc. AAAI.

[16] Zhou, W., Zhang, S., Gu, Y., Chen, M., and Poon, H. (2024). UniversalNER: Targeted distillation from large language models for open named entity recognition. In Proc. ICLR.

[17] Wu, L., Petroni, F., Josifoski, M., Riedel, S., and Zettlemoyer, L. (2020). Scalable zero-shot entity linking with dense entity retrieval. In Proc. EMNLP, pages 6397-6407.

[18] De Cao, N., Izacard, G., Riedel, S., and Petroni, F. (2021). Autoregressive entity retrieval. In Proc. ICLR.

[19] Ayoola, T., Tyagi, S., Fisher, J., Christodoulopoulos, C., and Pierleoni, A. (2022). ReFinED: An efficient zero-shot-capable approach to end-to-end entity linking. In Proc. NAACL (Industry Track).

[20] Botha, J. A., Shan, Z., and Gillick, D. (2020). Entity linking in 100 languages. In Proc. EMNLP, pages 7833-7845.

[21] Hosseini, H. (2022). Implicit entity recognition and linking in tweets. PhD thesis, Toronto Metropolitan University.

[22] Hosseini, H. and Bagheri, E. (2021). Learning to rank implicit entities on Twitter. Information Processing & Management, 58(3):102503.

[23] Perera, N., Dehmer, M., and Emmert-Streib, F. (2020). Named entity recognition and relation detection for biomedical information extraction. Frontiers in Cell and Developmental Biology, 8:673.

[24] Treder, M. S., Lee, S., and Tsvetanov, K. A. (2024). Introduction to large language models (LLMs) for dementia care and research. Frontiers in Dementia, 3:1385303.

[25] Broadbent, E., Stafford, R., and MacDonald, B. (2009). Acceptance of healthcare robots for the older population: Review and future directions. International Journal of Social Robotics, 1(4):319-330.

[26] de Jager, A., Fogarty, A., Tewson, A., Lenette, C., and Boydell, K. M. (2017). Digital storytelling in research: A systematic review. The Qualitative Report, 22(10):2548-2582.

[27] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proc. EMNLP.

[28] Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., and Miller, A. (2019). Language models as knowledge bases? In Proc. EMNLP.

[29] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. NeurIPS.

[30] Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W. (2020). Dense passage retrieval for open-domain question answering. In Proc. EMNLP, pages 6769-6781.

[31] Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., and Nie, J.-Y. (2023). C-Pack: Packaged resources to advance general Chinese embedding. arXiv preprint arXiv:2309.07597.

[32] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Proc. NeurIPS.

[33] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In Proc. ICLR.

[34] Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized language models. In Proc. NeurIPS.

[35] Touvron, H., Martin, L., Stone, K., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[36] Dubey, A., et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[37] OpenAI (2024). GPT-4o system card. Technical Report.

[38] Butler, R. N. (1963). The life review: An interpretation of reminiscence in the aged. Psychiatry, 26(1):65-76.

[39] Webster, J. D. (1993). Construction and validation of the Reminiscence Functions Scale. Journal of Gerontology, 48(5):P256-P262.

[40] Nikitina, S., Callaioli, S., and Baez, M. (2018). Smart conversational agents for reminiscence. In Proc. 1st International Workshop on Software Engineering for Cognitive Services, pages 52-57.

[41] Pessanha, F. and Akdag Salah, A. (2022). A computational look at oral history archives. ACM Journal on Computing and Cultural Heritage, 15(1):6:1-6:16.

[42] Hou, Y. (2020). Bridging anaphora resolution as question answering. In Proc. ACL, pages 1428-1438.

[43] Poesio, M., Stuckardt, R., and Versley, Y. (2016). Anaphora Resolution: Algorithms, Resources, and Applications. Springer.

[44] Ding, Y., Zeng, Q., and Weninger, T. (2024). ChatEL: Entity linking with chatbots. arXiv preprint arXiv:2402.14858.

[45] Xin, A., Qi, Y., Yao, Z., Zhu, F., Zeng, K., Bin, X., Hou, L., and Li, J. (2024). LLMAEL: Large language models are good context augmenters for entity linking. arXiv preprint arXiv:2407.04020.

[46] Kuratov, Y., Bulatov, A., Anokhin, P., Rodkin, I., Sorokin, D., Sorokin, A., and Burtsev, M. (2024). BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. In Proc. NeurIPS Datasets and Benchmarks Track. arXiv:2406.10149.

[47] Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. (2024). RULER: What's the real context size of your long-context language models? In Proc. COLM. arXiv:2404.06654.

[48] Modarressi, A., Deilamsalehy, H., Dernoncourt, F., Bui, T., Rossi, R. A., Yoon, S., and Schütze, H. (2025). NoLiMa: Long-context evaluation beyond literal matching. In Proc. ICML. arXiv:2502.05167.

[49] Žabokrtský, Z., Konopík, M., Nedoluzhko, A., Novák, M., Ogrodniczuk, M., Popel, M., Pražák, O., Sido, J., and Zeman, D. (2023). Findings of the second shared task on multilingual coreference resolution. In Proc. CRAC 2023 Shared Task at EMNLP.

[50] Draxler, C., van den Heuvel, H., van Hessen, A., Ircing, P., and Lehečka, J. (2024). Speech technology services for oral history research. In Proc. First Workshop on Holocaust Testimonies as Language Resources (HTRes) at LREC-COLING.

[51] Cherukuri, K. S., Moses, P. A., Sakata, A., Chen, J., and Chen, H. (2025). Large language models for oral history understanding with text classification and sentiment analysis. arXiv preprint arXiv:2508.06729.

Hyperparameter	Value
Base model	BAAI/bge-base-en-v1.5
Model parameters	~110M
Embedding dimension	768
Training examples	17,971
Epochs	3
Batch size	48
Learning rate	2e-5
Warmup steps	100
Loss function	MultipleNegativesRankingLoss (MNRL)
Optimizer	AdamW
Mixed precision	FP16 (AMP)
Negatives	In-batch (47 negatives per sample)
Random seed	42
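
The in-batch negative scheme listed in the table above can be illustrated with a minimal pure-Python sketch. It assumes cosine similarity and a scale factor of 20, which are the sentence-transformers defaults for MultipleNegativesRankingLoss and are not stated in the table; the helper names cosine, mnrl_loss, and negatives_per_sample are hypothetical. Each query is scored against every positive in the batch, and the positives belonging to other queries act as negatives, so a batch of 48 yields 47 negatives per sample.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(queries, positives, scale=20.0):
    """Mean cross-entropy over the batch, where the correct column for
    row i is i and every other column is an in-batch negative.
    scale=20.0 is an assumption (library default), not a reported value."""
    assert len(queries) == len(positives)
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in positives]
        # Numerically stable log-sum-exp over the row of logits.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)


def negatives_per_sample(batch_size):
    """Each sample uses the other batch_size - 1 positives as negatives."""
    return batch_size - 1
```

The loss approaches zero when each query is far more similar to its own positive than to any other positive in the batch, which is why larger batches (more negatives) tend to make the objective harder.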

Hyperparameter	Value
Base model	meta-llama/Llama-3.1-8B-Instruct
Model parameters	~8B (base); ~6.5M trainable (LoRA)
Quantization	4-bit NormalFloat (NF4)
Compute dtype	bfloat16
LoRA rank (r)	16
LoRA alpha (α)	32
LoRA dropout	0.05
Target modules	q_proj, v_proj, k_proj, o_proj
Training examples	17,971
Epochs	2
Per-device batch size	48
Gradient accumulation	1 (effective batch: 48)
Learning rate	2e-4
Max sequence length	192 tokens
Warmup steps	50
Precision	bfloat16
Validation samples	500 (subset of dev)
Framework	TRL SFTTrainer + PEFT
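
As a rough check on how the batch size, epoch count, and warmup setting in the table above relate, the following sketch derives the number of optimizer steps. It assumes a single device and that the final partial batch of each epoch is kept; neither is stated in the table, and the function name training_schedule is hypothetical.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum,
                      epochs, warmup_steps, num_devices=1):
    """Derive step counts from the reported hyperparameters.
    num_devices=1 and keeping the last partial batch are assumptions."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


# Values taken from the QLoRA table: 17,971 examples, batch 48,
# gradient accumulation 1, 2 epochs, 50 warmup steps.
qlora = training_schedule(17971, 48, 1, 2, 50)
```

Under these assumptions the schedule comes to 375 steps per epoch and 750 steps in total, so the 50 warmup steps cover roughly 6.7% of training.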
