
IRC-Bench: Recognizing Entities from Contextual Cues
in First-Person Reminiscences

Alexander Apartsin1, Eden Moran2, Yehudit Aperstein2
1School of Computer Science, Faculty of Sciences, HIT-Holon Institute of Technology, Holon 58102, Israel
2Intelligent Systems, Afeka Academic College of Engineering, Tel Aviv 69988, Israel

Abstract

When people share personal reminiscences, they routinely reference people, places, and events through contextual cues alone, assuming their audience can identify what is meant without explicit naming. This phenomenon is especially prevalent in reminiscence narratives: first-person accounts of lived experience used in therapeutic, archival, and social contexts. Building on prior work that established implicit entity recognition in short social-media text [21, 22], we extend this task to the reminiscence domain, where entity cues are distributed across multiple clauses rather than concentrated in a single short message. We release IRC-Bench (Implicit Reminiscence Context Benchmark), a benchmark of 25,136 samples derived from 12,337 unique Wikidata-linked entities across 1,994 reminiscence transcripts spanning 11 thematic domains. Each sample pairs an Entity-Grounded Narrative (EGN) containing the entity name with an Entity-Elided Narrative (EEN) from which all explicit mentions have been removed; systems must recover the correct entity given only the EEN. We identify a key structural property of implicit references, their non-locality: recognition cues are distributed across multiple non-contiguous clauses, fundamentally distinguishing implicit entity recognition from named entity recognition, entity linking, and coreference resolution. We evaluate 19 experimental configurations spanning open-world LLM generation, closed-world dense retrieval, hybrid RAG, and fine-tuning approaches. QLoRA-adapted Llama 3.1 8B achieves the highest open-world exact match at 38.94% (51.59% Jaccard), while fine-tuned DPR with entity descriptions reaches 35.38% Hit@1 (42.80% alias-aware) and 71.49% Hit@10 in the closed-world setting. Chain-of-thought prompting consistently degrades performance across all models, and retrieval-augmented generation underperforms direct LLM inference. All data, code, and evaluation tools are publicly released.

Keywords: implicit entity recognition, IRC-Bench, reminiscence narratives, coreference resolution, non-locality, large language models, dense passage retrieval, QLoRA, benchmark, entity linking, Wikidata

1. Introduction

Reminiscence, the act of recalling and sharing personal memories, plays a central role in human social life. In clinical settings, reminiscence therapy has been shown to reduce depression and improve well-being in older adults [6, 7], while in archival contexts, recorded reminiscences preserve cultural and historical knowledge that would otherwise be lost [8]. A defining characteristic of reminiscence narratives is that speakers assume shared context with their audience: they reference people, places, and events through contextual cues rather than explicit naming, trusting the listener to fill in the gaps. This implicit referencing is natural in conversation but creates a fundamental challenge for automated systems that seek to index, search, or analyze these narratives.

Consider the following passage from a Japanese American reminiscence:

Entity-Grounded Narrative (EGN)

"The attack on Pearl Harbor was the event that changed everything for Japanese Americans like me. After December 7, 1941, suspicion and hatred grew, and we were treated as enemy aliens despite being American citizens. It was because of Pearl Harbor that the government issued Executive Order 9066 and started the forced relocation."

Entity-Elided Narrative (EEN)

"The surprise attack on a naval base in Hawaii was the event that changed everything for Japanese Americans like me. After December 7, 1941, suspicion and hatred grew, and we were treated as enemy aliens despite being American citizens. It was because of that attack that the government issued an order and started the forced relocation."

Gold entity: Attack on Pearl Harbor (Q52418)  |  Type: Event  |  Cues: December 7, 1941; naval base in Hawaii; Executive Order 9066; forced relocation

A human reader readily identifies the Attack on Pearl Harbor from the constellation of cues: the date, the Hawaiian naval base, the executive order, the internment of Japanese Americans. No single phrase names the entity; instead, recognition depends on integrating cultural, temporal, and historical knowledge distributed across the entire passage. This pattern of implicit entity reference is pervasive in reminiscence narratives, where speakers routinely allude to well-known people, places, and events without naming them, relying on shared background knowledge with their listener.

This phenomenon falls between existing NLP tasks without being addressed by any of them. Named Entity Recognition (NER) identifies explicitly mentioned entity spans in text [1, 2]. Entity Linking (EL) resolves those spans to knowledge base entries [3, 4]. Coreference resolution connects multiple references to the same entity but requires at least one explicit mention as an antecedent [5]. In implicit entity references, the entity is never named anywhere in the text; there is no span to extract, no mention to link, no antecedent to resolve. The task can be viewed as a form of zero-mention coreference: resolving a reference to an entity that has no surface realization in the text, only a distributed constellation of contextual cues.

While implicit entity recognition was first explored in short social-media text [21, 22], we extend it to a fundamentally different setting: long-form reminiscence narratives where entity cues are non-local, distributed across multiple clauses. We release IRC-Bench (Implicit Reminiscence Context Benchmark), a large-scale evaluation resource constructed from real reminiscence transcripts. This task addresses practical needs across multiple domains. Archives of personal reminiscences, including oral history collections containing millions of hours of recorded testimony, remain largely inaccessible to structured search because the entities discussed are rarely stated by name [9]. In healthcare, reminiscence therapy is a widely used intervention for older adults with dementia and depression [6, 7, 10]; automated systems that support these therapeutic conversations must identify the people and events being discussed even when the speaker does not name them. Social robotics and conversational AI for elderly companionship similarly require understanding implicit references to engage meaningfully with users' personal histories [11, 12]. More broadly, information retrieval over personal narratives requires understanding not just what is said, but what is meant.

Our contributions are as follows:

  1. Task extension and non-locality. Building on Hosseini's [21] formulation of implicit entity recognition in tweets, we extend the task to long-form reminiscence narratives and formalize the non-locality property of implicit references in this domain: entity cues are distributed across multiple non-contiguous text spans, requiring holistic integration rather than local pattern matching. This structural property, absent in short-text settings, fundamentally distinguishes reminiscence-based implicit entity recognition from prior formulations.
  2. IRC-Bench. We release a benchmark of 25,136 implicit entity recognition samples derived from 12,337 unique Wikidata-linked entities sourced from 1,994 reminiscence transcripts across 11 thematic domains, with entity-level train/dev/test splits ensuring zero entity overlap between partitions. Each sample includes both an EGN and an EEN, along with entity metadata (QID, aliases, Wikipedia description).
  3. Comprehensive evaluation. We systematically compare 19 experimental configurations spanning open-world LLM inference (zero-shot, few-shot, chain-of-thought, QLoRA fine-tuning), closed-world dense retrieval (off-the-shelf and DPR fine-tuned), and hybrid RAG, revealing that fine-tuning doubles performance in both paradigms, chain-of-thought reasoning degrades performance on this task, and model scale is the dominant factor in open-world accuracy.

2. Related Work

2.1 Named Entity Recognition

Named Entity Recognition identifies and classifies explicit entity mentions in text. Classical approaches relied on handcrafted features and conditional random fields [1], while modern systems employ deep learning architectures including BiLSTM-CRF [9], transformer-based sequence labeling [10], and large language model prompting [11, 12]. Recent benchmarks such as CoNLL-2003 [13] and MultiCoNER [14] have driven progress across entity types and languages. The W2NER framework [15] unified flat, nested, and discontinuous NER as word-word relation classification, and UniversalNER [16] demonstrated targeted distillation from LLMs for open-domain entity extraction. Despite these advances, all NER formulations assume the target entity appears as an explicit surface form in the input text, an assumption that does not hold for implicit references.

2.2 Entity Linking

Entity Linking resolves textual mentions to entries in a knowledge base. Neural approaches include local attention models [3], bi-encoder architectures such as BLINK [17], autoregressive generation via GENRE [18], and efficient zero-shot systems like ReFinED [19]. Botha et al. [20] extended entity linking to over 100 languages. These systems take an identified mention span as input and rank candidate entities; they cannot operate when no mention span exists. Implicit entity recognition requires generating entity candidates from distributed contextual cues rather than resolving a given span.

2.3 Reminiscence Analysis and NLP

Reminiscence, the structured recall of autobiographical memories, has been studied extensively in psychology and gerontology. Butler [38] first proposed life review as a therapeutic process, and subsequent work established reminiscence therapy as an evidence-based intervention for depression and cognitive decline in older adults [6, 7]. Webster [39] developed the Reminiscence Functions Scale, identifying eight distinct functions of autobiographical memory sharing. Computational approaches to reminiscence have focused primarily on two areas: reminiscence therapy systems and oral history processing. Therapy-oriented systems use conversational agents or social robots to elicit and respond to personal memories [11, 12, 40], while oral history processing addresses transcription, topic segmentation, and search [9, 41]. However, none of these systems address the fundamental challenge of identifying the entities that speakers reference implicitly. Our work bridges this gap by extending implicit entity recognition, previously studied only in short social-media text [21, 22], to the reminiscence domain and providing the first benchmark derived from real reminiscence narratives.

2.4 Implicit and Zero-Mention Entities

Limited prior work has addressed entities that are referenced but not named. Hosseini [21] introduced implicit entity recognition in tweets, constructing a dataset of 3,119 tweets with implicit entity mentions. Hosseini and Bagheri [22] developed learning-to-rank methods for this Twitter dataset. Perera et al. [23] explored implicit entity recognition in clinical documents. The coreference resolution community has studied "zero anaphora" and bridging references [42, 43], where an entity is referenced indirectly through related concepts.

Our work differs from these efforts in five fundamental ways. First, domain and text structure. Tweets are short (under 280 characters), formulaic, and heavily context-dependent on trending topics; clinical notes follow rigid templates. Reminiscence narratives are extended first-person accounts (typically 50 to 200 words per sample) with rich, diffuse contextual cues spanning dates, locations, personal relationships, sensory details, and historical events. Second, non-locality. In tweets, the implicit entity is typically inferable from a single cue or hashtag context. In reminiscence narratives, we formalize and empirically demonstrate the non-locality property: recognition requires integrating multiple non-contiguous cues distributed across the entire passage. No prior work has identified or characterized this structural property. Third, scale and diversity. IRC-Bench contains 25,136 samples spanning 12,337 unique Wikidata-linked entities across 11 thematic domains, compared to 3,119 tweet samples in Hosseini [21] covering primarily entertainment and sports entities. Fourth, entity-level evaluation. We introduce entity-level train/test splitting with zero entity overlap, ensuring that models must generalize to entirely unseen entities rather than memorizing entity-specific patterns. Prior benchmarks used random sample-level splits where the same entity could appear in both training and test data. Fifth, comprehensive method comparison. We systematically evaluate 19 configurations spanning four paradigms (generative LLM, dense retrieval, RAG, fine-tuning), whereas prior work evaluated at most two to three approaches on a single paradigm.

2.5 Oral History NLP

Computational analysis of oral histories has received growing attention. Technology-assisted reminiscence systems have been developed for dementia care [7, 8], and AI-driven conversational agents have been explored as companions for elderly users [24, 25]. Digital storytelling platforms combining AI with augmented reality enable communities to preserve personal narratives [26]. However, these systems primarily facilitate memory recall and do not attempt to recover the implicit entities that speakers reference without naming.

2.6 Knowledge-Grounded Question Answering

The closest existing task to implicit entity recognition is knowledge-grounded question answering, where a system must reason over both a text passage and an external knowledge base to produce an answer [27, 28]. Retrieval-augmented generation (RAG) approaches retrieve relevant knowledge base passages and condition generation on them [29, 30]. While implicit entity recognition shares the requirement for external knowledge, it differs in that the "question" is an entire narrative rather than a targeted query, and the answer is always a single entity rather than a free-form text span. Furthermore, implicit entity recognition exhibits the non-locality property: the relevant cues are distributed throughout the passage rather than concentrated near a question token. This structural difference, as we show empirically, causes standard RAG pipelines to underperform direct LLM inference.


3. Dataset Construction

3.1 Overview

IRC-Bench is constructed through a four-stage automated pipeline that transforms oral history transcripts into implicit entity recognition samples. Each sample consists of a first-person narrative that references a named entity through contextual cues alone, without ever naming it. The pipeline leverages GPT-4.1-mini for entity extraction, summary generation, and implicit rewriting, producing 25,136 benchmark samples spanning 12,337 unique entities.

3.2 Source Collections

The raw data comprises 1,994 cleaned oral history transcripts drawn from 11 thematic collections. These collections provide broad topical diversity, covering military conflicts, social movements, immigration, public health crises, labor history, and academic life. The first-person narrative style of oral histories naturally provides rich contextual cues (dates, locations, relationships, roles, events) that make implicit entity references solvable for knowledgeable readers, while remaining challenging for automated systems.

Table 1: Source collections for IRC-Bench. Transcript counts reflect cleaned JSON files after the processing pipeline.

Collection | Transcripts | Sources | Description
Veterans | 517 | Library of Congress VHP, Nevada WWII, Niles Library, Wisconsin Veterans Museum | Military service narratives
Immigration | 402 | University of Minnesota, Densho Digital Archive | Immigration and assimilation experiences
Regional | 314 | University of Nevada Reno, Kentucky Oral History Commission | Regional and community histories
Depression Era | 213 | Federal Writers' Project (Library of Congress) | Great Depression oral histories
Japanese American | 156 | Densho Digital Archive | Japanese American internment and post-war
Academic | 153 | Columbia University Oral History, Smithsonian Archives of American Art | Academic and university histories
September 11 | 72 | National Park Service 9/11 Memorial | 9/11 experiences and aftermath
Civil Rights | 68 | Civil Rights History Project (Library of Congress) | Civil rights movement narratives
COVID-19 | 42 | Various oral history projects | Pandemic experiences
Labor | 30 | Labor Archives and Research Center | Labor movement histories
Refugee | 27 | Voices of Conscience, UNHCR collections | Refugee experiences
Total | 1,994 | 11 thematic domains, 25+ institutional archives |

3.3 Pipeline Stages

The benchmark construction proceeds in four stages, illustrated in Figure 1.

Stage 1: Transcript Cleaning. Raw oral history transcripts are cleaned and converted to structured JSON format, preserving the first-person narrative voice while removing interviewer questions and metadata artifacts.

Stage 2: Named Entity Recognition. GPT-4.1-mini performs NER on each transcript, identifying named entities of seven types: Place, Organization, Person, Event, Work, Military Unit, and Other. Each extracted entity is linked to Wikidata (QID) and Wikipedia where possible. This stage produces 31,284 entity mentions across 1,752 transcript files (87.9% coverage).

Stage 3: Explicit Summary Generation. For each (transcript, entity) pair, GPT-4.1-mini generates a first-person narrative summary focused on that entity, preserving the contextual cues surrounding the entity's mention. This produces 25,161 explicit summaries from 1,601 transcript files (80.3% coverage).

Stage 4: Implicit Rewriting. Each explicit summary is rewritten by GPT-4.1-mini to remove all direct mentions of the entity name while preserving all contextual cues. The entity reference is replaced with generic descriptions (e.g., "Attack on Pearl Harbor" becomes "the attack on the naval base in Hawaii"). This produces the final 25,136 implicit rewrites (80.2% coverage), which form the benchmark's implicit_text field. The slight reduction from Stage 3 reflects a small number of cases where the entity could not be satisfactorily anonymized.


Figure 1: IRC-Bench construction pipeline. Raw oral history transcripts undergo cleaning, named entity recognition with Wikidata linking, entity-grounded narrative generation, and entity elision to produce implicit entity recognition evaluation samples.

3.4 Entity Knowledge Base

The entity knowledge base contains 12,337 unique entities with the following metadata coverage: 84.6% have associated Wikipedia pages, 70.9% have LLM-generated descriptions, and 51.2% have alternative names sourced from Wikidata. Entity representations in the knowledge base serve multiple roles: as retrieval targets for closed-world experiments, as alias sources for evaluation matching, and as description inputs for embedding-based approaches.

3.5 Entity-Level Train/Dev/Test Splitting

To ensure rigorous evaluation, the dataset is split at the entity level rather than the sample level. All samples for a given entity appear in exactly one partition, preventing information leakage where a model might learn entity-specific patterns from training examples and exploit them at test time. The split uses a 70/10/20 ratio (seed=42).
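The splitting procedure can be sketched as follows. This is an illustrative reconstruction, not the released code; the sample schema (a `qid` field per sample) is an assumption.

```python
import random
from collections import defaultdict

def entity_level_split(samples, train=0.7, dev=0.1, seed=42):
    """Split samples so that all samples sharing a Wikidata QID land
    in exactly one partition, preventing entity-level leakage.
    (Illustrative sketch; the 'qid' field name is an assumption.)"""
    by_entity = defaultdict(list)
    for s in samples:
        by_entity[s["qid"]].append(s)
    qids = sorted(by_entity)
    random.Random(seed).shuffle(qids)
    n_train = int(train * len(qids))
    n_dev = int(dev * len(qids))
    parts = {
        "train": qids[:n_train],
        "dev": qids[n_train:n_train + n_dev],
        "test": qids[n_train + n_dev:],
    }
    return {name: [s for q in qs for s in by_entity[q]]
            for name, qs in parts.items()}
```

Because partitioning is done over QIDs rather than samples, the 70/10/20 ratio applies to entities; sample counts per partition vary with how many narratives each entity appears in.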

Table 2: IRC-Bench partition statistics. Entity-level splits ensure zero overlap between train, dev, and test entities.

Partition | Samples | Entities
Train | 17,971 | 8,635
Dev | 2,532 | 1,234
Test | 4,633 | 2,468
Total | 25,136 | 12,337

3.6 Entity Type Distribution

The dataset exhibits a natural long-tail distribution over entity types. Places dominate (47.3%), reflecting oral histories' emphasis on geographic locations. Organizations (21.3%) and Persons (13.7%) follow, while specialized types such as Events, Works, and Military Units are less frequent but still well represented. Table 3 reports the full distribution.

Table 3: Distribution of IRC-Bench samples by entity type.

Entity Type | Samples | % of Total | Unique Entities
Place | 11,893 | 47.3% | 4,821
Organization | 5,366 | 21.3% | 2,894
Person | 3,450 | 13.7% | 2,207
Event | 2,162 | 8.6% | 1,102
Work | 1,195 | 4.8% | 743
Military Unit | 537 | 2.1% | 312
Other | 533 | 2.1% | 258
Total | 25,136 | 100% | 12,337

Figure 2: IRC-Bench dataset composition. Left: distribution of samples across entity types. Right: distribution across thematic domains.

3.7 Example Samples

Figure 3 presents three EGN/EEN pairs illustrating the diversity of implicit references in IRC-Bench.

Example 1: Person

EGN

"Rosa Parks was arrested on December 5, 1955, in Montgomery, Alabama, for refusing to give up her bus seat, an act that sparked the Montgomery bus boycott. E. D. Nixon called me late that night to inform me of her arrest and to urge action."

EEN

"A woman was arrested on December 5, 1955, in Montgomery, Alabama, for refusing to give up her bus seat, an act that sparked the Montgomery bus boycott. A local leader called me late that night to inform me of her arrest and to urge action."

Gold: Rosa Parks (Q41921)  |  Cues: December 5 1955, Montgomery Alabama, bus seat refusal, bus boycott

Example 2: Event

EGN

"I headed the relief committee during the disastrous Berkeley Fire of 1923, helping to coordinate aid and recovery efforts for the community. This was a challenging time for Berkeley, California, and I took an active role in organizing support to help residents rebuild."

EEN

"I headed the relief committee during the disastrous fire of 1923 in a California city, helping to coordinate aid and recovery efforts for the community. This was a challenging time for the city, and I took an active role in organizing support to help residents rebuild."

Gold: Berkeley Fire of 1923 (Q4561337)

Example 3: Organization (5 cues)

EGN

"After leaving the Navy in 1966, I worked in the warehouse at Montgomery Ward in Redwood City. It was a non-union job and pretty low-key, just me and an older lady doing pricing and warehouse work."

EEN

"After leaving the Navy in 1966, I worked in the warehouse at a national department store in Redwood City. It was a non-union job and pretty low-key, just me and an older lady doing pricing and warehouse work."

Gold: Montgomery Ward (Q3046) | Cues: Navy 1966, warehouse, national department store, Redwood City, non-union

Figure 3: Three EGN/EEN pairs from IRC-Bench. Blue highlights mark explicit entity mentions in EGNs; red highlights show the elided descriptions in EENs. Examples span Person, Event, and Organization types, demonstrating how distributed cues (dates, locations, roles, institutions) jointly identify the entity.

3.8 Benchmark Calibration

To validate the quality and difficulty calibration of IRC-Bench, we conducted an automated quality assessment on 500 randomly sampled test instances using GPT-4o as an evaluator. Each sample was assessed along two dimensions: narrative naturalness (1 to 5 scale) and cue-based recoverability (whether a knowledgeable human could identify the entity from the EEN alone, given the entity identity for reference).

The EEN narratives achieve a mean naturalness score of 4.87 out of 5, with 87% of samples rated at the maximum score, confirming that the entity elision process produces fluent, natural-sounding first-person text. For recoverability, 42.0% of samples were judged as recoverable (5.8% "yes," 36.2% "probably"), 7.2% as "possible with expertise," and 50.8% as "unlikely" or "no." This distribution indicates well-calibrated difficulty: the benchmark is challenging enough to be non-trivial (half the samples resist even informed human judgment) yet solvable enough to reward strong models. Notably, the 42% recoverability rate closely matches the performance of the best systems (O10 QLoRA at 41.4% alias match; C5 DPR at 42.8% alias Hit@1), suggesting that top models are approaching the practical ceiling imposed by the available contextual cues. Cue sufficiency was rated at a mean of 3.0 out of 5, confirming moderate overall difficulty with substantial variance across samples.


Figure 4: IRC-Bench quality validation (n=500, GPT-4o judge). Left: distribution of naturalness scores (mean 4.87/5). Right: recoverability judgments showing well-calibrated difficulty.


4. Methodology

4.1 Task Formulation

Implicit entity recognition, the task of identifying entities that are contextually referenced but never explicitly named, was first studied by Hosseini [21] in the context of tweets. We adopt the same core objective and extend it to long-form reminiscence narratives: given a first-person narrative text \(t\) that implicitly references a named entity \(e\) without ever mentioning \(e\) by name, the task is to identify \(e\). The text \(t\) contains contextual cues (dates, locations, events, people, roles, descriptions) that jointly constrain the identity of \(e\), but the model must synthesize these cues and draw on world knowledge to produce the correct entity name.

We evaluate implicit entity recognition under two formulations:

Open-world formulation. The model generates the entity name as free-form text, without access to a candidate set. This tests the model's ability to recall entities from its parametric knowledge. The open-world setting is more realistic, as it does not assume a closed inventory of possible entities.

Closed-world formulation. The model ranks all 12,337 entities in the knowledge base by relevance to the query text, selecting the highest-ranked candidate. This tests the model's ability to match implicit descriptions to entity representations via embedding similarity. The closed-world setting provides Hit@K metrics and is analogous to entity linking with a fixed knowledge base.

4.2 The Non-Locality Property

We define a key structural property that distinguishes implicit entity recognition from span-based entity tasks. Let \(T\) denote the narrative text and \(e^*\) the gold entity, and let \(C(T, e^*) = \{c_1, c_2, \ldots, c_n\}\) denote the set of textual cues in \(T\) that collectively identify \(e^*\). In standard NER and EL, the entity is localized: there exists a contiguous span \(m\) that is sufficient to identify \(e^*\). In implicit entity recognition, the entity is non-local:

$$\nexists \, m \subset T \;\text{s.t.}\; m \text{ is contiguous} \,\wedge\, m \Rightarrow e^*, \quad \text{yet} \quad C(T, e^*) \Rightarrow e^* \;\text{with the } c_i \text{ non-contiguous}$$

That is, no single contiguous substring of \(T\) is sufficient to identify \(e^*\), but the set of distributed cues collectively determines it. This non-locality has direct implications for method design: approaches that rely on local span matching (NER, EL) or single-vector passage encoding (dense retrieval) are structurally disadvantaged relative to approaches that can integrate information across the full text (LLMs with sufficient context windows).

We empirically validate non-locality by comparing GPT-4o zero-shot accuracy on full implicit texts versus individual sentences in isolation (n=200). Full-text accuracy reaches 33.5%, while single-sentence accuracy drops to 12.9%, a gap of 20.6 percentage points. This confirms that entity recognition requires integrating cues distributed across the entire passage; no single sentence carries sufficient information in the majority of cases.

4.3 Open-World Methods

4.3.1 LLM Generative Approach

We evaluate LLMs in a generative setting where each model receives the implicit text and must produce the entity name. All direct-prompting models use temperature 0.0 (greedy decoding) and a maximum of 100 output tokens. We test zero-shot (ZS) and few-shot (FS, 5 fixed demonstrations) prompting strategies. Few-shot exemplars are selected to cover diverse entity types and are held constant across all test samples. Complete prompt templates appear in Appendix A.

4.3.2 Models

We evaluate three LLMs in the open-world setting: GPT-4o [37] and GPT-4.1-mini (both via the OpenAI Batch API), and Llama 3.1 8B Instruct [35, 36] (via the OpenRouter API). For all three models, we additionally evaluate chain-of-thought (CoT) prompting, which instructs the model to reason step-by-step before producing the final answer. CoT experiments use temperature 0.7 and a maximum of 300 output tokens to accommodate the reasoning trace.

4.3.3 QLoRA Fine-tuning (O10)

We fine-tune Llama 3.1 8B Instruct using QLoRA (Quantized Low-Rank Adaptation) [33, 34] for implicit entity recognition. The model is trained to generate the entity name given the implicit text, using the standard causal language modeling objective. The entity-level splitting guarantees zero overlap between training and test entities, so the fine-tuned model cannot memorize entity-specific patterns; it must learn to generalize the implicit-to-entity mapping to entirely unseen entities. Key training parameters include 4-bit NF4 quantization, LoRA rank 16, alpha 32, learning rate 2e-4, and 2 epochs of training on the full train split (17,971 samples). Full hyperparameters are reported in Appendix B.
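The adapter setup described above can be sketched with the Hugging Face `transformers` and `peft` libraries. This is an illustrative reconstruction of the stated hyperparameters, not the released training code; unstated details (compute dtype, target modules) are assumptions.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the base model (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute
)

# Low-rank adapters: rank 16, alpha 32, causal LM objective.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
)

# Training (per the text): learning rate 2e-4, 2 epochs over the
# 17,971-sample train split; the trainer wiring is omitted here.
```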

4.4 Closed-World Methods

In the closed-world setting, we encode both the implicit query text and all 12,337 entity representations into a shared embedding space, then rank entities by cosine similarity. We explore three entity representation strategies: Name (the entity name alone), Description (the entity name concatenated with its LLM-generated description), and Wiki (the first sentence from the entity's Wikipedia article). For entities lacking a description or Wikipedia text, we fall back to the next available representation.
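With L2-normalized embeddings, the ranking step reduces to a dot product; a minimal sketch, abstracting away the embedding model itself:

```python
import numpy as np

def rank_entities(query_vec, entity_matrix):
    """Rank all entities by cosine similarity to the query embedding.
    Rows of entity_matrix are entity representation embeddings;
    L2 normalization makes the dot product equal cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    E = entity_matrix / np.linalg.norm(entity_matrix, axis=1, keepdims=True)
    scores = E @ q
    return np.argsort(-scores)  # entity indices, best first
```

Hit@K then checks whether the gold entity's index appears among the first K returned indices.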

4.4.1 BGE-base Baseline (C1, C2, C3)

We use BAAI/bge-base-en-v1.5 [31] as our baseline embedding model. This 110M-parameter model produces 768-dimensional embeddings and ranks among the top general-purpose bi-encoders on the MTEB benchmark. Embeddings are L2-normalized before computing cosine similarity.

4.4.2 DPR Fine-tuning (C4, C5, C6)

We fine-tune BGE-base using a Dense Passage Retrieval (DPR) approach [30] with Multiple Negatives Ranking Loss (MNRL). Each training pair consists of an implicit text (query) and its gold entity representation (positive passage). MNRL uses in-batch negatives: for a batch of \(B\) query-positive pairs, each positive for one query serves as a negative for all other queries, providing \(B-1\) negatives per sample without explicit hard negative mining. We train for 3 epochs with batch size 48 and learning rate 2e-5. Three separate models are trained, one for each entity representation strategy.
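MNRL with in-batch negatives is equivalent to a cross-entropy loss over the batch similarity matrix with the diagonal as the target. A minimal NumPy sketch; the scale factor of 20 mirrors the sentence-transformers default and is an assumption here, not a reported hyperparameter:

```python
import numpy as np

def mnrl_loss(q_emb, p_emb, scale=20.0):
    """Multiple Negatives Ranking Loss with in-batch negatives:
    for row i, p_emb[i] is the positive and every p_emb[j], j != i,
    serves as a negative. Cross-entropy over the scaled cosine
    similarity matrix, with the diagonal as the gold class."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = p_emb / np.linalg.norm(p_emb, axis=1, keepdims=True)
    sim = scale * (q @ p.T)                  # (B, B) similarity matrix
    sim -= sim.max(axis=1, keepdims=True)    # numerically stable softmax
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))
```

The loss approaches zero when each query is closest to its own positive, and grows as in-batch negatives outrank the positive, which is what drives the implicit text and its gold entity representation together.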

4.5 RAG Baseline (RAG1)

We implement a Retrieval-Augmented Generation (RAG) baseline that combines embedding retrieval with LLM reranking. The pipeline operates in two stages. First, BGE-base with entity descriptions (C2 configuration) retrieves the top-5 candidate entities for each implicit query. Second, GPT-4.1-mini receives the implicit text along with the 5 candidates (with their descriptions) and selects the most likely entity or suggests a better one. This approach tests whether an LLM can effectively rerank retrieved candidates to improve over pure embedding retrieval.
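The second-stage reranking prompt can be sketched as follows; the wording and helper name are illustrative, not the paper's exact template:

```python
def build_rerank_prompt(implicit_text, candidates):
    """Assemble the stage-two prompt: the implicit narrative plus the
    top-5 retrieved candidates with descriptions, asking the LLM to
    select one or propose a better entity. (Illustrative wording.)"""
    lines = ["Narrative:", implicit_text, "", "Candidate entities:"]
    for i, (name, desc) in enumerate(candidates, 1):
        lines.append(f"{i}. {name} - {desc}")
    lines.append("")
    lines.append("Select the entity the narrative refers to, "
                 "or name a better one if none fit.")
    return "\n".join(lines)
```

Allowing the model to suggest an entity outside the candidate list makes the pipeline a hybrid of closed-world reranking and open-world generation.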


5. Evaluation Protocol

5.1 Matching Hierarchy

Entity names can be expressed in multiple valid forms (e.g., "United States Marine Corps" vs. "USMC" vs. "Marines"). To account for this variation, we employ a four-tier matching hierarchy, applied in order of decreasing strictness:

Tier 1 (Exact match): The prediction and gold entity are identical after lowercasing and whitespace trimming.

Tier 2 (Alias match): The prediction matches one of the gold entity's known aliases from Wikidata. For example, predicting "NYC" for gold entity "New York City" is an alias match.

Tier 3 (Containment match): The prediction is a substring of the gold entity, or vice versa. For example, predicting "Pearl Harbor" for "Attack on Pearl Harbor" qualifies as a containment match.

Tier 4 (Jaccard match): The token-level Jaccard similarity between the prediction and gold entity is at least 0.5. This captures partial overlaps where the prediction includes most of the relevant tokens.

A prediction is considered correct at a given tier if it matches at that tier or any stricter tier. When reporting alias-aware accuracy (the primary metric for open-world experiments), we count any prediction that achieves Tier 1 or Tier 2 as correct.
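The hierarchy can be implemented directly. The `match_tier` sketch below is illustrative (in our pipeline the alias sets come from Wikidata); it returns the strictest tier at which a prediction matches, or `None`:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match_tier(prediction, gold, aliases):
    """Return the strictest matching tier (1-4), or None for no match."""
    pred, gold_n = prediction.strip().lower(), gold.strip().lower()
    aliases_n = {a.strip().lower() for a in aliases}
    if pred == gold_n:
        return 1                            # Tier 1: exact match
    if pred in aliases_n:
        return 2                            # Tier 2: alias match
    if pred in gold_n or gold_n in pred:
        return 3                            # Tier 3: containment match
    if jaccard(pred, gold_n) >= 0.5:
        return 4                            # Tier 4: Jaccard >= 0.5
    return None
```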

5.2 Metrics

Open-world experiments report exact match (Tier 1), alias match (Tiers 1+2), containment match (Tiers 1+2+3), and Jaccard match (all four tiers). Closed-world experiments report Hit@K (K = 1, 3, 5, 10), Mean Reciprocal Rank (MRR), and alias-aware Hit@K (where a hit counts if any of the top-K candidates matches the gold entity or one of its Wikidata aliases).
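Both closed-world metrics reduce to simple aggregations over the gold entity's rank for each query. A minimal sketch, where a rank of `None` denotes a gold entity absent from the returned list:

```python
def hit_at_k(rankings, k):
    """Fraction of queries whose gold entity has 1-based rank <= k."""
    return sum(1 for r in rankings if r is not None and r <= k) / len(rankings)

def mrr(rankings):
    """Mean Reciprocal Rank: average of 1/rank, contributing 0 for misses."""
    return sum(1.0 / r for r in rankings if r is not None) / len(rankings)
```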

5.3 Statistical Significance

To assess whether performance differences between methods are statistically significant, we use McNemar's test (with continuity correction) on the paired per-sample outcomes from each pair of compared systems. Additionally, we compute bootstrap confidence intervals (1,000 resamples, seed=42) at the 95% level.
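Both procedures operate on paired per-sample 0/1 outcomes. The self-contained sketch below illustrates them: `mcnemar` takes the two discordant counts (samples solved only by system A, and only by B) and uses the closed-form 1-df chi-squared survival function; `bootstrap_ci` is a percentile bootstrap over accuracy:

```python
import math
import random

def mcnemar(b, c):
    """McNemar's test with continuity correction on discordant counts b, c.
    Returns (chi-squared statistic, two-sided p-value, 1 degree of freedom)."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))   # survival function of chi2 with 1 df
    return chi2, p

def bootstrap_ci(outcomes, n_resamples=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Applied to the discordant counts from the GPT-4o vs. Llama 8B comparison (892 vs. 203), this formula reproduces the chi-squared value of 432.28 reported in Section 6.7.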


6. Results and Analysis

6.1 Open-World Performance

Table 4 presents the open-world results across all experimental configurations. The QLoRA-adapted Llama 3.1 8B (O10) achieves the highest exact match accuracy at 38.94%, substantially outperforming all other open-world methods. Among non-fine-tuned models, GPT-4o with few-shot prompting (O2) is the strongest at 31.62% exact match, rising to 41.10% under the full four-tier Jaccard evaluation.

Model scale is the dominant factor for zero-shot performance: moving from Llama 3.1 8B (13.92%) to GPT-4.1-mini (25.71%) to GPT-4o (27.02%) yields consistent gains. Few-shot prompting consistently improves performance across all model sizes (p < 0.001 by McNemar's test). The improvement ranges from +2.95 percentage points for GPT-4.1-mini to +4.60 points for GPT-4o. The few-shot examples appear to calibrate the model's output format and entity granularity, reducing cases where models produce entity types instead of specific entity names.

Table 4: Open-world results on the IRC-Bench test set (n=4,633). Exact = exact string match; Alias = alias-aware match; Contain = containment match; Jaccard = Jaccard match (≥0.5). Best result in each column is highlighted.

ID | Model | Mode | Exact (%) | Alias (%) | Contain (%) | Jaccard (%)
O1 | GPT-4o | Zero-shot | 27.02 | 33.30 | 33.30 | 35.05
O2 | GPT-4o | Few-shot | 31.62 | 38.94 | 38.94 | 41.10
O3 | GPT-4.1-mini | Zero-shot | 25.71 | 27.09 | 33.50 | 35.94
O4 | GPT-4.1-mini | Few-shot | 28.66 | 36.89 | 36.89 | 39.48
O5 | Llama 3.1 8B | Zero-shot | 13.92 | 14.81 | 19.47 | 20.18
O6 | Llama 3.1 8B | Few-shot | 17.83 | 18.80 | 24.61 | 25.66
O10 | Llama 3.1 8B (QLoRA) | Fine-tuned | 38.94 | 41.42 | 47.90 | 51.59
O11/b | GPT-4.1-mini CoT | t=0.7 / t=0.0 | 18.93 / 19.44 | 20.27 / 20.76 | 26.48 / 26.87 | 27.69 / 28.10
O12/b | GPT-4o CoT | t=0.7 / t=0.0 | 22.51 / 25.57 | 23.89 / 33.54 | 30.91 / 37.21 | 32.33 / 38.92
O13 | Llama 3.1 8B CoT | t=0.7 | 6.22 | 6.69 | 11.72 | 12.24
RAG1 | BGE + GPT-4.1-mini | RAG | 19.71 | 20.53 | 28.75 | 29.55

Figure 4: Comparison of open-world and closed-world methods on the IRC-Bench test set. Open-world methods are measured by exact match and alias-aware accuracy; closed-world methods by Hit@1 and Hit@10.

The most striking open-world result is the effect of QLoRA fine-tuning. O10 (QLoRA Llama 3.1 8B) achieves 38.94% exact match, nearly tripling the base model's zero-shot performance (13.92%) and exceeding GPT-4o few-shot (31.62%) by 7.32 percentage points. At the Jaccard level, O10 reaches 51.59%, meaning more than half of all test predictions are at least partially correct. This is particularly notable given the entity-level split: O10 has never seen any of the 2,468 test entities during training, demonstrating genuine generalization of the implicit-to-entity mapping.

The failure of chain-of-thought prompting is equally striking. CoT reduces GPT-4o accuracy from 33.30% (zero-shot alias) to 23.89%, and GPT-4.1-mini from 25.71% (zero-shot exact) to 18.93%. CoT also degrades Llama 3.1 8B from 13.92% (zero-shot exact) to 6.22%. We analyze the reasons for this failure in Section 7.

The hybrid RAG approach (19.71% exact match) underperforms even GPT-4.1-mini zero-shot (25.71%). When the gold entity does not appear among the top-5 candidates (which occurs in roughly 67% of cases with BGE-base, given C2's Hit@5 of 33.41%), the LLM reranker cannot recover it.

6.2 Closed-World Performance

Table 5 shows the closed-world retrieval results. Fine-tuned DPR with description representations (C5) achieves the best performance: 35.38% Hit@1, 71.49% Hit@10, and 0.4751 MRR. With alias-aware evaluation, C5 reaches 42.80% Hit@1 and 74.47% Hit@10.

Table 5: Closed-world retrieval results on the IRC-Bench test set. The candidate set contains all 12,337 entities. Best result in each column is highlighted. Alias columns report alias-aware metrics.

ID | Retriever | Entity Repr. | Hit@1 (%) | Hit@3 (%) | Hit@5 (%) | Hit@10 (%) | MRR | Alias H@1 (%)
C1 | BGE (off-the-shelf) | Name | 16.51 | 26.38 | 30.97 | 36.76 | 0.2362 | 22.08
C2 | BGE (off-the-shelf) | Description | 16.64 | 27.78 | 33.41 | 40.60 | 0.2480 | 21.78
C3 | BGE (off-the-shelf) | Wiki | 14.38 | 25.10 | 29.92 | 37.32 | 0.2211 | 19.32
C4 | DPR (fine-tuned) | Name | 30.00 | 46.36 | 53.66 | 63.31 | 0.4131 | 37.10
C5 | DPR (fine-tuned) | Description | 35.38 | 53.51 | 61.82 | 71.49 | 0.4751 | 42.80
C6 | DPR (fine-tuned) | Wiki | 27.95 | 44.98 | 51.82 | 59.55 | 0.3851 | 34.38

Figure 5: Hit@K curves for closed-world retrieval methods. Fine-tuned DPR with description representations (C5) substantially outperforms all baseline configurations across all K values.

The comparison between off-the-shelf BGE and fine-tuned DPR reveals the magnitude of domain adaptation benefits. DPR fine-tuning more than doubles Hit@1 for all entity representation types: Name (16.51% to 30.00%, +13.49 pp), Description (16.64% to 35.38%, +18.74 pp), and Wiki (14.38% to 27.95%, +13.57 pp). The largest absolute gain occurs for descriptions, indicating that fine-tuning is especially effective at learning to align the narrative cue structure with the rich attribute content in entity descriptions.

Across both retrieval architectures, entity description representations consistently outperform name-only and Wikipedia representations. Descriptions provide a concise, attribute-rich summary that aligns well with the contextual cues present in elided narratives. Wikipedia lead sentences, despite containing more information, introduce noise from tangential content.


Figure 6: Effect of DPR fine-tuning on retrieval performance. Fine-tuning more than doubles Hit@1 across all entity representation strategies, with the largest absolute gain for descriptions (+18.74 pp).

6.3 Cross-Paradigm Comparison

Table 6 ranks the top-performing systems across both paradigms under each paradigm's most permissive accuracy metric: the full four-tier match for open-world systems and alias-aware Hit@1 for closed-world systems.

Table 6: Cross-paradigm ranking. Open-world methods are scored with the full four-tier evaluation (Jaccard level); closed-world methods with alias-aware Hit@1.

Rank | System | Paradigm | Score (%)
1 | O10 (QLoRA Llama 8B) | Open | 51.59
2 | C5 (DPR + Description) | Closed | 42.80
3 | O2 (GPT-4o FS) | Open | 41.10
4 | O4 (GPT-4.1-mini FS) | Open | 39.48
5 | C4 (DPR + Name) | Closed | 37.10
6 | O3 (GPT-4.1-mini ZS) | Open | 35.94
7 | O1 (GPT-4o ZS) | Open | 35.05
8 | C6 (DPR + Wiki) | Closed | 34.38

The fine-tuned QLoRA model (O10) leads by a substantial margin, achieving 51.59% Jaccard accuracy. The fine-tuned DPR retriever (C5) ranks second at 42.80% alias-aware Hit@1, outperforming GPT-4o few-shot (41.10%). This is notable because C5 uses only a 110M-parameter embedding model, while GPT-4o is estimated at well over 100B parameters.

6.4 Per-Entity-Type Analysis

Performance varies substantially by entity type. Table 7 reports the alias-aware Hit@1 (all tiers) for selected methods.

Table 7: Hit@1 (%) by entity type (alias-aware, all tiers). n denotes the number of test samples of each type.

Entity Type | n | O1 (GPT-4o ZS) | O2 (GPT-4o FS) | O5 (Llama 8B ZS) | C1 (BGE Name) | C2 (BGE Desc)
Place | 2,076 | 38.15 | 43.88 | 18.16 | 14.88 | 15.99
Organization | 1,152 | 38.28 | 45.31 | 27.34 | 29.17 | 27.17
Person | 698 | 23.82 | 24.07 | 14.90 | 18.34 | 18.62
Event | 273 | 34.43 | 50.18 | 27.11 | 48.35 | 47.25
Work | 215 | 32.09 | 39.53 | 14.42 | 39.07 | 36.74
Military Unit | 121 | 26.45 | 37.19 | 10.74 | 23.97 | 31.40
Other | 98 | 30.61 | 36.73 | 21.43 | 33.67 | 25.51

Figure 7: Heatmap of performance (alias-aware Hit@1) by entity type and method. Person entities are consistently the hardest across all methods; Events are notably strong for both open-world and closed-world approaches.

Persons are the hardest type for open-world methods. GPT-4o FS achieves only 24.07% on Person entities, compared to 45.31% on Organizations and 43.88% on Places. Person entities often have less distinctive contextual cues and are more likely to be obscure individuals not well represented in model training data.

Events are notably strong for closed-world methods. BGE achieves 48.35% Hit@1 on Events, higher than any other type, suggesting that event descriptions provide distinctive semantic signatures that align well with implicit event narratives.

Few-shot examples disproportionately help Events. GPT-4o jumps from 34.43% (ZS) to 50.18% (FS) on Events (+15.75 pp), the largest per-type improvement, likely because the few-shot examples include two Event instances (Attack on Pearl Harbor).

6.5 Error Analysis

We performed automated error classification on 200 randomly sampled incorrect predictions from each of O1 through O6, using GPT-4.1-mini to categorize errors. Table 8 reports the distribution.

Table 8: Error type distribution (%) over 200 randomly sampled incorrect predictions per model. Categories are mutually exclusive.

Error Type | O1 | O2 | O3 | O4 | O5 | O6
Same-type, unrelated | 43.0 | 42.0 | 43.5 | 45.0 | 52.0 | 46.0
Wrong type | 28.5 | 27.5 | 29.5 | 22.5 | 31.0 | 35.0
Same-type, related | 24.5 | 25.5 | 22.5 | 24.0 | 13.5 | 17.0
Partial match | 3.5 | 4.0 | 3.0 | 6.0 | 2.5 | 1.5
Empty/hallucination | 0.5 | 1.0 | 1.5 | 2.5 | 1.0 | 0.0

The dominant error mode across all models is same-type, unrelated (42% to 52%), where the model predicts an entity of the correct type but one that is semantically unrelated to the gold entity (e.g., predicting "Jack Johnson" when the gold is "Lou Ambers," both boxers). The second most common error is wrong type (22.5% to 35.0%), where the model predicts an entity of an entirely different category. Same-type, related errors (13.5% to 25.5%) represent near-misses where the prediction is semantically close to the gold (e.g., predicting "Okinawa" for "Iwo Jima"). Hallucinations and empty responses are rare (<2.5%), indicating that models reliably produce plausible entity names even when incorrect.

Llama 3.1 8B (O5, O6) shows a higher proportion of same-type, unrelated errors (52.0% and 46.0%) and a lower proportion of same-type, related errors (13.5% and 17.0%) compared to GPT models (O1, O2: 24.5% and 25.5%). This suggests that smaller models have weaker ability to narrow down candidates within a type using fine-grained contextual cues.

6.6 Key Findings Summary

We summarize the principal findings as a numbered list, with each claim supported by specific experimental comparisons:

Finding 1: Fine-tuning is the most impactful intervention. QLoRA fine-tuning of Llama 3.1 8B raises exact match from 13.92% (O5, zero-shot) to 38.94% (O10), a 2.80x improvement. DPR fine-tuning of BGE raises Hit@1 from 16.64% (C2) to 35.38% (C5), a 2.13x improvement. Both gains are achieved despite zero entity overlap between training and test sets.

Finding 2: QLoRA fine-tuning yields the overall best performance. O10 achieves 38.94% exact match (51.59% Jaccard), surpassing GPT-4o few-shot (31.62% exact, 41.10% Jaccard) by 7.32 pp on exact match and 10.49 pp on Jaccard. This result is achieved with only 6.5M trainable parameters on top of an 8B-parameter base.

Finding 3: Chain-of-thought degrades all models. CoT reduces GPT-4o from 33.30% (ZS alias) to 23.89% (a 28.3% relative drop), GPT-4.1-mini from 25.71% (ZS exact) to 18.93% (a 26.4% drop), and Llama 3.1 8B from 13.92% (ZS exact) to 6.22% (a 55.3% drop). To rule out temperature as a confounding factor (CoT experiments used t=0.7 vs. t=0.0 for direct prompting), we repeated O11 and O12 at t=0.0. For GPT-4.1-mini, the effect is negligible (+0.5pp alias), confirming that CoT structurally degrades performance on this task. For GPT-4o, lowering temperature recovers 9.6pp (alias rising from 23.9% to 33.5%), reaching parity with zero-shot (33.3%) but not exceeding it. This indicates that for GPT-4o, the temperature difference accounts for the majority of the observed CoT penalty, while the reasoning structure itself neither helps nor hurts. For smaller models, CoT is genuinely harmful regardless of temperature.

Finding 4: Few-shot prompting consistently helps. Adding 5 demonstrations improves GPT-4o from 27.02% to 31.62% (+4.60 pp), GPT-4.1-mini from 25.71% to 28.66% (+2.95 pp), and Llama 3.1 8B from 13.92% to 17.83% (+3.91 pp). All differences are significant (p < 0.001).

Finding 5: Entity descriptions are the best retrieval representation. C5 (DPR+Desc) outperforms C4 (DPR+Name) by 5.38 pp on Hit@1 (35.38% vs. 30.00%) and C6 (DPR+Wiki) by 7.43 pp (35.38% vs. 27.95%). The pattern holds for off-the-shelf BGE as well.

Finding 6: RAG underperforms direct LLM inference. RAG1 (19.71% exact match) is 5.99 pp below GPT-4.1-mini zero-shot (25.71%) and 8.95 pp below GPT-4.1-mini few-shot (28.66%). The retrieval bottleneck is the limiting factor.

Finding 7: Model scale matters substantially in the zero-shot regime. GPT-4o ZS (27.02%) outperforms Llama 3.1 8B ZS (13.92%) by 13.10 pp (McNemar chi-squared = 432.28, p < 0.001, with 892 vs. 203 discordant pairs).

Finding 8: The retriever's Hit@10 reveals strong latent signal. C5 achieves 71.49% Hit@10, meaning the gold entity is in the top-10 for nearly three-quarters of queries. Combining DPR shortlists with LLM reranking is a promising direction.

6.7 Statistical Significance

All key comparisons are statistically significant at p < 0.001 (McNemar's test with continuity correction). Table 9 reports the detailed results.

Table 9: Statistical significance tests (McNemar's test with continuity correction). All p-values < 0.001. A-only and B-only report the number of discordant pairs.

Comparison | Acc A (%) | Acc B (%) | McNemar χ² | A-only | B-only
O1 vs O2 (ZS vs FS, GPT-4o) | 35.06 | 41.11 | 149.69 | 120 | 400
O3 vs O4 (ZS vs FS, mini) | 35.16 | 38.72 | 36.18 | 150 | 275
O1 vs O5 (GPT-4o vs Llama 8B) | 35.06 | 20.19 | 432.28 | 892 | 203
O1 vs C2 (Open vs Closed) | 35.06 | 22.58 | 181.93 | 1,204 | 626

The discordant pair counts are informative: for the GPT-4o vs. Llama 8B comparison, 892 samples are solved only by GPT-4o while only 203 are solved only by Llama 8B, demonstrating a strong directional advantage. For the ZS vs. FS comparison on GPT-4o (O1 vs. O2), 400 samples are gained while only 120 are lost, confirming that few-shot examples provide a net benefit with limited trade-offs. The 95% bootstrap confidence intervals confirm non-overlapping ranges for all reported comparisons.


7. Discussion

The most counterintuitive finding is the failure of chain-of-thought prompting. CoT improves performance on mathematical reasoning and multi-hop QA by decomposing complex reasoning into intermediate steps [32], yet it degrades implicit entity recognition for every model tested. The explanation lies in the gestalt nature of the task: identifying an implicit entity requires simultaneously attending to a constellation of distributed cues and matching this constellation against parametric knowledge. When forced to reason step by step, models fixate on individual cues in isolation, arriving at locally plausible but globally incorrect entities. Temperature control experiments (O11b, O12b) confirm this is structural for smaller models (GPT-4.1-mini: +0.5pp at t=0.0, still 6.3pp below ZS) while for GPT-4o, controlling temperature recovers the gap to ZS parity (33.5% vs 33.3%) without exceeding it. CoT therefore neither helps nor hurts large models once temperature is controlled, but genuinely harms smaller models that lack capacity to maintain holistic context while verbalizing reasoning.

The RAG pipeline underperforms direct generation (19.71% vs 28.66% for GPT-4.1-mini FS) because dense retrievers encode the EEN as a single vector, losing fine-grained cue information through the bottleneck. Retrieved candidates are topically related but often incorrect, and when presented as context they can override the model's own correct intuition. With BGE-base, the gold entity appears in the top-5 only 33.41% of the time (C2 Hit@5), severely limiting the reranker.

The success of QLoRA (O10: 38.94% exact, up from 13.92% base) despite zero entity overlap warrants explanation. Fine-tuning teaches three transferable skills: the task format (extracting a single canonical name, avoiding verbose hedging), cue integration patterns (which temporal, spatial, and relational cue combinations are diagnostic), and entity type priors (calibrating expectations to reduce wrong-type errors). The model learns "how to solve implicit entity puzzles" rather than memorizing specific answers. Comparing the best open-world (O10: 38.94%) and closed-world (C5: 35.38% Hit@1, 42.80% alias) results, both paradigms reach roughly comparable alias-level performance, though the closed-world Hit@10 of 71.49% suggests that combining fine-tuned retrieval shortlists with fine-tuned LLM reranking is a promising future direction.

Entity difficulty correlates with cue specificity and knowledge-base neighborhood density. Events achieve 50.18% (GPT-4o FS) due to unique date and participant combinations, while Persons reach only 24.07% because generic biographical attributes (occupation, era, region) are shared by many candidates. This pattern is consistent across all methods.

Several limitations should be acknowledged. The benchmark covers English-language oral histories focused primarily on American experiences. The LLM-generated entity elision, though validated (naturalness 4.87/5, 42% recoverable by informed assessment), may not perfectly replicate naturally occurring implicit references. The alias-aware evaluation still penalizes semantically correct predictions using unregistered surface forms. Temperature controls were not run for Llama 3.1 8B CoT. Finally, QLoRA training used max_seq=192 tokens, which truncates approximately 6% of test prompts.


Figure 8: Relationship between model scale and open-world accuracy. Larger models achieve substantially higher accuracy, with the relationship appearing roughly log-linear in model parameter count. QLoRA fine-tuning (O10) breaks this trend, enabling an 8B model to outperform much larger models.


8. Conclusion

We have extended implicit entity recognition, previously studied in short social-media text [21, 22], to the domain of long-form reminiscence narratives, formalizing the non-locality property that distinguishes this setting and empirically validating it through a sentence-level ablation showing a 20.6pp accuracy gap between full-text and single-sentence inference. We release IRC-Bench, a benchmark of 25,136 samples spanning 12,337 Wikidata-linked entities from 1,994 oral history transcripts across 11 thematic domains. Our systematic evaluation across 19 experimental configurations reveals eight key findings.

First, fine-tuning is the single most impactful intervention. QLoRA-adapted Llama 3.1 8B achieves 38.94% exact match (51.59% Jaccard), nearly tripling the base model's zero-shot performance and surpassing GPT-4o few-shot by 7.32 percentage points, despite the entity-level split ensuring zero overlap with training entities. In the closed-world setting, DPR fine-tuning of BGE-base more than doubles Hit@1 from 16.64% to 35.38%, with the gold entity appearing in the top-10 for 71.49% of queries.

Second, chain-of-thought prompting degrades smaller models (by 4.51 to 7.70 pp), while temperature control experiments reveal that for GPT-4o, the observed CoT penalty is largely attributable to the higher sampling temperature rather than the reasoning structure itself. In all cases, CoT fails to exceed zero-shot performance, confirming that implicit entity recognition requires holistic pattern matching rather than sequential reasoning. Third, retrieval-augmented generation underperforms direct LLM inference due to the non-locality of implicit cues. Fourth, model scale is the dominant factor in zero-shot open-world accuracy, with performance spanning from 13.92% (Llama 3.1 8B) to 27.02% (GPT-4o) in exact match. Fifth, entity descriptions are consistently the best representation for dense retrieval, outperforming both entity names and Wikipedia lead sentences.

Future work should explore several promising directions: multi-modal implicit entity recognition incorporating audio features from the original recordings, cross-lingual benchmarks constructed from oral history archives in other languages, active learning approaches that combine fine-tuned DPR shortlists with fine-tuned LLM reranking (leveraging C5's 71.49% Hit@10), and the development of specialized architectures that explicitly model the non-locality property of implicit entity cues through structured attention over distributed text spans.


References

[1] Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3-26.

[2] Li, J., Sun, A., Han, J., and Li, C. (2022). A survey on deep learning for named entity recognition. IEEE TKDE, 34(1):50-70.

[3] Ganea, O.-E. and Hofmann, T. (2017). Deep joint entity disambiguation with local neural attention. In Proc. EMNLP, pages 2619-2629.

[4] Kolitsas, N., Ganea, O.-E., and Hofmann, T. (2018). End-to-end neural entity linking. In Proc. CoNLL, pages 519-529.

[5] Lee, K., He, L., Lewis, M., and Zettlemoyer, L. (2017). End-to-end neural coreference resolution. In Proc. EMNLP, pages 188-197.

[6] Boyd, D. A. (2012). Achieving the promise of oral history in a digital age. In Ritchie, D. A., editor, The Oxford Handbook of Oral History. Oxford University Press.

[7] Lazar, A., Demiris, G., and Thompson, H. (2016). Evaluation of a multifunctional technology system in a memory care unit: Opportunities for innovation in dementia care. Informatics for Health and Social Care, 41(4):373-389.

[8] Subramaniam, P. and Woods, B. (2012). The impact of individual reminiscence therapy for people with dementia. Expert Review of Neurotherapeutics, 12(5):545-555.

[9] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. In Proc. NAACL, pages 260-270.

[10] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL, pages 4171-4186.

[11] Xie, T., Li, Q., Zhang, J., Zhang, Y., Liu, Z., and Wang, H. (2023). Empirical study of zero-shot NER with ChatGPT. In Proc. EMNLP, pages 7935-7956.

[12] Ashok, D. and Lipton, Z. C. (2023). PromptNER: Prompting for named entity recognition. arXiv preprint arXiv:2305.15444.

[13] Sang, E. T. K. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task. In Proc. CoNLL, pages 142-147.

[14] Malmasi, S., et al. (2022). MultiCoNER: A large-scale multilingual dataset for complex named entity recognition. In Proc. COLING.

[15] Li, J., Fei, H., Liu, J., Wu, S., Zhang, M., Teng, C., Ji, D., and Li, F. (2022). Unified named entity recognition as word-word relation classification. In Proc. AAAI.

[16] Zhou, W., Zhang, S., Gu, Y., Chen, M., and Poon, H. (2024). UniversalNER: Targeted distillation from large language models for open named entity recognition. In Proc. ICLR.

[17] Wu, L., Petroni, F., Josifoski, M., Riedel, S., and Zettlemoyer, L. (2020). Scalable zero-shot entity linking with dense entity retrieval. In Proc. EMNLP, pages 6397-6407.

[18] De Cao, N., Izacard, G., Riedel, S., and Petroni, F. (2021). Autoregressive entity retrieval. In Proc. ICLR.

[19] Ayoola, T., Tyagi, S., Fisher, J., Christodoulopoulos, C., and Pierleoni, A. (2022). ReFinED: An efficient zero-shot-capable approach to end-to-end entity linking. In Proc. NAACL (Industry Track).

[20] Botha, J. A., Shan, Z., and Gillick, D. (2020). Entity linking in 100 languages. In Proc. EMNLP, pages 7833-7845.

[21] Hosseini, H. (2022). Implicit entity recognition and linking in tweets. PhD thesis, Toronto Metropolitan University.

[22] Hosseini, H. and Bagheri, E. (2021). Learning to rank implicit entities on Twitter. Information Processing & Management, 58(3):102503.

[23] Perera, N., Dehmer, M., and Emmert-Streib, F. (2020). Named entity recognition and relation detection for biomedical information extraction. Frontiers in Cell and Developmental Biology, 8:673.

[24] Treder, M. S., Lee, S., and Tsvetanov, K. A. (2024). Introduction to large language models (LLMs) for dementia care and research. Frontiers in Dementia, 3:1385303.

[25] Broadbent, E., Stafford, R., and MacDonald, B. (2009). Acceptance of healthcare robots for the older population: Review and future directions. International Journal of Social Robotics, 1(4):319-330.

[26] de Jager, A., Fogarty, A., Tewson, A., Lenette, C., and Boydell, K. M. (2017). Digital storytelling in research: A systematic review. The Qualitative Report, 22(10):2548-2582.

[27] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proc. EMNLP.

[28] Petroni, F., Rocktaschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., and Miller, A. (2019). Language models as knowledge bases? In Proc. EMNLP.

[29] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuttler, H., Lewis, M., Yih, W., Rocktaschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. NeurIPS.

[30] Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W. (2020). Dense passage retrieval for open-domain question answering. In Proc. EMNLP, pages 6769-6781.

[31] Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., and Nie, J.-Y. (2023). C-Pack: Packaged resources to advance general Chinese embedding. arXiv preprint arXiv:2309.07597.

[32] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Proc. NeurIPS.

[33] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In Proc. ICLR.

[34] Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized language models. In Proc. NeurIPS.

[35] Touvron, H., Martin, L., Stone, K., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[36] Dubey, A., et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[37] OpenAI (2024). GPT-4o system card. Technical Report.

[38] Butler, R. N. (1963). The life review: An interpretation of reminiscence in the aged. Psychiatry, 26(1), 65-76.

[39] Webster, J. D. (1993). Construction and validation of the Reminiscence Functions Scale. Journal of Gerontology, 48(5), P256-P262.

[40] Nikitina, S., Callaioli, S., and Baez, M. (2018). Smart conversational agents for reminiscence. Proceedings of the 1st International Workshop on Software Engineering for Cognitive Services, 52-57.

[41] Pessanha, F. and Akdag Salah, A. (2022). A computational look at oral history archives. ACM Journal on Computing and Cultural Heritage, 15(1):6:1-6:16.

[42] Hou, Y. (2020). Bridging anaphora resolution as question answering. In Proceedings of the 58th Annual Meeting of the ACL, pages 1428-1438.

[43] Poesio, M., Stuckardt, R., and Versley, Y. (2016). Anaphora Resolution: Algorithms, Resources, and Applications. Springer.


Appendix A: Prompt Templates

A.1 Zero-Shot Prompt

System message:

You are an entity recognition expert. Given a text that implicitly references a named entity without mentioning it, identify what entity is being referenced.

User message:

What named entity is implicitly referenced in this text? The entity is never mentioned by name. Text: "{text}" Think about the contextual cues (dates, places, events, people, roles) and identify the specific named entity being referenced. Answer with ONLY the entity name (canonical Wikipedia name), nothing else.

Parameters: temperature=0.0, max_tokens=100

A.2 Few-Shot Prompt

System message:

You are an entity recognition expert. Given a text that implicitly references a named entity without mentioning it, identify what entity is being referenced.

User message:

What named entity is implicitly referenced in this text? The entity is never mentioned by name. Examples: Text: "I remember that Sunday morning in December '41. We were listening to the radio when the news broke about the attack on the naval base in Hawaii. That's when everything changed." Entity: Attack on Pearl Harbor Text: "I enlisted right out of high school and went to boot camp in San Diego. As an aircraft mechanic, I was sent to the Pacific." Entity: United States Marine Corps Text: "After the surrender, we flew into the main islands. I landed in the bay and spent six months there for the occupation. The capital was flattened by the B-29s." Entity: Tokyo Text: "In late 1941, I was set to ship out from San Francisco. A friend ran up saying they're bombing the base in Hawaii." Entity: Attack on Pearl Harbor Text: "Growing up in that bustling metropolis with towering skyscrapers, I was immersed in a vibrant culture." Entity: New York City Now identify the entity in this text: Text: "{text}" Answer with ONLY the entity name (canonical Wikipedia name), nothing else.

Parameters: temperature=0.0, max_tokens=100

A.3 Chain-of-Thought Prompt

System message:

You are an entity recognition expert. Think step by step.

User message:

What named entity is implicitly referenced in this text? The entity is never mentioned by name. Text: "{text}" Think step by step: 1. What contextual cues are present? (dates, places, events, people, roles) 2. What type of entity do these cues suggest? (Person, Place, Organization, Event) 3. What specific named entity matches ALL these cues? Reasoning: [your step-by-step analysis] Entity: [canonical Wikipedia name]

Parameters: temperature=0.7, max_tokens=300

A.4 RAG Prompt

User message (no system message):

This text implicitly references a named entity without naming it. Based on the contextual cues, which candidate is most likely? Text: "{text}" Candidates: 1. {candidate_1} - {description_1} 2. {candidate_2} - {description_2} 3. {candidate_3} - {description_3} 4. {candidate_4} - {description_4} 5. {candidate_5} - {description_5} If none match well, suggest a better entity. Answer: [number]. [entity name]

Parameters: temperature=0.7, max_tokens=50

A.5 QLoRA Fine-tuning Prompt (O10)

<|begin_of_text|><|start_header_id|>system<|end_header_id|> You identify implicitly referenced entities.<|eot_id|> <|start_header_id|>user<|end_header_id|> What entity is implicitly referenced? Answer with only the entity name. Text: {implicit_text}<|eot_id|> <|start_header_id|>assistant<|end_header_id|> {entity}<|eot_id|>

Parameters: greedy decoding, max_new_tokens=30

Appendix B: Training Hyperparameters

B.1 DPR (Dense Passage Retrieval) Fine-tuning

Base model | BAAI/bge-base-en-v1.5
Model parameters | ~110M
Embedding dimension | 768
Training examples | 17,971
Epochs | 3
Batch size | 48
Learning rate | 2e-5
Warmup steps | 100
Loss function | MultipleNegativesRankingLoss (MNRL)
Optimizer | AdamW
Mixed precision | FP16 (AMP)
Negatives | In-batch (47 negatives per sample)
Random seed | 42

Three separate models are trained, one for each entity representation (name, description, wiki). All use identical hyperparameters.
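The in-batch-negative objective can be sketched in plain PyTorch. This is a simplified illustration of MNRL, not the sentence-transformers implementation itself; the similarity scale of 20 matches that library's default:

```python
import torch
import torch.nn.functional as F

def mnrl_loss(query_emb, pos_emb, scale=20.0):
    """MultipleNegativesRankingLoss over a batch of
    (narrative, entity-representation) embedding pairs."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    scores = scale * q @ p.T  # (batch, batch) scaled cosine similarities
    # Row i's correct match is column i; the remaining batch_size - 1
    # columns act as in-batch negatives (47 per sample at batch size 48).
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```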

B.2 QLoRA (O10) Fine-tuning

Base model: meta-llama/Llama-3.1-8B-Instruct
Model parameters: ~8B (base); ~6.5M trainable (LoRA)
Quantization: 4-bit NormalFloat (NF4)
Compute dtype: bfloat16
LoRA rank (r): 16
LoRA alpha (α): 32
LoRA dropout: 0.05
Target modules: q_proj, v_proj, k_proj, o_proj
Training examples: 17,971
Epochs: 2
Per-device batch size: 48
Gradient accumulation: 1 (effective batch: 48)
Learning rate: 2e-4
Max sequence length: 192 tokens
Warmup steps: 50
Precision: bfloat16
Validation samples: 500 (subset of dev)
Framework: TRL SFTTrainer + PEFT
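The quantization and adapter settings above map onto the standard Hugging Face configuration objects. A hedged sketch, assuming recent transformers/peft versions (variable names are ours):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model (QLoRA),
# with bfloat16 compute as listed in the table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on the attention projections (~6.5M trainable parameters).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```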

Appendix C: Example Predictions

We present five correct and five incorrect predictions from GPT-4o few-shot (O2), together with selected O10 outputs, to illustrate the task's characteristics and failure modes.

C.1 Correct Predictions (O2: GPT-4o Few-shot)

Correct Example 1

"I studied at a major public university in Northern California, where I was part of the Design Department. During my time there, I combined academic courses with art classes, focusing on three-dimensional design..."

Gold: University of California, Berkeley  |  Prediction: University of California, Berkeley  

Correct Example 2

"I was born in the capital city of Germany in the early 1930s. It was a turbulent time as the political climate was rapidly changing. My family decided to leave that city in 1938 to escape the dangers posed by the Nazi regime. That move shaped much of my early life and future."

Gold: Berlin  |  Prediction: Berlin  

Correct Example 3

"When I first came to America, I worked in a Pacific island territory for eight months on a sugar plantation. I was only 15 years old and worked under a Chinese boss for $18 a month..."

Gold: Hawaii  |  Prediction: Hawaii  

Correct Example 4

"My grandfather was a teenager during the major 1950s political upheaval in our Caribbean homeland and once found a journal from someone fighting with the revolutionary leader. That period was filled with fear for my family and community. The uprising brought about communism, which had some positive effects like high literacy rates, but also caused extreme poverty and suffering. The memories of that era shape how older immigrants from that island view politics in the United States today."

Gold: Cuban Revolution  |  Prediction: Cuban Revolution  

Correct Example 5

"He and I were close when he was Senate majority leader, and he was very cordial to me when I first came to the Senate. He gave me important committee assignments, including chairing the Calendar Committee and seats on the Agricultural and Finance Committees. He was probably the most able majority leader in history, knowing the Senate's personalities and how to motivate them. As President, he overcommitted on social programs, which I believe contributed to the huge deficits we face today."

Gold: Lyndon B. Johnson  |  Prediction: Lyndon B. Johnson  

These correct examples demonstrate cases where geographic cues ("capital city of Germany," "Pacific island territory"), temporal markers ("early 1930s," "1950s"), and role or contextual details ("Senate majority leader," the Nazi regime, sugar plantations, Caribbean communism) are sufficiently distinctive for the model to identify entities across Place, Event, and Person types.

C.2 Incorrect Predictions (O2: GPT-4o Few-shot)

Incorrect Example 1: Wrong type

"My great-grandfather left Lithuania in the early 1900s to escape oppression and seek a better life in America. He arrived before World War I and worked hard to establish himself, eventually sending for his family..."

Gold: Solomon Goodman  |  Prediction: Lithuanian Jews  

The model predicted a group/category rather than the specific individual being described.

Incorrect Example 2: Wrong type

"I once kept a newspaper clipping of a write-up about a historic estate by a famous 19th-century author, but unfortunately, I have misplaced it. The author's writing gave me some insight into the estate..."

Gold: Harriet Beecher Stowe  |  Prediction: Monticello  

The model focused on the "historic estate" cue rather than the "famous 19th-century author" cue.

Incorrect Example 3: Wrong type

"The priest who taught me algebra and later became the bishop of a diocese in eastern Washington questioned my presence in his advanced algebra class because I lacked the necessary background..."

Gold: Bishop of Spokane  |  Prediction: West Point  

The model produced a completely unrelated entity, apparently latching onto the academic setting rather than the religious role being described.

Incorrect Example 4: Same-type, related

"Remote healthcare delivery became a critical part of how we provided care during the viral outbreak. Initially, we relied on phone calls, but within weeks, our organization quickly implemented video..."

Gold: Telehealth  |  Prediction: COVID-19 pandemic  

The model identified the correct general domain but predicted the contextual event rather than the practice being described.

Incorrect Example 5: Same-type, near miss

"He was my Ph.D. advisor at the California university starting in 1956. He was a brilliant economist who later won the Nobel Prize, and studying under him greatly influenced my academic development. His mentorship helped shape my approach to economics and game theory."

Gold: Kenneth Arrow  |  Prediction: John Forbes Nash Jr.  

The model predicted a Nobel laureate economist associated with game theory, but confused the advisor (Arrow, at Stanford) with another famous figure in the same field.

These errors illustrate the principal challenges of implicit entity recognition: distinguishing the referenced entity from related contextual entities (Examples 4, 5), resolving references to obscure individuals (Examples 1, 3), and focusing on the correct cue among multiple competing signals (Example 2). Example 5 is particularly instructive: both Kenneth Arrow and John Nash are Nobel laureate economists linked to game theory, but the cues (Ph.D. advisor, California, 1956) point specifically to Arrow at Stanford.