Phase 2: Concept Extraction¶

This phase has two sub-phases that work together to identify and extract technical concepts from publications.

Phase 2a: OpenAlex Concept Enrichment¶

Module: src/openalex_enricher.py

Objective¶

Supplement each paper with structured concept and topic data from the OpenAlex API, providing field/discipline classification.

Process¶

For each paper with an OpenAlex work ID, query the OpenAlex API
Extract concepts, topics, and primary field classification
Cache responses locally to avoid repeated API calls
Add enrichment columns to the publications DataFrame

Added Columns¶

Column	Type	Description
`openalex_concepts`	string (JSON)	List of concepts with scores
`openalex_topics`	string (JSON)	Topic classifications
`openalex_field`	string	Primary field (e.g., "Physics", "Chemistry")

Running¶

uv run python main.py --phase 2    # Runs both 2a and 2b
uv run python -m src.openalex_enricher  # Standalone

Phase 2b: LLM Concept Extraction¶

Module: src/concept_extractor.py

Objective¶

Use GPT-4o-mini to extract structured technical concepts from paper titles, abstracts, and keywords. This is the core intelligence layer that identifies what concepts a paper introduces vs. applies.

Tiered Strategy¶

Tier	Selection Criteria	Method	Papers
Tier 1	Award-related + citation ≥ 500	Full LLM extraction	~1,000
Tier 2	Citation ≥ 50	Keywords + OpenAlex concepts	~30,000
Tier 3	All others	Keyword-only coarse tagging	Remaining

LLM Extraction Schema¶

For each paper, the LLM returns structured JSON:

{
  "concepts": [
    {
      "name": "Ubiquitin-Proteasome Pathway",
      "field": "Biology",
      "subfield": "Molecular Biology",
      "relationship": "INTRODUCES",
      "confidence": 0.95,
      "cross_discipline_source": null
    },
    {
      "name": "Protein Degradation",
      "field": "Biology",
      "subfield": "Biochemistry",
      "relationship": "APPLIES",
      "confidence": 0.85,
      "cross_discipline_source": "Chemistry"
    }
  ]
}

Key Fields¶

Field	Description
`name`	Concept name (standardized)
`field`	Primary disciplinary field
`subfield`	More specific sub-discipline
`relationship`	`INTRODUCES` (paper proposes it) or `APPLIES` (paper uses it)
`confidence`	Extraction confidence score (0-1)
`cross_discipline_source`	If the concept originated from another field

Output Files¶

File	Location	Description
`llm_extraction_raw.json`	`output/concepts/`	Raw LLM responses for all processed papers
`concepts.parquet`	`output/concepts/`	Flattened concept table

Running¶

uv run python main.py --phase 2      # Both 2a and 2b
uv run python main.py --skip-llm     # Skip LLM (2b only)
uv run python -m src.concept_extractor  # Standalone LLM extraction

API Cost

LLM extraction requires an OpenAI API key and incurs API costs. The tiered strategy limits Tier 1 to ~1,000 papers to keep costs manageable (approximately $2-5 with GPT-4o-mini).