Phase 2: Concept Extraction

This phase has two sub-phases that work together to identify and extract technical concepts from publications.

Phase 2a: OpenAlex Concept Enrichment

Module: src/openalex_enricher.py

Objective

Supplement each paper with structured concept and topic data from the OpenAlex API, providing field/discipline classification.

Process

  1. For each paper with an OpenAlex work ID, query the OpenAlex API
  2. Extract concepts, topics, and primary field classification
  3. Cache responses locally to avoid repeated API calls
  4. Add enrichment columns to the publications DataFrame
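
The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/openalex_enricher.py: the function names, the one-file-per-work cache layout, and the exact record keys read here (`concepts`, `topics`, `primary_topic.field.display_name`) are assumptions about how an OpenAlex work record might be consumed.

```python
import json
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# Public OpenAlex works endpoint; the work ID looks like "W2741809807".
API_URL = "https://api.openalex.org/works/{work_id}"


def fetch_work(work_id: str) -> dict:
    """Fetch one work record from the OpenAlex API (network call)."""
    with urlopen(API_URL.format(work_id=work_id), timeout=30) as resp:
        return json.load(resp)


def cached_work(work_id: str, cache_dir: Path,
                fetch: Callable[[str], dict] = fetch_work) -> dict:
    """Return the work record, reading the local JSON cache when present."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{work_id}.json"  # hypothetical cache layout
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    record = fetch(work_id)
    path.write_text(json.dumps(record), encoding="utf-8")
    return record


def enrichment_columns(record: dict) -> dict:
    """Build the three enrichment values for one paper."""
    concepts = [
        {"name": c.get("display_name"), "score": c.get("score")}
        for c in record.get("concepts") or []
    ]
    topics = [t.get("display_name") for t in record.get("topics") or []]
    primary = record.get("primary_topic") or {}
    field = (primary.get("field") or {}).get("display_name")
    return {
        "openalex_concepts": json.dumps(concepts),  # stored as a JSON string
        "openalex_topics": json.dumps(topics),      # stored as a JSON string
        "openalex_field": field,                    # plain string or None
    }
```

Passing the fetcher as a parameter keeps the cache logic testable without network access; a second call for the same work ID reads the cached file instead of hitting the API.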

Added Columns

| Column | Type | Description |
|---|---|---|
| openalex_concepts | string (JSON) | List of concepts with scores |
| openalex_topics | string (JSON) | Topic classifications |
| openalex_field | string | Primary field (e.g., "Physics", "Chemistry") |

Running

uv run python main.py --phase 2    # Runs both 2a and 2b
uv run python -m src.openalex_enricher  # Standalone

Phase 2b: LLM Concept Extraction

Module: src/concept_extractor.py

Objective

Use GPT-4o-mini to extract structured technical concepts from paper titles, abstracts, and keywords. This is the core intelligence layer: it identifies which concepts a paper introduces and which it only applies.

Tiered Strategy

| Tier | Selection Criteria | Method | Papers |
|---|---|---|---|
| Tier 1 | Award-related + citation ≥ 500 | Full LLM extraction | ~1,000 |
| Tier 2 | Citation ≥ 50 | Keywords + OpenAlex concepts | ~30,000 |
| Tier 3 | All others | Keyword-only coarse tagging | Remaining |
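
The selection criteria can be read as a simple cascade. The sketch below is illustrative only: `assign_tier` and its parameters are hypothetical names, and it assumes that "Award-related + citation ≥ 500" means both conditions must hold.

```python
def assign_tier(citations: int, award_related: bool) -> int:
    """Map a paper to an extraction tier (1, 2, or 3)."""
    # Assumption: Tier 1 requires BOTH award relevance and >= 500 citations.
    if award_related and citations >= 500:
        return 1
    # Tier 2: any remaining paper with at least 50 citations.
    if citations >= 50:
        return 2
    # Tier 3: everything else gets keyword-only coarse tagging.
    return 3
```

Under this reading, a highly cited paper that is not award-related falls into Tier 2 rather than Tier 1.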

LLM Extraction Schema

For each paper, the LLM returns structured JSON:

{
  "concepts": [
    {
      "name": "Ubiquitin-Proteasome Pathway",
      "field": "Biology",
      "subfield": "Molecular Biology",
      "relationship": "INTRODUCES",
      "confidence": 0.95,
      "cross_discipline_source": null
    },
    {
      "name": "Protein Degradation",
      "field": "Biology",
      "subfield": "Biochemistry",
      "relationship": "APPLIES",
      "confidence": 0.85,
      "cross_discipline_source": "Chemistry"
    }
  ]
}

Key Fields

| Field | Description |
|---|---|
| name | Concept name (standardized) |
| field | Primary disciplinary field |
| subfield | More specific sub-discipline |
| relationship | INTRODUCES (paper proposes it) or APPLIES (paper uses it) |
| confidence | Extraction confidence score (0-1) |
| cross_discipline_source | If the concept originated from another field |
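
Because the LLM output is free-form text until it is parsed, it can help to validate each entry against the fields listed above. The following is a minimal sketch using only the standard library; the `Concept` class and `parse_concepts` helper are hypothetical names, not part of src/concept_extractor.py.

```python
import json
from dataclasses import dataclass
from typing import Optional

RELATIONSHIPS = {"INTRODUCES", "APPLIES"}


@dataclass(frozen=True)
class Concept:
    name: str
    field: str
    subfield: str
    relationship: str
    confidence: float
    cross_discipline_source: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values outside the documented schema.
        if self.relationship not in RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {self.relationship!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence!r}")


def parse_concepts(raw: str) -> list[Concept]:
    """Parse one LLM response string into validated Concept objects."""
    payload = json.loads(raw)
    return [Concept(**item) for item in payload.get("concepts", [])]
```

An entry with an unexpected key raises `TypeError` from the dataclass constructor, and an invalid `relationship` or `confidence` raises `ValueError`, so malformed responses surface early instead of reaching the concept table.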

Output Files

| File | Location | Description |
|---|---|---|
| llm_extraction_raw.json | output/concepts/ | Raw LLM responses for all processed papers |
| concepts.parquet | output/concepts/ | Flattened concept table |
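
The relationship between the two files can be pictured as a flattening step: one row per (paper, concept) pair. The sketch below assumes the raw file is a JSON object mapping a paper identifier to that paper's LLM response; that layout, the `paper_id` column name, and the `flatten_concepts` helper are all assumptions for illustration.

```python
def flatten_concepts(raw: dict) -> list[dict]:
    """Turn {paper_id: {"concepts": [...]}} into one flat row per concept."""
    rows = []
    for paper_id, response in raw.items():
        for concept in response.get("concepts", []):
            # Prefix each concept with the paper it came from.
            rows.append({"paper_id": paper_id, **concept})
    return rows
```

The resulting rows could then be written out with, for example, `pandas.DataFrame(rows).to_parquet(...)`, which needs a Parquet engine such as pyarrow installed.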

Running

uv run python main.py --phase 2      # Both 2a and 2b
uv run python main.py --skip-llm     # Skip the LLM step (Phase 2b)
uv run python -m src.concept_extractor  # Standalone LLM extraction

API Cost

LLM extraction requires an OpenAI API key and incurs API costs. The tiered strategy restricts full LLM extraction to Tier 1 (~1,000 papers), which keeps the cost manageable (approximately $2-5 with GPT-4o-mini).