src/concept_extractor

Phase 2b module for LLM-based technical concept extraction.

Constants

CONCEPT_EXTRACTION_PROMPT

System prompt template for GPT-4o-mini that instructs the model to:

  • Extract 1–3 key technical concepts per paper
  • Classify each as INTRODUCES (paper proposes) or APPLIES (paper uses)
  • Identify the field and subfield
  • Detect cross-disciplinary sources
  • Return structured JSON
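The prompt's exact wording lives in the module; a hypothetical sketch of its shape, covering the behaviors listed above (the real template's text and phrasing will differ), might look like:

```python
# Hypothetical sketch of CONCEPT_EXTRACTION_PROMPT. Only the required
# behaviors (1-3 concepts, INTRODUCES/APPLIES, field/subfield,
# cross-disciplinary source, JSON output) come from this documentation;
# the wording is illustrative.
CONCEPT_EXTRACTION_PROMPT = """\
You are an expert at analyzing research papers.
Extract 1-3 key technical concepts from the paper below.
For each concept:
- classify the relationship as INTRODUCES (the paper proposes it)
  or APPLIES (the paper uses it),
- identify its field and subfield,
- note a cross-disciplinary source field if one applies.
Respond with JSON only, in the form:
{"concepts": [{"name": ..., "field": ..., "subfield": ...,
               "relationship": ..., "confidence": ...,
               "cross_discipline_source": ...}]}
"""
```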

Functions

extract_concepts_llm(paper: dict, client: OpenAI, model: str) → list[dict]

Extract structured technical concepts from a single paper using LLM.

Parameters:

Name     Type     Description
paper    dict     Paper data with keys: title, abstract, keywords, year
client   OpenAI   Initialized OpenAI client
model    str      Model name (e.g., gpt-4o-mini)

Returns: List of concept dicts, each containing:

{
    "name": str,              # Concept name
    "field": str,             # Primary field
    "subfield": str,          # Sub-discipline
    "relationship": str,      # "INTRODUCES" | "APPLIES"
    "confidence": float,      # 0.0 - 1.0
    "cross_discipline_source": str | None  # Source field if cross-disciplinary
}

Error handling: does not raise on API failures; returns an empty list instead.


run() → None

Execute the full Phase 2b pipeline:

  1. Load publications.parquet and awards.parquet
  2. Initialize OpenAI client with config from settings.yaml
  3. Iterate through papers, calling LLM for concept extraction
  4. Save raw responses to output/concepts/llm_extraction_raw.json
  5. Flatten and save to output/concepts/concepts.parquet
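The flattening in step 5 (one row per extracted concept, tagged with the paper it came from) could be sketched like this; the paper_id column name and the {paper_id: [concepts]} shape of the raw JSON are assumptions:

```python
def flatten_concepts(raw):
    """Turn {paper_id: [concept dicts]} into flat rows for a parquet table.

    The {paper_id: [...]} shape of llm_extraction_raw.json is an assumption;
    the real file may key the raw responses differently.
    """
    rows = []
    for paper_id, concepts in raw.items():
        for concept in concepts:
            # One output row per concept, carrying its paper's id.
            rows.append({"paper_id": paper_id, **concept})
    return rows

# The rows would then be written with something like:
#   pd.DataFrame(rows).to_parquet("output/concepts/concepts.parquet")
```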

Requirements

Requires OPENAI_API_KEY environment variable to be set.
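A small fail-fast guard for this requirement might look like the following (a sketch; the module may check the variable differently):

```python
import os

def require_api_key(env=None):
    """Fail fast with a clear message if OPENAI_API_KEY is missing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running Phase 2b."
        )
    return key
```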