src/concept_extractor

Phase 2b module for LLM-based technical concept extraction.

Constants

CONCEPT_EXTRACTION_PROMPT

System prompt template for GPT-4o-mini that instructs the model to:

  • Extract 1–3 key technical concepts per paper
  • Classify each as INTRODUCES (paper proposes) or APPLIES (paper uses)
  • Identify the field and subfield
  • Detect cross-disciplinary sources
  • Return structured JSON
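The prompt's exact wording lives in the module; a hypothetical sketch of its shape, covering the behaviors listed above (the real template's text and phrasing will differ), might look like:

```python
# Hypothetical sketch of CONCEPT_EXTRACTION_PROMPT. Only the required
# behaviors (1-3 concepts, INTRODUCES/APPLIES, field/subfield,
# cross-disciplinary source, JSON output) come from this documentation;
# the wording is illustrative.
CONCEPT_EXTRACTION_PROMPT = """\
You are an expert at analyzing research papers.
Extract 1-3 key technical concepts from the paper below.
For each concept:
- classify the relationship as INTRODUCES (the paper proposes it)
  or APPLIES (the paper uses it),
- identify its field and subfield,
- note a cross-disciplinary source field if one applies.
Respond with JSON only, in the form:
{"concepts": [{"name": ..., "field": ..., "subfield": ...,
               "relationship": ..., "confidence": ...,
               "cross_discipline_source": ...}]}
"""
```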

Functions

extract_concepts_llm(paper: dict, client: OpenAI, model: str) → list[dict]

Extract structured technical concepts from a single paper using LLM.

Parameters:

Name     Type     Description
paper    dict     Paper data with keys: title, abstract, keywords, year
client   OpenAI   Initialized OpenAI client
model    str      Model name (e.g., gpt-4o-mini)

Returns: List of concept dicts, each containing:

{
    "name": str,              # Concept name
    "field": str,             # Primary field
    "subfield": str,          # Sub-discipline
    "relationship": str,      # "INTRODUCES" | "APPLIES"
    "confidence": float,      # 0.0 - 1.0
    "cross_discipline_source": str | None  # Source field if cross-disciplinary
}

Error handling: does not raise on API failures; returns an empty list instead.


run() → None

Execute the full Phase 2b pipeline:

  1. Load publications.parquet and awards.parquet
  2. Initialize OpenAI client with config from settings.yaml
  3. Iterate through papers, calling LLM for concept extraction
  4. Save raw responses to output/concepts/llm_extraction_raw.json
  5. Flatten and save to output/concepts/concepts.parquet
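The flattening in step 5 (one row per extracted concept, tagged with the paper it came from) could be sketched like this; the paper_id column name and the {paper_id: [concepts]} shape of the raw JSON are assumptions:

```python
def flatten_concepts(raw):
    """Turn {paper_id: [concept dicts]} into flat rows for a parquet table.

    The {paper_id: [...]} shape of llm_extraction_raw.json is an assumption;
    the real file may key the raw responses differently.
    """
    rows = []
    for paper_id, concepts in raw.items():
        for concept in concepts:
            # One output row per concept, carrying its paper's id.
            rows.append({"paper_id": paper_id, **concept})
    return rows

# The rows would then be written with something like:
#   pd.DataFrame(rows).to_parquet("output/concepts/concepts.parquet")
```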

Requirements

Requires OPENAI_API_KEY environment variable to be set.
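A small fail-fast guard for this requirement might look like the following (a sketch; the module may check the variable differently):

```python
import os

def require_api_key(env=None):
    """Fail fast with a clear message if OPENAI_API_KEY is missing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running Phase 2b."
        )
    return key
```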