src/concept_extractor¶
Phase 2b module for LLM-based technical concept extraction.
Constants¶
CONCEPT_EXTRACTION_PROMPT¶
System prompt template for GPT-4o-mini that instructs the model to:
- Extract 1–3 key technical concepts per paper
- Classify each as `INTRODUCES` (paper proposes) or `APPLIES` (paper uses)
- Identify the field and subfield
- Detect cross-disciplinary sources
- Return structured JSON
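A minimal sketch of what such a prompt constant could look like. The exact wording below is an assumption for illustration, not the module's actual template:

```python
# Hypothetical sketch; the real CONCEPT_EXTRACTION_PROMPT in
# src/concept_extractor may differ in wording and structure.
CONCEPT_EXTRACTION_PROMPT = """\
You are an expert research analyst. Given a paper's title, abstract,
keywords, and year, extract 1-3 key technical concepts.

For each concept return a JSON object with:
- name: the concept name
- field: the primary field
- subfield: the sub-discipline
- relationship: "INTRODUCES" if the paper proposes the concept,
  "APPLIES" if it merely uses it
- confidence: a float between 0.0 and 1.0
- cross_discipline_source: the source field if the concept is borrowed
  from another discipline, otherwise null

Respond with a JSON array of these objects and nothing else.
"""
```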
Functions¶
extract_concepts_llm(paper: dict, client: OpenAI, model: str) → list[dict]¶
Extract structured technical concepts from a single paper using LLM.
Parameters:
| Name | Type | Description |
|---|---|---|
| `paper` | `dict` | Paper data with keys: `title`, `abstract`, `keywords`, `year` |
| `client` | `OpenAI` | Initialized OpenAI client |
| `model` | `str` | Model name (e.g., `gpt-4o-mini`) |
Returns: List of concept dicts, each containing:
```python
{
    "name": str,                        # Concept name
    "field": str,                       # Primary field
    "subfield": str,                    # Sub-discipline
    "relationship": str,                # "INTRODUCES" | "APPLIES"
    "confidence": float,                # 0.0 - 1.0
    "cross_discipline_source": str | None  # Source field if cross-disciplinary
}
```
Raises: Nothing; API errors are caught and an empty list is returned instead.
run() → None¶
Execute the full Phase 2b pipeline:
- Load `publications.parquet` and `awards.parquet`
- Initialize OpenAI client with config from `settings.yaml`
- Iterate through papers, calling the LLM for concept extraction
- Save raw responses to `output/concepts/llm_extraction_raw.json`
- Flatten and save to `output/concepts/concepts.parquet`
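The flattening step (raw per-paper responses to one row per concept) might look like the sketch below; the `paper_id` key and helper name are assumptions for illustration:

```python
def flatten_concepts(raw: dict[str, list[dict]]) -> list[dict]:
    """Turn {paper_id: [concept, ...]} into flat rows suitable for
    writing to a parquet table (e.g., via pandas DataFrame.to_parquet)."""
    rows = []
    for paper_id, concepts in raw.items():
        for concept in concepts:
            # Tag each concept row with its source paper.
            rows.append({"paper_id": paper_id, **concept})
    return rows
```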
Requirements¶
Requires the `OPENAI_API_KEY` environment variable to be set.
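A common way to fail fast when the key is missing, shown as a sketch (this pattern is an assumption, not necessarily what the module does):

```python
import os

def require_api_key() -> str:
    """Return OPENAI_API_KEY, raising a clear error if it is unset."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the pipeline."
        )
    return key
```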