Phase 2: Concept Extraction¶
This phase has two sub-phases that work together to identify and extract technical concepts from publications.
Phase 2a: OpenAlex Concept Enrichment¶
Module: src/openalex_enricher.py
Objective¶
Supplement each paper with structured concept and topic data from the OpenAlex API, providing field/discipline classification.
Process¶
- For each paper with an OpenAlex work ID, query the OpenAlex API
- Extract
concepts,topics, and primaryfieldclassification - Cache responses locally to avoid repeated API calls
- Add enrichment columns to the publications DataFrame
Added Columns¶
| Column | Type | Description |
|---|---|---|
openalex_concepts |
string (JSON) | List of concepts with scores |
openalex_topics |
string (JSON) | Topic classifications |
openalex_field |
string | Primary field (e.g., "Physics", "Chemistry") |
Running¶
uv run python main.py --phase 2 # Runs both 2a and 2b
uv run python -m src.openalex_enricher # Standalone
Phase 2b: LLM Concept Extraction¶
Module: src/concept_extractor.py
Objective¶
Use GPT-4o-mini to extract structured technical concepts from paper titles, abstracts, and keywords. This is the core intelligence layer that identifies what concepts a paper introduces vs. applies.
Tiered Strategy¶
| Tier | Selection Criteria | Method | Papers |
|---|---|---|---|
| Tier 1 | Award-related + citation ≥ 500 | Full LLM extraction | ~1,000 |
| Tier 2 | Citation ≥ 50 | Keywords + OpenAlex concepts | ~30,000 |
| Tier 3 | All others | Keyword-only coarse tagging | Remaining |
LLM Extraction Schema¶
For each paper, the LLM returns structured JSON:
{
"concepts": [
{
"name": "Ubiquitin-Proteasome Pathway",
"field": "Biology",
"subfield": "Molecular Biology",
"relationship": "INTRODUCES",
"confidence": 0.95,
"cross_discipline_source": null
},
{
"name": "Protein Degradation",
"field": "Biology",
"subfield": "Biochemistry",
"relationship": "APPLIES",
"confidence": 0.85,
"cross_discipline_source": "Chemistry"
}
]
}
Key Fields¶
| Field | Description |
|---|---|
name |
Concept name (standardized) |
field |
Primary disciplinary field |
subfield |
More specific sub-discipline |
relationship |
INTRODUCES (paper proposes it) or APPLIES (paper uses it) |
confidence |
Extraction confidence score (0-1) |
cross_discipline_source |
If the concept originated from another field |
Output Files¶
| File | Location | Description |
|---|---|---|
llm_extraction_raw.json |
output/concepts/ |
Raw LLM responses for all processed papers |
concepts.parquet |
output/concepts/ |
Flattened concept table |
Running¶
uv run python main.py --phase 2 # Both 2a and 2b
uv run python main.py --skip-llm # Skip LLM (2b only)
uv run python -m src.concept_extractor # Standalone LLM extraction
API Cost
LLM extraction requires an OpenAI API key and incurs API costs. The tiered strategy limits Tier 1 to ~1,000 papers to keep costs manageable (approximately $2-5 with GPT-4o-mini).