Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
[0.1.0] - 2025-01-01
Added
- Phase 1: Data loading and cleaning pipeline (data_loader.py)
  - CSV and JSONL data loading with Polars
  - OpenAlex inverted index abstract decoding
  - Parquet output format
  - Sample mode with 5 representative laureates
- Phase 1.5: Content enrichment (content_enricher.py)
  - Semantic Scholar API integration
  - Open Access PDF download and text extraction (PyMuPDF)
  - Unpaywall API for OA link discovery
  - Multi-strategy enrichment with caching
- Phase 2a: OpenAlex enrichment (openalex_enricher.py)
  - Concept and topic enrichment via OpenAlex API
  - Field classification
  - Response caching
- Phase 2b: LLM concept extraction (concept_extractor.py)
  - GPT-4o-mini powered concept extraction
  - Structured JSON output with confidence scores
  - INTRODUCES vs APPLIES relationship detection
  - Cross-discipline source identification
- Phase 3+4: Knowledge graph construction (graph_builder.py)
  - 5 node types: Laureate, Award, Work, Concept, Field
  - 9 edge types including CROSS_INSPIRED
  - JSON and GraphML export formats
  - Cross-disciplinary migration detection
- Phase 5: Visualization (visualize.py)
  - Interactive network graph (Pyvis/vis.js)
  - Concept timeline (Plotly)
  - Cross-field heatmap (Plotly)
- Phase 6: Insight analysis (insight_analyzer.py)
  - Hub concept identification
  - Field influence analysis
  - Temporal pattern detection
  - Key pathway discovery
  - Markdown + JSON report generation
- Infrastructure:
  - YAML-based configuration (config/settings.yaml)
  - Environment variable support (.env)
  - Pipeline orchestrator (main.py) with phase selection
  - Comprehensive documentation with i18n support
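
For reference, the "OpenAlex inverted index abstract decoding" item under Phase 1 refers to the fact that OpenAlex returns abstracts as a mapping from each word to the positions where it occurs, not as plain text. The sketch below shows one way such a mapping could be turned back into a string. It is an illustration only, not the code in data_loader.py; the function name `decode_abstract` and the sample data are hypothetical.

```python
from typing import Dict, List, Optional


def decode_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild abstract text from an OpenAlex-style inverted index.

    The index maps each word to the list of positions at which it occurs.
    (Hypothetical helper; the project's own implementation may differ.)
    """
    if not inverted_index:
        # Works without an abstract carry a null/empty index.
        return ""

    # Flatten to (position, word) pairs, then order by position.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


# Made-up sample index, not real OpenAlex data.
example = {"graphs": [1, 4], "Knowledge": [0], "link": [2], "other": [3]}
print(decode_abstract(example))  # Knowledge graphs link other graphs
```

Sorting the flattened pairs keeps the sketch short; a version that preallocates a list by maximum position would avoid the sort but needs extra care if positions are missing.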