Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

[0.1.0] - 2025-01-01

Added

  • Phase 1: Data loading and cleaning pipeline (data_loader.py)

    • CSV and JSONL data loading with Polars
    • OpenAlex inverted index abstract decoding
    • Parquet output format
    • Sample mode with 5 representative laureates
  • Phase 1.5: Content enrichment (content_enricher.py)

    • Semantic Scholar API integration
    • Open Access PDF download and text extraction (PyMuPDF)
    • Unpaywall API for OA link discovery
    • Multi-strategy enrichment with caching
  • Phase 2a: OpenAlex enrichment (openalex_enricher.py)

    • Concept and topic enrichment via OpenAlex API
    • Field classification
    • Response caching
  • Phase 2b: LLM concept extraction (concept_extractor.py)

    • Concept extraction powered by GPT-4o-mini
    • Structured JSON output with confidence scores
    • INTRODUCES vs APPLIES relationship detection
    • Cross-discipline source identification
  • Phase 3+4: Knowledge graph construction (graph_builder.py)

    • 5 node types: Laureate, Award, Work, Concept, Field
    • 9 edge types including CROSS_INSPIRED
    • JSON and GraphML export formats
    • Cross-disciplinary migration detection
  • Phase 5: Visualization (visualize.py)

    • Interactive network graph (Pyvis/vis.js)
    • Concept timeline (Plotly)
    • Cross-field heatmap (Plotly)
  • Phase 6: Insight analysis (insight_analyzer.py)

    • Hub concept identification
    • Field influence analysis
    • Temporal pattern detection
    • Key pathway discovery
    • Markdown + JSON report generation
  • Infrastructure:

    • YAML-based configuration (config/settings.yaml)
    • Environment variable support (.env)
    • Pipeline orchestrator (main.py) with phase selection
    • Comprehensive documentation with i18n support
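
For readers unfamiliar with the abstract decoding listed under Phase 1: OpenAlex returns abstracts as an inverted index, a mapping from each word to the list of positions where it occurs. The sketch below shows one way such an index could be turned back into plain text. The function name `decode_inverted_index` is hypothetical and this is not necessarily how data_loader.py implements it.

```python
def decode_inverted_index(inverted_index):
    """Rebuild abstract text from an OpenAlex-style inverted index.

    Assumption: `inverted_index` maps each word to a list of integer
    positions, as in the `abstract_inverted_index` field of an OpenAlex work.
    """
    if not inverted_index:
        # Works without an abstract carry a null or empty index.
        return ""

    # Flatten to (position, word) pairs so the words can be ordered.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    positioned.sort()

    return " ".join(word for _, word in positioned)


example = {"Graphs": [0], "link": [1], "ideas": [2, 4], "to": [3]}
print(decode_inverted_index(example))
```

A word that appears several times in the abstract simply has several positions in its list, so it is emitted once per position.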