# Architecture

## System Overview

The Nobel Prize Knowledge Graph is built as a multi-phase data pipeline that transforms raw Nobel Prize and publication data into an interactive, queryable knowledge graph.

```mermaid
flowchart TB
    subgraph Input["Input Data"]
        CSV[CSV Tables<br>757 laureates, 245K papers]
        JSON[JSON Files<br>2.3GB publication records]
    end

    subgraph Phase1["Phase 1: Data Foundation"]
        DL[data_loader.py<br>Load & Clean]
        CE[content_enricher.py<br>Enrich Content]
    end

    subgraph Phase2["Phase 2: Concept Extraction"]
        OA[openalex_enricher.py<br>API Enrichment]
        LLM[concept_extractor.py<br>LLM Extraction]
    end

    subgraph Phase34["Phase 3+4: Graph Construction"]
        GB[graph_builder.py<br>Build Knowledge Graph]
    end

    subgraph Phase5["Phase 5: Output"]
        VIZ[visualize.py<br>Interactive HTML]
        IA[insight_analyzer.py<br>Analysis Reports]
    end

    CSV --> DL
    JSON --> DL
    DL --> CE
    CE --> OA
    OA --> LLM
    LLM --> GB
    GB --> VIZ
    GB --> IA

    style Input fill:#fff3e0
    style Phase1 fill:#e3f2fd
    style Phase2 fill:#f3e5f5
    style Phase34 fill:#e8f5e9
    style Phase5 fill:#fce4ec
```

## Data Flow

### Phase-by-Phase Data Transformation

| Phase | Input | Process | Output |
|-------|-------|---------|--------|
| 1 | CSV + JSONL | Load, decode abstracts, clean | laureates.parquet, awards.parquet, publications.parquet, institutions.parquet |
| 1.5 | publications.parquet | Semantic Scholar API, OA PDFs | Updated publications.parquet with enriched abstracts/full text |
| 2a | publications.parquet | OpenAlex API | Updated publications.parquet with concepts/topics/fields |
| 2b | publications.parquet + awards.parquet | GPT-4o-mini LLM | llm_extraction_raw.json, concepts.parquet |
| 3+4 | All Parquet + concepts | NetworkX graph construction | knowledge_graph.json, nodes.json, edges.json, .graphml |
| 5 | knowledge_graph.json | Pyvis + Plotly | network.html, timeline.html, heatmap.html |
| 6 | knowledge_graph.json | Graph analysis algorithms | insight_report.md, insight_report.json |

## Storage Formats

```mermaid
flowchart LR
    RAW[Raw Data] -->|CSV/JSON| PARQUET[Parquet Files]
    PARQUET --> KG[Knowledge Graph]
    KG -->|JSON| JSON_OUT[knowledge_graph.json]
    KG -->|GraphML| GRAPHML[knowledge_graph.graphml]
    KG -->|Separate| NODES[nodes.json + edges.json]
    KG -->|HTML| VIZ[Interactive Visualizations]
```

## Core Design Decisions

### 1. Tiered Concept Extraction Strategy

Not all 245K papers receive the same treatment. The system uses a three-tier approach to balance cost and quality:

| Tier | Papers | Method | Cost |
|------|--------|--------|------|
| Tier 1 | ~1,000 award-related + top-cited | LLM deep extraction | High |
| Tier 2 | ~30,000 highly cited | Keywords + OpenAlex concepts | Medium |
| Tier 3 | Remaining | Keyword-only coarse tagging | Low |
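The routing rule behind this table can be sketched as a small classifier over each publication record. The citation thresholds and field names below are assumptions for illustration, not the project's actual cutoffs:

```python
def assign_tier(paper: dict) -> int:
    """Route a paper to an extraction tier.

    Tier 1: LLM deep extraction (award-related or top-cited)
    Tier 2: keywords + OpenAlex concepts (highly cited)
    Tier 3: keyword-only coarse tagging (everything else)

    The 5,000 / 100 citation thresholds are placeholders for this sketch.
    """
    if paper.get("is_award_related") or paper.get("citations", 0) >= 5000:
        return 1
    if paper.get("citations", 0) >= 100:
        return 2
    return 3
```

Because the expensive path (Tier 1) is capped at roughly a thousand papers, LLM cost stays bounded even though the corpus holds 245K publications.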

### 2. Sample Mode for Development

The pipeline supports a sample mode that processes only 5 representative laureates. This enables rapid iteration without processing the full 2.3GB dataset.
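One way such a mode could work is a filter applied right after loading, keeping only the sample laureates and their publications. The laureate names and record shapes below are hypothetical stand-ins, not the project's actual sample set:

```python
# Hypothetical sample set; the real pipeline defines its own five laureates.
SAMPLE_LAUREATES = {"Marie Curie", "Albert Einstein", "Barbara McClintock",
                    "Richard Feynman", "Tu Youyou"}

def filter_sample(laureates, publications, sample=True):
    """Restrict pipeline inputs to the sample laureates and their papers."""
    if not sample:
        return laureates, publications
    kept = [l for l in laureates if l["name"] in SAMPLE_LAUREATES]
    kept_ids = {l["id"] for l in kept}
    kept_pubs = [p for p in publications if p["laureate_id"] in kept_ids]
    return kept, kept_pubs

laureates = [{"id": 1, "name": "Marie Curie"}, {"id": 2, "name": "Niels Bohr"}]
publications = [{"laureate_id": 1, "title": "Radium"},
                {"laureate_id": 2, "title": "Atomic structure"}]
small_l, small_p = filter_sample(laureates, publications)
```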

### 3. Aggressive Caching

All external API calls (OpenAlex, Semantic Scholar, Unpaywall) are cached to JSON files:

```
output/openalex_cache/                   # OpenAlex work data
output/openalex_cache/fulltext/          # Full text from various sources
output/openalex_cache/semantic_scholar/  # S2 API responses
output/openalex_cache/unpaywall/         # Unpaywall OA links
```
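The pattern behind these caches is fetch-through-file: check the cache directory first and call the API only on a miss, persisting the response as JSON. A minimal stdlib sketch; the `cached_call` helper and the stub fetcher are illustrative, not the modules' actual code:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_call(cache_dir: Path, key: str, fetch):
    """Return the cached JSON response for `key`, calling `fetch` only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.sha1(key.encode()).hexdigest() + ".json")
    if path.exists():
        return json.loads(path.read_text())
    result = fetch(key)
    path.write_text(json.dumps(result))
    return result

# Demo with a stub standing in for a real API call.
calls = []
def fake_fetch(key):
    calls.append(key)
    return {"id": key, "concepts": ["physics"]}

with tempfile.TemporaryDirectory() as d:
    first = cached_call(Path(d), "W2741809807", fake_fetch)
    second = cached_call(Path(d), "W2741809807", fake_fetch)  # served from disk
```

Because responses land on disk keyed by request, re-running the pipeline after an interruption skips every API call already made.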

### 4. Streaming Large Files

The 2.3GB publication_records.json is processed as line-delimited JSON (JSONL) with streaming reads, so the full file is never loaded into memory.
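The streaming read amounts to a generator over the file handle, decoding one record per line. A sketch of the pattern, with a tiny temp file standing in for publication_records.json:

```python
import json
import os
import tempfile

def iter_jsonl(path):
    """Yield one decoded record at a time; memory use stays constant
    regardless of file size, since only the current line is held."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Build a two-record stand-in for the real 2.3GB file.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for rec in [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]:
        f.write(json.dumps(rec) + "\n")

ids = [r["id"] for r in iter_jsonl(f.name)]
os.unlink(f.name)
```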

### 5. NetworkX as Primary Graph Engine

NetworkX was chosen as the graph engine for its:

  • Rich algorithm library (shortest paths, centrality, community detection)
  • Easy serialization to JSON and GraphML
  • No external database dependency

For production scaling, the schema is designed to be compatible with Neo4j.
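These benefits show up in just a few lines of NetworkX. The node IDs and attribute names below are illustrative, not the project's actual graph schema:

```python
import os
import tempfile

import networkx as nx

# Build a tiny graph with typed nodes, the same shape of operation
# graph_builder.py performs at scale.
G = nx.Graph()
G.add_node("laureate:curie", type="laureate", name="Marie Curie")
G.add_node("concept:radioactivity", type="concept")
G.add_edge("laureate:curie", "concept:radioactivity", relation="researched")

# Algorithms come for free (centrality, shortest paths, communities, ...).
central = nx.degree_centrality(G)

# JSON serialization via node-link format, and GraphML for external tools
# (or a later Neo4j import).
payload = nx.node_link_data(G)
fd, path = tempfile.mkstemp(suffix=".graphml")
os.close(fd)
nx.write_graphml(G, path)
os.unlink(path)
```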

## Module Dependency Graph

```mermaid
flowchart TD
    INIT[src/__init__.py<br>Config Loader] --> DL
    INIT --> CE
    INIT --> OA
    INIT --> CX
    INIT --> GB
    INIT --> VIZ
    INIT --> IA

    DL[data_loader.py] --> CE[content_enricher.py]
    CE --> OA[openalex_enricher.py]
    OA --> CX[concept_extractor.py]
    DL --> GB[graph_builder.py]
    CX --> GB
    GB --> VIZ[visualize.py]
    GB --> IA[insight_analyzer.py]

    MAIN[main.py<br>Pipeline Runner] -.->|orchestrates| DL
    MAIN -.-> CE
    MAIN -.-> OA
    MAIN -.-> CX
    MAIN -.-> GB
    MAIN -.-> VIZ
    MAIN -.-> IA
```

## Technology Stack

| Layer | Technology | Purpose |
|-------|------------|---------|
| Data Processing | Polars / Pandas | CSV/JSON loading, cleaning, transformation |
| NLP / Extraction | OpenAI API (GPT-4o-mini) | Technical concept extraction from abstracts |
| Data Enrichment | OpenAlex REST API | Paper concepts, topics, field classification |
| Content Enrichment | Semantic Scholar API, Unpaywall | Abstract and full-text retrieval |
| PDF Processing | PyMuPDF | PDF text extraction |
| Graph Engine | NetworkX | In-memory graph operations and algorithms |
| Visualization | Pyvis, Plotly | Interactive HTML visualizations |
| Serialization | JSON, GraphML, Parquet | Data persistence |
| Environment | uv | Python package and environment management |
| Configuration | YAML + dotenv | Settings and secrets management |