# Architecture

## System Overview

The Nobel Prize Knowledge Graph is built as a multi-phase data pipeline that transforms raw Nobel Prize and publication data into an interactive, queryable knowledge graph.

```mermaid
flowchart TB
    subgraph Input["Input Data"]
        CSV[CSV Tables<br>757 laureates, 245K papers]
        JSON[JSON Files<br>2.3GB publication records]
    end

    subgraph Phase1["Phase 1: Data Foundation"]
        DL[data_loader.py<br>Load & Clean]
        CE[content_enricher.py<br>Enrich Content]
    end

    subgraph Phase2["Phase 2: Concept Extraction"]
        OA[openalex_enricher.py<br>API Enrichment]
        LLM[concept_extractor.py<br>LLM Extraction]
    end

    subgraph Phase34["Phase 3+4: Graph Construction"]
        GB[graph_builder.py<br>Build Knowledge Graph]
    end

    subgraph Phase5["Phase 5: Output"]
        VIZ[visualize.py<br>Interactive HTML]
        IA[insight_analyzer.py<br>Analysis Reports]
    end

    CSV --> DL
    JSON --> DL
    DL --> CE
    CE --> OA
    OA --> LLM
    LLM --> GB
    GB --> VIZ
    GB --> IA

    style Input fill:#fff3e0
    style Phase1 fill:#e3f2fd
    style Phase2 fill:#f3e5f5
    style Phase34 fill:#e8f5e9
    style Phase5 fill:#fce4ec
```

## Data Flow

### Phase-by-Phase Data Transformation

| Phase | Input | Process | Output |
|-------|-------|---------|--------|
| 1 | CSV + JSONL | Load, decode abstracts, clean | laureates.parquet, awards.parquet, publications.parquet, institutions.parquet |
| 1.5 | publications.parquet | Semantic Scholar API, OA PDFs | Updated publications.parquet with enriched abstracts/full text |
| 2a | publications.parquet | OpenAlex API | Updated publications.parquet with concepts/topics/fields |
| 2b | publications.parquet + awards.parquet | GPT-4o-mini LLM | llm_extraction_raw.json, concepts.parquet |
| 3+4 | All Parquet + concepts | NetworkX graph construction | knowledge_graph.json, nodes.json, edges.json, .graphml |
| 5 | knowledge_graph.json | Pyvis + Plotly | network.html, timeline.html, heatmap.html |
| 6 | knowledge_graph.json | Graph analysis algorithms | insight_report.md, insight_report.json |

## Storage Formats

```mermaid
flowchart LR
    RAW[Raw Data] -->|CSV/JSON| PARQUET[Parquet Files]
    PARQUET --> KG[Knowledge Graph]
    KG -->|JSON| JSON_OUT[knowledge_graph.json]
    KG -->|GraphML| GRAPHML[knowledge_graph.graphml]
    KG -->|Separate| NODES[nodes.json + edges.json]
    KG -->|HTML| VIZ[Interactive Visualizations]
```

## Core Design Decisions

### 1. Tiered Concept Extraction Strategy

Not all 245K papers receive the same treatment. The system uses a three-tier approach to balance cost and quality:

| Tier | Papers | Method | Cost |
|------|--------|--------|------|
| Tier 1 | ~1,000 award-related + top-cited | LLM deep extraction | High |
| Tier 2 | ~30,000 highly cited | Keywords + OpenAlex concepts | Medium |
| Tier 3 | Remaining | Keyword-only coarse tagging | Low |
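The routing rule behind this table can be sketched as a small classifier over each publication record. The citation thresholds and field names below are assumptions for illustration, not the project's actual cutoffs:

```python
def assign_tier(paper: dict) -> int:
    """Route a paper to an extraction tier.

    Tier 1: LLM deep extraction (award-related or top-cited)
    Tier 2: keywords + OpenAlex concepts (highly cited)
    Tier 3: keyword-only coarse tagging (everything else)

    The 5,000 / 100 citation thresholds are placeholders for this sketch.
    """
    if paper.get("is_award_related") or paper.get("citations", 0) >= 5000:
        return 1
    if paper.get("citations", 0) >= 100:
        return 2
    return 3
```

Because the expensive path (Tier 1) is capped at roughly a thousand papers, LLM cost stays bounded even though the corpus holds 245K publications.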

### 2. Sample Mode for Development

The pipeline supports a sample mode that processes only 5 representative laureates. This enables rapid iteration without processing the full 2.3GB dataset.
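One way such a mode could work is a filter applied right after loading, keeping only the sample laureates and their publications. The laureate names and record shapes below are hypothetical stand-ins, not the project's actual sample set:

```python
# Hypothetical sample set; the real pipeline defines its own five laureates.
SAMPLE_LAUREATES = {"Marie Curie", "Albert Einstein", "Barbara McClintock",
                    "Richard Feynman", "Tu Youyou"}

def filter_sample(laureates, publications, sample=True):
    """Restrict pipeline inputs to the sample laureates and their papers."""
    if not sample:
        return laureates, publications
    kept = [l for l in laureates if l["name"] in SAMPLE_LAUREATES]
    kept_ids = {l["id"] for l in kept}
    kept_pubs = [p for p in publications if p["laureate_id"] in kept_ids]
    return kept, kept_pubs

laureates = [{"id": 1, "name": "Marie Curie"}, {"id": 2, "name": "Niels Bohr"}]
publications = [{"laureate_id": 1, "title": "Radium"},
                {"laureate_id": 2, "title": "Atomic structure"}]
small_l, small_p = filter_sample(laureates, publications)
```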

### 3. Aggressive Caching

All external API calls (OpenAlex, Semantic Scholar, Unpaywall) are cached to JSON files:

```
output/openalex_cache/                   # OpenAlex work data
output/openalex_cache/fulltext/          # Full text from various sources
output/openalex_cache/semantic_scholar/  # S2 API responses
output/openalex_cache/unpaywall/         # Unpaywall OA links
```
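The pattern behind these caches is fetch-through-file: check the cache directory first and call the API only on a miss, persisting the response as JSON. A minimal stdlib sketch; the `cached_call` helper and the stub fetcher are illustrative, not the modules' actual code:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_call(cache_dir: Path, key: str, fetch):
    """Return the cached JSON response for `key`, calling `fetch` only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.sha1(key.encode()).hexdigest() + ".json")
    if path.exists():
        return json.loads(path.read_text())
    result = fetch(key)
    path.write_text(json.dumps(result))
    return result

# Demo with a stub standing in for a real API call.
calls = []
def fake_fetch(key):
    calls.append(key)
    return {"id": key, "concepts": ["physics"]}

with tempfile.TemporaryDirectory() as d:
    first = cached_call(Path(d), "W2741809807", fake_fetch)
    second = cached_call(Path(d), "W2741809807", fake_fetch)  # served from disk
```

Because responses land on disk keyed by request, re-running the pipeline after an interruption skips every API call already made.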

### 4. Streaming Large Files

The 2.3GB publication_records.json is processed as line-delimited JSON (JSONL) with streaming reads, so the full file is never loaded into memory.
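The streaming read amounts to a generator over the file handle, decoding one record per line. A sketch of the pattern, with a tiny temp file standing in for publication_records.json:

```python
import json
import os
import tempfile

def iter_jsonl(path):
    """Yield one decoded record at a time; memory use stays constant
    regardless of file size, since only the current line is held."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Build a two-record stand-in for the real 2.3GB file.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for rec in [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]:
        f.write(json.dumps(rec) + "\n")

ids = [r["id"] for r in iter_jsonl(f.name)]
os.unlink(f.name)
```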

### 5. NetworkX as Primary Graph Engine

NetworkX was chosen as the graph engine for its:

  • Rich algorithm library (shortest paths, centrality, community detection)
  • Easy serialization to JSON and GraphML
  • No external database dependency

For production scaling, the schema is designed to be compatible with Neo4j.
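These benefits show up in just a few lines of NetworkX. The node IDs and attribute names below are illustrative, not the project's actual graph schema:

```python
import os
import tempfile

import networkx as nx

# Build a tiny graph with typed nodes, the same shape of operation
# graph_builder.py performs at scale.
G = nx.Graph()
G.add_node("laureate:curie", type="laureate", name="Marie Curie")
G.add_node("concept:radioactivity", type="concept")
G.add_edge("laureate:curie", "concept:radioactivity", relation="researched")

# Algorithms come for free (centrality, shortest paths, communities, ...).
central = nx.degree_centrality(G)

# JSON serialization via node-link format, and GraphML for external tools
# (or a later Neo4j import).
payload = nx.node_link_data(G)
fd, path = tempfile.mkstemp(suffix=".graphml")
os.close(fd)
nx.write_graphml(G, path)
os.unlink(path)
```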

## Module Dependency Graph

```mermaid
flowchart TD
    INIT[src/__init__.py<br>Config Loader] --> DL
    INIT --> CE
    INIT --> OA
    INIT --> CX
    INIT --> GB
    INIT --> VIZ
    INIT --> IA

    DL[data_loader.py] --> CE[content_enricher.py]
    CE --> OA[openalex_enricher.py]
    OA --> CX[concept_extractor.py]
    DL --> GB[graph_builder.py]
    CX --> GB
    GB --> VIZ[visualize.py]
    GB --> IA[insight_analyzer.py]

    MAIN[main.py<br>Pipeline Runner] -.->|orchestrates| DL
    MAIN -.-> CE
    MAIN -.-> OA
    MAIN -.-> CX
    MAIN -.-> GB
    MAIN -.-> VIZ
    MAIN -.-> IA
```

## Technology Stack

| Layer | Technology | Purpose |
|-------|------------|---------|
| Data Processing | Polars / Pandas | CSV/JSON loading, cleaning, transformation |
| NLP / Extraction | OpenAI API (GPT-4o-mini) | Technical concept extraction from abstracts |
| Data Enrichment | OpenAlex REST API | Paper concepts, topics, field classification |
| Content Enrichment | Semantic Scholar API, Unpaywall | Abstract and full-text retrieval |
| PDF Processing | PyMuPDF | PDF text extraction |
| Graph Engine | NetworkX | In-memory graph operations and algorithms |
| Visualization | Pyvis, Plotly | Interactive HTML visualizations |
| Serialization | JSON, GraphML, Parquet | Data persistence |
| Environment | uv | Python package and environment management |
| Configuration | YAML + dotenv | Settings and secrets management |