Pipeline Overview

The Nobel Prize Knowledge Graph is built by a 6-phase pipeline. Each phase handles one stage of the data transformation, and Phases 1.5, 2a, and 2b run as intermediate steps between Phase 1 and graph construction.

Pipeline Phases

flowchart LR
    P1[Phase 1<br>Data Loading] --> P15[Phase 1.5<br>Content Enrichment]
    P15 --> P2a[Phase 2a<br>OpenAlex Enrichment]
    P2a --> P2b[Phase 2b<br>LLM Extraction]
    P2b --> P34[Phase 3+4<br>Graph Construction]
    P34 --> P5[Phase 5<br>Visualization]
    P34 --> P6[Phase 6<br>Insight Analysis]
| Phase | Module | Time Est. | Description |
|-------|--------|-----------|-------------|
| 1 | data_loader.py | ~2 days | Load CSV/JSON, decode abstracts, clean and output Parquet |
| 1.5 | content_enricher.py | ~1 day | Enrich papers via Semantic Scholar, OA PDFs |
| 2a | openalex_enricher.py | ~1 day | Add concepts/topics/fields from OpenAlex API |
| 2b | concept_extractor.py | ~2 days | Extract structured concepts via GPT-4o-mini |
| 3+4 | graph_builder.py | ~2 days | Build NetworkX directed graph, export JSON/GraphML |
| 5 | visualize.py | ~1 day | Generate 3 interactive HTML visualizations |
| 6 | insight_analyzer.py | ~1 day | Analyze graph, generate Markdown insight report |
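The ordering in the flowchart can be expressed as a small dependency map. The following is a minimal sketch using Python's standard-library `graphlib`, not code from the project: the `PREDECESSORS` dictionary and the `stages` helper are hypothetical names, and the edges are copied from the diagram above.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each phase -> the phases it directly depends on.
# The edges mirror the flowchart; the names are illustrative only.
PREDECESSORS = {
    "1": set(),
    "1.5": {"1"},
    "2a": {"1.5"},
    "2b": {"2a"},
    "3+4": {"2b"},
    "5": {"3+4"},
    "6": {"3+4"},
}


def stages() -> list[list[str]]:
    """Group phases into stages; phases in the same stage do not depend on each other."""
    sorter = TopologicalSorter(PREDECESSORS)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every phase whose dependencies are done
        result.append(ready)
        sorter.done(*ready)
    return result


if __name__ == "__main__":
    for number, group in enumerate(stages(), start=1):
        print(f"stage {number}: {', '.join(group)}")
```

In this sketch, Phases 5 and 6 land in the same final stage because both depend only on Phase 3+4, so either can be rerun without the other.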

Running the Pipeline

Full Run

uv run python main.py

Phase Selection

uv run python main.py --phase 1      # Data loading only
uv run python main.py --phase 1.5    # Content enrichment only
uv run python main.py --phase 2      # Concept extraction (2a + 2b)
uv run python main.py --phase 3      # Graph construction
uv run python main.py --phase 5      # Visualization
uv run python main.py --phase 6      # Insight analysis

Skip Options

uv run python main.py --skip-llm     # Skip LLM extraction (Phase 2b)
uv run python main.py --skip-enrich  # Skip content enrichment (Phase 1.5)
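A command-line interface with these flags could be wired up roughly as follows. This is a sketch under stated assumptions rather than the actual `main.py`: the `PHASE_GROUPS` mapping, the `resolve_phases` helper, and the way the skip flags combine with `--phase` are all hypothetical.

```python
import argparse

# Phase identifiers in execution order, taken from the phase table.
PHASE_ORDER = ["1", "1.5", "2a", "2b", "3+4", "5", "6"]

# Hypothetical mapping from a --phase value to the phases it runs.
PHASE_GROUPS = {
    "1": ["1"],
    "1.5": ["1.5"],
    "2": ["2a", "2b"],  # --phase 2 covers both 2a and 2b
    "3": ["3+4"],       # --phase 3 covers the combined 3+4 step
    "5": ["5"],
    "6": ["6"],
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Pipeline runner (illustrative sketch)")
    parser.add_argument("--phase", choices=sorted(PHASE_GROUPS), default=None)
    parser.add_argument("--skip-llm", action="store_true")
    parser.add_argument("--skip-enrich", action="store_true")
    return parser


def resolve_phases(argv: list[str]) -> list[str]:
    """Return the phases to run for the given arguments, in execution order."""
    args = build_parser().parse_args(argv)
    selected = PHASE_GROUPS[args.phase] if args.phase else list(PHASE_ORDER)
    skipped = set()
    if args.skip_llm:
        skipped.add("2b")   # assumption: the flag removes only Phase 2b
    if args.skip_enrich:
        skipped.add("1.5")  # assumption: the flag removes only Phase 1.5
    return [phase for phase in selected if phase not in skipped]


if __name__ == "__main__":
    print(resolve_phases(["--phase", "2", "--skip-llm"]))
```

Under these assumptions, a full run with `--skip-llm` would execute every phase except 2b, and `--phase 2 --skip-llm` would run only Phase 2a.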

Data Dependencies

Each phase depends on outputs from previous phases:

flowchart TD
    CSV[CSV Files] --> P1
    JSONL[publication_records.json] --> P1

    P1[Phase 1] -->|laureates.parquet<br>awards.parquet<br>publications.parquet<br>institutions.parquet| P15[Phase 1.5]

    P15 -->|publications.parquet<br>enriched| P2a[Phase 2a]

    P2a -->|publications.parquet<br>with concepts| P2b[Phase 2b]

    P2b -->|concepts.parquet<br>llm_extraction_raw.json| P34[Phase 3+4]
    P1 -->|All Parquet files| P34

    P34 -->|knowledge_graph.json| P5[Phase 5]
    P34 -->|knowledge_graph.json| P6[Phase 6]
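Because a single phase can be run on its own, it can help to check that its inputs exist first. The sketch below is one possible pre-flight check, not part of the project: `REQUIRED_INPUTS` and `missing_inputs` are hypothetical names, the file lists follow the diagram above, and the directories are assumed from the Output Summary. How `--skip-llm` affects the Phase 3+4 inputs is not documented here, so the sketch lists both Phase 2b outputs as drawn.

```python
from pathlib import Path

# Assumed output directories (taken from the Output Summary table).
CLEAN = Path("output/clean_data")
CONCEPTS = Path("output/concepts")
GRAPH = Path("output/graph")

# Hypothetical map: phase -> files it reads, following the dependency diagram.
REQUIRED_INPUTS = {
    "1.5": [CLEAN / "publications.parquet"],
    "2a": [CLEAN / "publications.parquet"],
    "2b": [CLEAN / "publications.parquet"],
    "3+4": [
        CLEAN / "laureates.parquet",
        CLEAN / "awards.parquet",
        CLEAN / "publications.parquet",
        CLEAN / "institutions.parquet",
        CONCEPTS / "concepts.parquet",
        CONCEPTS / "llm_extraction_raw.json",
    ],
    "5": [GRAPH / "knowledge_graph.json"],
    "6": [GRAPH / "knowledge_graph.json"],
}


def missing_inputs(phase: str, root: Path = Path(".")) -> list[Path]:
    """Return the required input files for `phase` that do not exist under `root`."""
    return [path for path in REQUIRED_INPUTS.get(phase, []) if not (root / path).exists()]


if __name__ == "__main__":
    for phase in REQUIRED_INPUTS:
        missing = missing_inputs(phase)
        status = "ready" if not missing else "missing " + ", ".join(str(p) for p in missing)
        print(f"Phase {phase}: {status}")
```

Phase 1 has no entry because its inputs are the raw CSV and JSON files, whose locations are not specified in this section.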

Output Summary

| Phase | Output Files | Location |
|-------|--------------|----------|
| 1 | laureates.parquet, awards.parquet, publications.parquet, institutions.parquet | output/clean_data/ |
| 1.5 | Updated publications.parquet | output/clean_data/ |
| 2a | Updated publications.parquet | output/clean_data/ |
| 2b | llm_extraction_raw.json, concepts.parquet | output/concepts/ |
| 3+4 | knowledge_graph.json, nodes.json, edges.json, knowledge_graph.graphml | output/graph/ |
| 5 | network.html, timeline.html, heatmap.html | output/viz/ |
| 6 | insight_report.md, insight_report.json | output/reports/ |