Getting Started

This guide walks you through setting up the Nobel Prize Knowledge Graph project and running the pipeline.

Prerequisites

| Requirement | Version | Purpose |
| --- | --- | --- |
| Python | >= 3.12 | Runtime |
| uv | Latest | Package management |
| OpenAI API Key | n/a | LLM concept extraction (Phase 2b) |

Installation

1. Clone the Repository

git clone https://github.com/example/nobel-kg.git
cd nobel-kg

2. Install Dependencies

uv add polars pandas pyarrow networkx pyvis plotly requests openai pyyaml python-dotenv tqdm scikit-learn pymupdf

3. Configure Environment

Create a .env file in the project root:

OPENAI_API_KEY=sk-your-api-key-here
# Optional: override to use a custom endpoint
OPENAI_BASE_URL=https://api.openai.com/v1
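The pipeline presumably loads these variables with python-dotenv. As a rough illustration of the expected file format, here is a minimal stdlib-only reader (a stand-in for the project's actual loading code, not a copy of it; it handles only full-line comments, not quoting):

```python
import os

def load_env(path: str = ".env") -> dict[str, str]:
    """Minimal .env reader: KEY=VALUE lines; blank lines and
    full-line '#' comments are ignored. No quote handling."""
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values
```

In practice you would merge the result into `os.environ` (or just use `dotenv.load_dotenv()`), so that `os.environ["OPENAI_API_KEY"]` is available to the OpenAI client.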

4. Verify Data

Ensure the data directory contains the required files:

ls data/26963326/db_data/
# Expected: author.csv, award_info.csv, institution.csv, laureate.csv, ...

ls data/26963326/json/
# Expected: award_details.json, laureate_info.json, publication_records.json
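If you prefer to script this check, a small helper can report what is absent (the file lists mirror the expected names above; the helper itself is hypothetical, not part of the pipeline):

```python
from pathlib import Path

# Required data files, as listed in the verification step above.
REQUIRED_CSVS = ["author.csv", "award_info.csv", "institution.csv", "laureate.csv"]
REQUIRED_JSON = ["award_details.json", "laureate_info.json", "publication_records.json"]

def missing_files(base: str) -> list[str]:
    """Return paths of required data files absent under `base`
    (e.g. data/26963326)."""
    root = Path(base)
    expected = [root / "db_data" / n for n in REQUIRED_CSVS] + \
               [root / "json" / n for n in REQUIRED_JSON]
    return [str(p) for p in expected if not p.exists()]
```

An empty return value means the data directory is complete.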

Running the Pipeline

Full Pipeline

uv run python main.py

This runs all pipeline phases sequentially:

  1. Phase 1 — Data Loading & Cleaning
  2. Phase 1.5 — Content Enrichment (Semantic Scholar + OA PDFs)
  3. Phase 2a — OpenAlex Concept Enrichment
  4. Phase 2b — LLM Concept Extraction
  5. Phase 3+4 — Knowledge Graph Construction
  6. Phase 5 — Visualization
  7. Phase 6 — Insight Analysis

Run Specific Phases

# Run only data loading
uv run python main.py --phase 1

# Run only graph construction
uv run python main.py --phase 3

# Run only visualization
uv run python main.py --phase 5

Skip Optional Steps

# Skip LLM extraction (saves API costs)
uv run python main.py --skip-llm

# Skip content enrichment
uv run python main.py --skip-enrich
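The flags above suggest a CLI along these lines (a hypothetical argparse sketch; main.py's actual parser may define these options differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the command-line interface implied by the examples above."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--phase",
                        help="run a single phase (e.g. 1, 3, 5)")
    parser.add_argument("--skip-llm", action="store_true",
                        help="skip Phase 2b LLM extraction (saves API costs)")
    parser.add_argument("--skip-enrich", action="store_true",
                        help="skip Phase 1.5 content enrichment")
    return parser
```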

Run Individual Modules

Each module can also be run standalone:

uv run python -m src.data_loader
uv run python -m src.openalex_enricher
uv run python -m src.concept_extractor
uv run python -m src.graph_builder
uv run python -m src.visualize
uv run python -m src.insight_analyzer
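Running with `python -m` works because each module presumably guards its own entry point. The standard pattern looks like this (a sketch only, not the actual contents of any src/ module):

```python
def main() -> int:
    # ... module-specific work (load data, enrich, build graph, etc.) ...
    return 0  # exit status: 0 on success

if __name__ == "__main__":
    # Executed only when run via `python -m src.<module>` or as a script,
    # never on plain import.
    main()
```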

Viewing Outputs

Visualizations

Start a local HTTP server to view the interactive HTML visualizations:

python3 -m http.server 8765 --directory output/viz

Then open in your browser:

Reports

The insight report is generated as Markdown:

cat output/reports/insight_report.md

Knowledge Graph

The graph data is stored in multiple formats:

# Full graph (JSON)
output/graph/knowledge_graph.json

# Separate node/edge files
output/graph/nodes.json
output/graph/edges.json

# GraphML format (for Gephi, Cytoscape, etc.)
output/graph/knowledge_graph.graphml

# Simplified concept-only graph
output/graph/concept_graph_simplified.json
output/graph/concept_graph_simplified_nodes.json
output/graph/concept_graph_simplified_edges.json
output/graph/concept_graph_simplified.graphml
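A minimal way to load and sanity-check the separate node/edge files (the `id`/`source`/`target` field names here are assumptions for illustration; inspect output/graph/nodes.json to confirm the real schema):

```python
import json

def load_graph(nodes_path: str, edges_path: str):
    """Load node and edge lists and flag edges whose endpoints
    reference unknown node IDs."""
    with open(nodes_path) as f:
        nodes = json.load(f)
    with open(edges_path) as f:
        edges = json.load(f)
    index = {n["id"]: n for n in nodes}
    # Integrity check: every edge endpoint should be a known node.
    dangling = [e for e in edges
                if e["source"] not in index or e["target"] not in index]
    return index, edges, dangling
```

For the GraphML files, `networkx.read_graphml(...)` (already a project dependency) is the usual route.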

Sample Mode

By default, the pipeline runs in sample mode using 5 representative laureates across 3 fields:

| ID | Name | Field | Year |
| --- | --- | --- | --- |
| 745 | A. Michael Spence | Economics | 2001 |
| 102 | Aage N. Bohr | Physics | 1975 |
| 779 | Aaron Ciechanover | Chemistry | 2004 |
| 114 | Abdus Salam | Physics | 1979 |
| 843 | Ada E. Yonath | Chemistry | 2009 |

This produces a manageable graph (~97 nodes, ~181 edges) suitable for development and testing.
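Sample-mode filtering might be sketched like this (the laureate IDs come from the table above, but the CSV column name is an assumption; check data/26963326/db_data/laureate.csv for the real header):

```python
import csv
import io

# IDs of the five sample laureates listed above.
SAMPLE_IDS = {"745", "102", "779", "114", "843"}

def sample_rows(csv_text: str, id_column: str = "laureate_id") -> list[dict]:
    """Keep only rows belonging to the sample-mode laureates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[id_column] in SAMPLE_IDS]
```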

Troubleshooting

Common Issues

Missing API Key

If you see OpenAI API key not found, ensure your .env file contains a valid OPENAI_API_KEY.

Large JSON File

The publication_records.json file is 2.3 GB. The pipeline uses streaming/sample loading to keep memory usage reasonable.
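One stdlib-only way to iterate a huge top-level JSON array incrementally (a sketch of the general technique, not necessarily the pipeline's actual strategy):

```python
import json

def iter_records(fp, chunk_size=65536):
    """Yield objects one at a time from a file containing a top-level
    JSON array, without materializing the whole array in memory."""
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        buf = buf.lstrip().lstrip(",").lstrip()
        if buf.startswith("]"):
            return                      # end of array
        if not buf:
            more = fp.read(chunk_size)  # chunk boundary fell between objects
            if not more:
                return                  # truncated input: stop quietly
            buf = more
            continue
        try:
            obj, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            more = fp.read(chunk_size)  # object split across chunks
            if not more:
                return
            buf += more
            continue
        yield obj
        buf = buf[end:]
```

Libraries such as ijson offer the same idea with less code, if adding a dependency is acceptable.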

Rate Limits

OpenAlex API and Semantic Scholar API have rate limits. The pipeline includes automatic rate limiting and caching to handle this gracefully.
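A simple in-memory version of the throttle-plus-cache idea (the pipeline's real mechanism, e.g. on-disk caching, may differ):

```python
import time
from functools import wraps

def throttled_cached(min_interval: float):
    """Decorator sketch: memoize results per-argument and space out
    real calls by at least `min_interval` seconds."""
    def decorate(fn):
        cache = {}
        last_call = [0.0]
        @wraps(fn)
        def wrapper(*args):
            if args in cache:
                return cache[args]      # cache hit: no API call, no wait
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)        # respect the rate limit
            last_call[0] = time.monotonic()
            cache[args] = fn(*args)
            return cache[args]
        return wrapper
    return decorate
```

Wrapping the API-fetch functions this way means repeated lookups for the same work or author never hit the network twice.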

Next Steps

  • Read the Architecture guide to understand the system design
  • Explore the Pipeline documentation for each phase
  • Check the Configuration reference for customization options