`src/graph_builder`¶

Phase 3+4 module for constructing the knowledge graph.

Functions¶

`build_graph(cfg: dict) → nx.DiGraph`¶

Build the complete knowledge graph from Parquet data files.

Parameters:

Name	Type	Description
`cfg`	`dict`	Project configuration dict

Returns: NetworkX directed graph containing all node types and edge types.

Process:

1. Load all Parquet files (laureates, awards, publications, institutions, concepts)
2. Add Laureate nodes with attributes
3. Add Award nodes + `WON_AWARD` edges
4. Add Work nodes + `AUTHORED` edges
5. Add Concept nodes + `INTRODUCES`/`APPLIES` edges
6. Add Field nodes + `BELONGS_TO` edges
7. Detect and build `CROSS_INSPIRED` edges
8. Build `CITES` edges from referenced works
9. Build `AWARDED_FOR` edges linking awards to concepts
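The first few steps above can be sketched as follows. This is a minimal illustration, not the module's implementation: the in-memory rows stand in for the Parquet tables, and the column names, the `work_` ID prefix, the `AUTHORED` edge direction, and the helper `build_toy_graph` are all assumptions made for the example.

```python
import networkx as nx

# Hypothetical rows standing in for the Parquet tables (column names are assumed).
laureates = [{"id": 779, "name": "Example Laureate"}]
awards = [{"id": "a1", "laureate_id": 779, "year": 2001}]
works = [
    {"id": "w1", "laureate_id": 779, "title": "Example Work", "references": ["w2"]},
    {"id": "w2", "laureate_id": 779, "title": "Earlier Work", "references": []},
]


def build_toy_graph(laureates, awards, works):
    """Add typed nodes and typed edges to a directed graph (illustrative only)."""
    G = nx.DiGraph()

    # Laureate nodes carry a "type" attribute plus their descriptive attributes.
    for row in laureates:
        G.add_node(f"laureate_{row['id']}", type="Laureate", name=row["name"])

    # Award nodes and laureate -> award WON_AWARD edges.
    for row in awards:
        award_id = f"award_{row['id']}"
        G.add_node(award_id, type="Award", year=row["year"])
        G.add_edge(f"laureate_{row['laureate_id']}", award_id, type="WON_AWARD")

    # Work nodes and laureate -> work AUTHORED edges (direction is an assumption).
    for row in works:
        work_id = f"work_{row['id']}"
        G.add_node(work_id, type="Work", title=row["title"])
        G.add_edge(f"laureate_{row['laureate_id']}", work_id, type="AUTHORED")

    # CITES edges in a second pass, so that every Work node already exists.
    known = {f"work_{row['id']}" for row in works}
    for row in works:
        for ref in row["references"]:
            if f"work_{ref}" in known:
                G.add_edge(f"work_{row['id']}", f"work_{ref}", type="CITES")
    return G


G = build_toy_graph(laureates, awards, works)
print(G.number_of_nodes(), G.number_of_edges())
```

Storing the node and edge kind in a `type` attribute matches the JSON structure shown for `graph_to_json`, where every node and edge record carries a `type` field.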

`build_simplified_concept_graph(cfg: dict) → nx.DiGraph`¶

Build a simplified concept-only graph.

Features:

- Deduplicates concepts across papers
- Stores node-level `citation_by_year` dictionaries
- Uses `CONCEPT_CITES` edges with year-wise citation dictionaries

Parameters:

Name	Type	Description
`cfg`	`dict`	Project configuration dict

Returns: `nx.DiGraph` containing only concept nodes.
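One way to picture the three features is the sketch below. It is an assumption-laden illustration rather than the module's logic: the paper records, the case-insensitive deduplication key, the rule for summing citation counts onto a concept, and the helper names `concept_key` and `build_toy_concept_graph` are all hypothetical.

```python
import networkx as nx

# Hypothetical paper records; field names are assumptions for this sketch.
papers = [
    {"id": "w1", "year": 2001, "concepts": ["Graph Theory"],
     "cited_by_year": {2002: 3, 2003: 1}, "references": []},
    {"id": "w2", "year": 2003, "concepts": ["graph theory", "Random Walks"],
     "cited_by_year": {2004: 2}, "references": ["w1"]},
]


def concept_key(name):
    """Normalize a concept name so spelling variants map to one node (assumed rule)."""
    return " ".join(name.lower().split())


def build_toy_concept_graph(papers):
    G = nx.DiGraph()
    by_id = {p["id"]: p for p in papers}

    # One node per distinct concept; sum each paper's yearly citations onto it.
    for p in papers:
        for name in p["concepts"]:
            key = concept_key(name)
            if key not in G:
                G.add_node(key, type="Concept", name=name, citation_by_year={})
            totals = G.nodes[key]["citation_by_year"]
            for year, count in p["cited_by_year"].items():
                totals[year] = totals.get(year, 0) + count

    # CONCEPT_CITES edge: a concept of the citing paper points at a concept of
    # the cited paper, counted under the citing paper's publication year.
    for p in papers:
        for ref in p["references"]:
            cited = by_id.get(ref)
            if cited is None:
                continue
            for src in map(concept_key, p["concepts"]):
                for dst in map(concept_key, cited["concepts"]):
                    if src == dst:
                        continue  # skip self-loops in this sketch
                    if not G.has_edge(src, dst):
                        G.add_edge(src, dst, type="CONCEPT_CITES", citation_by_year={})
                    per_year = G.edges[src, dst]["citation_by_year"]
                    per_year[p["year"]] = per_year.get(p["year"], 0) + 1
    return G


concept_graph = build_toy_concept_graph(papers)
```

Here "Graph Theory" and "graph theory" collapse into a single node, and both the node and the edge keep a dictionary keyed by year instead of a single citation total.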

`graph_to_json(G: nx.DiGraph) → dict`¶

Convert a NetworkX graph to a JSON-serializable dictionary.

Parameters:

Name	Type	Description
`G`	`nx.DiGraph`	NetworkX directed graph

Returns: Dict with structure:

{
    "nodes": [
        {"id": "laureate_779", "type": "Laureate", "name": "...", ...},
        ...
    ],
    "edges": [
        {"source": "laureate_779", "target": "award_...", "type": "WON_AWARD", ...},
        ...
    ]
}
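A conversion that yields this shape can be sketched in a few lines. The function name `graph_to_json_sketch` and the sample IDs are placeholders, and the real function may do additional work (for example, coercing attribute values that `json` cannot serialize).

```python
import json

import networkx as nx


def graph_to_json_sketch(G):
    """Flatten a graph into {"nodes": [...], "edges": [...]} (illustrative only)."""
    # Each node record is its ID plus its attribute dict.
    nodes = [{"id": node, **attrs} for node, attrs in G.nodes(data=True)]
    # Each edge record is its endpoints plus its attribute dict.
    edges = [
        {"source": u, "target": v, **attrs} for u, v, attrs in G.edges(data=True)
    ]
    return {"nodes": nodes, "edges": edges}


G = nx.DiGraph()
G.add_node("laureate_779", type="Laureate", name="Example Laureate")
G.add_node("award_example", type="Award")
G.add_edge("laureate_779", "award_example", type="WON_AWARD")

payload = graph_to_json_sketch(G)
text = json.dumps(payload, indent=2)  # ready to write to a .json file
```

Note that an attribute named `id`, `source`, or `target` would overwrite the structural key in this sketch, so such names are best avoided on nodes and edges.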

`print_graph_stats(G: nx.DiGraph) → None`¶

Print a summary of graph statistics:

- Total nodes and edges
- Breakdown by node type
- Breakdown by edge type
- Number of connected components
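The same figures can be computed as in the sketch below. Because `nx.number_connected_components` is not defined for directed graphs, the example counts weakly connected components. Whether the module uses that notion is an assumption, as are the helper name `summarize_graph` and the `type` attribute key.

```python
from collections import Counter

import networkx as nx


def summarize_graph(G):
    """Collect the summary figures into a dict (illustrative only)."""
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        # Count nodes and edges by their "type" attribute.
        "node_types": Counter(a.get("type", "unknown") for _, a in G.nodes(data=True)),
        "edge_types": Counter(a.get("type", "unknown") for _, _, a in G.edges(data=True)),
        # Weak connectivity ignores edge direction (assumed interpretation).
        "components": nx.number_weakly_connected_components(G),
    }


G = nx.DiGraph()
G.add_node("laureate_779", type="Laureate")
G.add_node("award_example", type="Award")
G.add_node("concept_example", type="Concept")  # isolated node, its own component
G.add_edge("laureate_779", "award_example", type="WON_AWARD")

stats = summarize_graph(G)
for key, value in stats.items():
    print(f"{key}: {value}")
```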

`run() → None`¶

Execute the full Phase 3+4 pipeline:

1. Load configuration
2. Build graph via `build_graph()`
3. Print statistics
4. Export to `knowledge_graph.json`
5. Export node/edge files separately
6. Export GraphML format
7. Build and export simplified concept graph (JSON + GraphML)

GraphML Compatibility Note¶

GraphML does not support nested values such as dictionaries/lists. Before GraphML export, complex attributes are serialized into JSON strings, so fields like `citation_by_year` can be stored safely in `.graphml` files.
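A minimal sketch of that serialization step follows. The helper name `flatten_for_graphml` and the sample node are hypothetical; the sketch only shows the general technique of replacing non-scalar attribute values with JSON strings on a copy of the graph before handing it to NetworkX's GraphML writer.

```python
import json

import networkx as nx

# Value types assumed safe to hand to the GraphML writer as-is.
SCALARS = (str, int, float, bool)


def _flatten(attrs):
    """Replace nested values (dicts, lists, None, ...) with JSON strings in place."""
    for key, value in list(attrs.items()):
        if not isinstance(value, SCALARS):
            attrs[key] = json.dumps(value)


def flatten_for_graphml(G):
    """Return a copy of G whose node and edge attributes are all scalars."""
    H = G.copy()  # attribute dicts are copied, so G itself is left untouched
    for _, attrs in H.nodes(data=True):
        _flatten(attrs)
    for _, _, attrs in H.edges(data=True):
        _flatten(attrs)
    return H


G = nx.DiGraph()
G.add_node("concept_1", type="Concept", citation_by_year={2001: 3, 2002: 5})

graphml_text = "\n".join(nx.generate_graphml(flatten_for_graphml(G)))

# Reading back: the attribute is a string and must be decoded again.
restored = nx.parse_graphml(graphml_text)
decoded = json.loads(restored.nodes["concept_1"]["citation_by_year"])
```

One consequence of this approach is that JSON object keys are always strings, so integer years such as `2001` come back as `"2001"` after decoding and may need converting.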
