src/graph_builder

Phase 3+4 module for constructing the knowledge graph.

Functions

build_graph(cfg: dict) → nx.DiGraph

Build the complete knowledge graph from Parquet data files.

Parameters:

Name | Type | Description
cfg | dict | Project configuration dict

Returns: NetworkX directed graph containing all node types and edge types.

Process:

  1. Load all Parquet files (laureates, awards, publications, institutions, concepts)
  2. Add Laureate nodes with attributes
  3. Add Award nodes + WON_AWARD edges
  4. Add Work nodes + AUTHORED edges
  5. Add Concept nodes + INTRODUCES/APPLIES edges
  6. Add Field nodes + BELONGS_TO edges
  7. Detect and build CROSS_INSPIRED edges
  8. Build CITES edges from referenced works
  9. Build AWARDED_FOR edges linking awards to concepts
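Steps 2 and 3 above could look roughly like the sketch below. It is not the module's actual implementation: it reads in-memory records instead of Parquet files, and the helper name `add_laureates_and_awards`, the record fields, and the `award_<id>` ID scheme are assumptions made for illustration.

```python
import networkx as nx


def add_laureates_and_awards(G, laureates, awards):
    """Hypothetical helper: add Laureate and Award nodes plus WON_AWARD edges."""
    for row in laureates:
        # Node type is stored as a plain attribute so it can be filtered on later.
        G.add_node(f"laureate_{row['id']}", type="Laureate", name=row["name"])
    for row in awards:
        award_id = f"award_{row['id']}"  # assumed ID scheme, not confirmed by the docs
        G.add_node(award_id, type="Award", year=row["year"])
        G.add_edge(f"laureate_{row['laureate_id']}", award_id, type="WON_AWARD")
    return G


# Placeholder records standing in for rows loaded from the Parquet files.
laureates = [{"id": 779, "name": "Example Laureate"}]
awards = [{"id": 1, "laureate_id": 779, "year": 2000}]

G = add_laureates_and_awards(nx.DiGraph(), laureates, awards)
```

Keeping the node and edge type in a `type` attribute matches the JSON structure shown for `graph_to_json`, where every node and edge carries a `"type"` field.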

build_simplified_concept_graph(cfg: dict) → nx.DiGraph

Build a simplified concept-only graph.

Features:

  • Deduplicates concepts across papers
  • Stores node-level citation_by_year dictionaries
  • Uses CONCEPT_CITES edges with year-wise citation dictionaries
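The three features above can be illustrated with a small sketch. The helper names `merge_concept` and `add_concept_citation` are hypothetical, and summing yearly counts when the same concept appears in several papers is an assumption about how deduplication combines the data.

```python
from collections import Counter

import networkx as nx


def merge_concept(G, concept_id, citations_by_year):
    """Add a concept once; repeated sightings merge their yearly counts (assumed behavior)."""
    if concept_id not in G:
        G.add_node(concept_id, type="Concept", citation_by_year={})
    merged = Counter(G.nodes[concept_id]["citation_by_year"])
    merged.update(citations_by_year)  # adds counts year by year
    G.nodes[concept_id]["citation_by_year"] = dict(merged)


def add_concept_citation(G, src, dst, year):
    """Increment the per-year count stored on a CONCEPT_CITES edge."""
    if not G.has_edge(src, dst):
        G.add_edge(src, dst, type="CONCEPT_CITES", citation_by_year={})
    by_year = G.edges[src, dst]["citation_by_year"]
    by_year[year] = by_year.get(year, 0) + 1


G = nx.DiGraph()
merge_concept(G, "concept_a", {2001: 3})
merge_concept(G, "concept_a", {2001: 2, 2002: 1})  # same concept seen in another paper
merge_concept(G, "concept_b", {})
add_concept_citation(G, "concept_a", "concept_b", 2002)
add_concept_citation(G, "concept_a", "concept_b", 2002)
```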

Parameters:

Name | Type | Description
cfg | dict | Project configuration dict

Returns: nx.DiGraph containing only concept nodes.


graph_to_json(G: nx.DiGraph) → dict

Convert a NetworkX graph to a JSON-serializable dictionary.

Parameters:

Name | Type | Description
G | nx.DiGraph | NetworkX directed graph

Returns: Dict with structure:

{
    "nodes": [
        {"id": "laureate_779", "type": "Laureate", "name": "...", ...},
        ...
    ],
    "edges": [
        {"source": "laureate_779", "target": "award_...", "type": "WON_AWARD", ...},
        ...
    ]
}
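A conversion that produces this shape can be sketched as follows. The function name `graph_to_json_sketch` is hypothetical, and the real `graph_to_json` may handle attribute types differently; this version simply flattens each attribute dict next to the ID fields.

```python
import json

import networkx as nx


def graph_to_json_sketch(G):
    """Illustrative only: flatten node and edge attributes into plain dicts."""
    nodes = [{"id": node, **attrs} for node, attrs in G.nodes(data=True)]
    edges = [
        {"source": u, "target": v, **attrs}
        for u, v, attrs in G.edges(data=True)
    ]
    return {"nodes": nodes, "edges": edges}


G = nx.DiGraph()
G.add_node("laureate_779", type="Laureate", name="Example Laureate")
G.add_node("award_1", type="Award")  # placeholder ID, assumed scheme
G.add_edge("laureate_779", "award_1", type="WON_AWARD")

payload = graph_to_json_sketch(G)
text = json.dumps(payload, indent=2)  # succeeds because every value is JSON-serializable
```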

Print a summary of graph statistics:

  • Total nodes and edges
  • Breakdown by node type
  • Breakdown by edge type
  • Number of connected components
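These statistics can be computed with a few NetworkX calls, as in the sketch below. The function name `summarize` is hypothetical. Because the graph is directed, "connected components" is read here as weakly connected components, which is an assumption about the module's behavior.

```python
from collections import Counter

import networkx as nx


def summarize(G):
    """Illustrative summary of a typed directed graph."""
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "node_types": Counter(
            attrs.get("type", "unknown") for _, attrs in G.nodes(data=True)
        ),
        "edge_types": Counter(
            attrs.get("type", "unknown") for _, _, attrs in G.edges(data=True)
        ),
        # Assumption: components are counted ignoring edge direction.
        "components": nx.number_weakly_connected_components(G),
    }


G = nx.DiGraph()
G.add_node("laureate_779", type="Laureate")
G.add_node("award_1", type="Award")
G.add_node("concept_a", type="Concept")  # isolated node forms its own component
G.add_edge("laureate_779", "award_1", type="WON_AWARD")

stats = summarize(G)
```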

run() → None

Execute the full Phase 3+4 pipeline:

  1. Load configuration
  2. Build graph via build_graph()
  3. Print statistics
  4. Export to knowledge_graph.json
  5. Export node/edge files separately
  6. Export the full graph in GraphML format
  7. Build and export simplified concept graph (JSON + GraphML)

GraphML Compatibility Note

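GraphML attribute values must be scalars (strings, numbers, or booleans); nested values such as dictionaries and lists are not supported. Before GraphML export, complex attributes are therefore serialized into JSON strings, so fields like citation_by_year can be stored safely in .graphml files and decoded again after loading.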
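The serialization step can be sketched like this. The helper name `stringify_complex_attrs` and the file name `concepts.graphml` are hypothetical, and the module's own conversion may cover more cases. Note that JSON turns integer keys into strings, so years come back as `"2001"` rather than `2001`.

```python
import json
import os
import tempfile

import networkx as nx


def stringify_complex_attrs(G):
    """Return a copy of G whose dict/list attribute values are JSON strings."""
    H = G.copy()  # attribute dicts are copied, so G itself is left untouched
    for _, attrs in H.nodes(data=True):
        for key, value in list(attrs.items()):
            if isinstance(value, (dict, list)):
                attrs[key] = json.dumps(value)
    for _, _, attrs in H.edges(data=True):
        for key, value in list(attrs.items()):
            if isinstance(value, (dict, list)):
                attrs[key] = json.dumps(value)
    return H


G = nx.DiGraph()
G.add_node("concept_a", type="Concept", citation_by_year={2001: 5})
G.add_node("concept_b", type="Concept", citation_by_year={})
G.add_edge("concept_a", "concept_b", type="CONCEPT_CITES", citation_by_year={2002: 2})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "concepts.graphml")
    nx.write_graphml(stringify_complex_attrs(G), path)
    restored = nx.read_graphml(path)

# Decode the JSON string back into a dictionary after loading.
decoded = json.loads(restored.nodes["concept_a"]["citation_by_year"])
```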