src/graph_builder
Phase 3+4 module for constructing the knowledge graph.
Functions
build_graph(cfg: dict) → nx.DiGraph
Build the complete knowledge graph from Parquet data files.
Parameters:
| Name | Type | Description |
|---|---|---|
| `cfg` | `dict` | Project configuration dict |
Returns: NetworkX directed graph containing all node types and edge types.
Process:
- Load all Parquet files (`laureates`, `awards`, `publications`, `institutions`, `concepts`)
- Add Laureate nodes with attributes
- Add Award nodes + `WON_AWARD` edges
- Add Work nodes + `AUTHORED` edges
- Add Concept nodes + `INTRODUCES`/`APPLIES` edges
- Add Field nodes + `BELONGS_TO` edges
- Detect and build `CROSS_INSPIRED` edges
- Build `CITES` edges from referenced works
- Build `AWARDED_FOR` edges linking awards to concepts
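The first steps of this process can be sketched on toy data. This is a minimal illustration, not the module's actual code: the in-memory rows stand in for the Parquet tables, and the column names (`id`, `name`, `laureate_id`, `year`), the `award_example` identifier, and the helper `build_toy_graph` are all assumptions made for the example.

```python
import networkx as nx

# Hypothetical rows standing in for the laureates and awards Parquet tables.
laureates = [{"id": 779, "name": "Example Laureate"}]
awards = [{"id": "example", "laureate_id": 779, "year": 2000}]


def build_toy_graph(laureates, awards):
    """Add Laureate and Award nodes, then link them with WON_AWARD edges."""
    G = nx.DiGraph()
    for row in laureates:
        G.add_node(f"laureate_{row['id']}", type="Laureate", name=row["name"])
    for row in awards:
        award_id = f"award_{row['id']}"
        G.add_node(award_id, type="Award", year=row["year"])
        # Edge direction follows the relation name: laureate -> award.
        G.add_edge(f"laureate_{row['laureate_id']}", award_id, type="WON_AWARD")
    return G


G = build_toy_graph(laureates, awards)
```

Storing the node and edge kind in a `type` attribute matches the JSON structure documented for `graph_to_json`; the remaining node and edge types would presumably follow the same add-node-then-add-edge pattern.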
build_simplified_concept_graph(cfg: dict) → nx.DiGraph
Build a simplified concept-only graph.
Features:
- Deduplicates concepts across papers
- Stores node-level `citation_by_year` dictionaries
- Uses `CONCEPT_CITES` edges with year-wise citation dictionaries
Parameters:
| Name | Type | Description |
|---|---|---|
| `cfg` | `dict` | Project configuration dict |
Returns: nx.DiGraph containing only concept nodes.
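One way the deduplication and year-wise counts could work is sketched below. The record layout, the `normalize` rule (trim and lowercase), and the choice to count citations on the cited concept's node are assumptions for illustration; the module may deduplicate and aggregate differently.

```python
import networkx as nx

# Hypothetical records: (citing concept, cited concept, citation year).
records = [
    ("Concept B", "Concept A", 1995),
    ("concept b ", "Concept A", 1995),  # same concepts, different spelling
    ("Concept B", "concept a", 2001),
]


def normalize(name):
    """Assumed dedup key: trimmed, lowercased concept name."""
    return name.strip().lower()


def build_toy_concept_graph(records):
    G = nx.DiGraph()
    for src, dst, year in records:
        s, d = normalize(src), normalize(dst)
        for node in (s, d):
            if node not in G:
                G.add_node(node, type="Concept", citation_by_year={})
        if not G.has_edge(s, d):
            G.add_edge(s, d, type="CONCEPT_CITES", citation_by_year={})
        # Increment the per-year count on the edge and on the cited node.
        edge_counts = G.edges[s, d]["citation_by_year"]
        edge_counts[year] = edge_counts.get(year, 0) + 1
        node_counts = G.nodes[d]["citation_by_year"]
        node_counts[year] = node_counts.get(year, 0) + 1
    return G


concept_graph = build_toy_concept_graph(records)
```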
graph_to_json(G: nx.DiGraph) → dict
Convert a NetworkX graph to a JSON-serializable dictionary.
Parameters:
| Name | Type | Description |
|---|---|---|
| `G` | `nx.DiGraph` | NetworkX directed graph |
Returns: Dict with structure:
{
"nodes": [
{"id": "laureate_779", "type": "Laureate", "name": "...", ...},
...
],
"edges": [
{"source": "laureate_779", "target": "award_...", "type": "WON_AWARD", ...},
...
]
}
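A conversion producing this shape can be sketched in a few lines. The function name `graph_to_json_sketch` and the `award_example` node are hypothetical; the real `graph_to_json` may add fields or handle attribute types differently.

```python
import json

import networkx as nx


def graph_to_json_sketch(G):
    """Flatten node and edge attributes into the nodes/edges structure."""
    nodes = [{"id": node, **attrs} for node, attrs in G.nodes(data=True)]
    edges = [
        {"source": u, "target": v, **attrs}
        for u, v, attrs in G.edges(data=True)
    ]
    return {"nodes": nodes, "edges": edges}


G = nx.DiGraph()
G.add_node("laureate_779", type="Laureate", name="Example Laureate")
G.add_node("award_example", type="Award")
G.add_edge("laureate_779", "award_example", type="WON_AWARD")

payload = graph_to_json_sketch(G)
text = json.dumps(payload, indent=2)  # serializable because all values are plain types
```

Note that in this sketch an attribute literally named `id`, `source`, or `target` would overwrite the structural key, so a production version would need to guard against that.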
print_graph_stats(G: nx.DiGraph) → None
Print a summary of graph statistics:
- Total nodes and edges
- Breakdown by node type
- Breakdown by edge type
- Number of connected components
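The statistics listed above could be computed roughly as follows. The helper `graph_stats_sketch` is hypothetical, and counting weakly connected components is an assumption: the source does not say which notion of connectivity is used, and `nx.number_connected_components` is not defined for directed graphs.

```python
from collections import Counter

import networkx as nx


def graph_stats_sketch(G):
    """Collect totals and per-type breakdowns for a directed graph."""
    node_types = Counter(
        attrs.get("type", "Unknown") for _, attrs in G.nodes(data=True)
    )
    edge_types = Counter(
        attrs.get("type", "Unknown") for _, _, attrs in G.edges(data=True)
    )
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "node_types": dict(node_types),
        "edge_types": dict(edge_types),
        # Assumption: components are counted ignoring edge direction.
        "components": nx.number_weakly_connected_components(G),
    }


G = nx.DiGraph()
G.add_node("laureate_779", type="Laureate")
G.add_node("award_example", type="Award")
G.add_node("concept_example", type="Concept")  # isolated node
G.add_edge("laureate_779", "award_example", type="WON_AWARD")

stats = graph_stats_sketch(G)
for key, value in stats.items():
    print(f"{key}: {value}")
```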
run() → None
Execute the full Phase 3+4 pipeline:
- Load configuration
- Build graph via `build_graph()`
- Print statistics
- Export to `knowledge_graph.json`
- Export node/edge files separately
- Export GraphML format
- Build and export simplified concept graph (JSON + GraphML)
GraphML Compatibility Note
GraphML does not support nested attribute values such as dictionaries or lists.
Before GraphML export, complex attributes are serialized into JSON strings,
so fields like `citation_by_year` can be stored safely in `.graphml` files.
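A minimal sketch of that serialization step is shown below. The helper names `flatten_for_graphml` and `_stringify` are hypothetical, and the set of types treated as "complex" (dict, list, tuple) is an assumption. The sketch works on a copy so the original graph keeps its nested values.

```python
import json

import networkx as nx


def _stringify(attrs):
    """Replace dict/list/tuple values with JSON strings, in place."""
    for key, value in list(attrs.items()):
        if isinstance(value, (dict, list, tuple)):
            attrs[key] = json.dumps(value)


def flatten_for_graphml(G):
    """Return a copy of G whose attributes are all GraphML-safe scalars."""
    H = G.copy()  # new attribute dicts, so G itself is left untouched
    for _, attrs in H.nodes(data=True):
        _stringify(attrs)
    for _, _, attrs in H.edges(data=True):
        _stringify(attrs)
    return H


G = nx.DiGraph()
G.add_node("c1", type="Concept", citation_by_year={1995: 2})
G.add_node("c2", type="Concept", citation_by_year={})
G.add_edge("c1", "c2", type="CONCEPT_CITES", citation_by_year={2001: 1})

H = flatten_for_graphml(G)
xml = "\n".join(nx.generate_graphml(H))  # same format nx.write_graphml produces

# Reading back: the attribute is a string and must be decoded explicitly.
restored = nx.parse_graphml(xml)
counts = json.loads(restored.nodes["c1"]["citation_by_year"])
```

JSON object keys are always strings, so integer years come back as `"1995"` rather than `1995` after a round trip; a consumer that needs integer keys has to convert them.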