Phase 3+4: Knowledge Graph Construction
Module: src/graph_builder.py
Estimated Time: ~2 days
Objective
Combine all processed data (laureates, awards, publications, concepts) into a NetworkX directed graph with 5 node types and 9 edge types, then export it to multiple formats.
Node Types
flowchart LR
    L[Laureate] --- A[Award]
    L --- W[Work]
    W --- C[Concept]
    C --- F[Field]
| Node Type | Example | Key Attributes |
|---|---|---|
| Laureate | "Aaron Ciechanover" | name, nationality, birth_year, gender |
| Award | "Chemistry 2004" | year, category, motivation, prize_amount |
| Work | Paper on ubiquitin | title, year, abstract, keywords, citation_count, doi |
| Concept | "Ubiquitin-Proteasome Pathway" | name, description, field, subfield, first_appeared_year |
| Field | "Biology" | name, parent_field, description |
Edge Types
| Edge Type | Source → Target | Attributes | Semantics |
|---|---|---|---|
| `WON_AWARD` | Laureate → Award | year, portion | Laureate received the award |
| `AUTHORED` | Laureate → Work | position | Laureate authored the paper |
| `CITES` | Work → Work | - | Paper citation |
| `INTRODUCES` | Work → Concept | confidence | Paper introduced/proposed the concept |
| `APPLIES` | Work → Concept | confidence | Paper applied/used the concept |
| `BELONGS_TO` | Concept → Field | - | Concept belongs to a scientific field |
| `DERIVED_FROM` | Concept → Concept | year, description | Concept evolved from another (same field) |
| `CROSS_INSPIRED` | Concept → Concept | year, source_field, target_field, description | Cross-disciplinary migration |
| `AWARDED_FOR` | Award → Concept | - | Award recognized this concept |
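The source and target constraints in the table can be written down as a small lookup and checked before an edge is added. The following is only a sketch under assumptions: `EDGE_SCHEMA` and `is_valid_edge` are hypothetical names that do not come from `src/graph_builder.py`, and the node IDs follow the pattern shown in the JSON example in this document.

```python
# Allowed (source type, target type) per edge type, copied from the table above.
# EDGE_SCHEMA and is_valid_edge are illustrative names, not part of the real module.
EDGE_SCHEMA = {
    "WON_AWARD": ("Laureate", "Award"),
    "AUTHORED": ("Laureate", "Work"),
    "CITES": ("Work", "Work"),
    "INTRODUCES": ("Work", "Concept"),
    "APPLIES": ("Work", "Concept"),
    "BELONGS_TO": ("Concept", "Field"),
    "DERIVED_FROM": ("Concept", "Concept"),
    "CROSS_INSPIRED": ("Concept", "Concept"),
    "AWARDED_FOR": ("Award", "Concept"),
}


def is_valid_edge(node_types, source, target, edge_type):
    """Return True if edge_type is known and connects nodes of the expected types."""
    expected = EDGE_SCHEMA.get(edge_type)
    if expected is None:
        return False
    actual = (node_types.get(source), node_types.get(target))
    return actual == expected


# node_types maps node ID -> node type for the nodes added so far.
node_types = {
    "laureate_779": "Laureate",
    "award_2004_3_779": "Award",
}

forward = is_valid_edge(node_types, "laureate_779", "award_2004_3_779", "WON_AWARD")
reverse = is_valid_edge(node_types, "award_2004_3_779", "laureate_779", "WON_AWARD")
print(forward, reverse)
```

A check like this catches reversed or mistyped edges early. Note also that a plain NetworkX `DiGraph` stores at most one edge per ordered node pair, so a Work that both introduces and applies the same Concept would need either a `MultiDiGraph` or a merged edge; which of these the builder uses is not stated here.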
The CROSS_INSPIRED Edge
This is the most valuable edge type in the graph, representing cross-disciplinary concept migration:
Optimization Theory --CROSS_INSPIRED--> SGD (Math → AI, ~1960s)
Transformer --CROSS_INSPIRED--> AlphaFold (AI → Structural Biology, 2018)
X-ray Diffraction --CROSS_INSPIRED--> DNA Double Helix (Physics → Molecular Biology, 1953)
Statistical Mechanics --CROSS_INSPIRED--> Boltzmann Machine (Physics → ML, 1985)
Construction Process
flowchart TD
L[Load laureates.parquet] --> G[Initialize Graph]
A[Load awards.parquet] --> G
P[Load publications.parquet] --> G
C[Load concepts.parquet] --> G
G --> N1[Add Laureate Nodes]
N1 --> N2[Add Award Nodes + WON_AWARD Edges]
N2 --> N3[Add Work Nodes + AUTHORED Edges]
N3 --> N4[Add Concept Nodes + INTRODUCES/APPLIES Edges]
N4 --> N5[Add Field Nodes + BELONGS_TO Edges]
N5 --> E1[Build CROSS_INSPIRED Edges]
E1 --> E2[Build CITES Edges]
E2 --> E3[Build AWARDED_FOR Edges]
E3 --> EX[Export to JSON + GraphML]
Cross-Discipline Detection Algorithm
for each paper W:
    W_concepts = concepts associated with W
    W_field = primary field of W
    for ref in W.referenced_works:
        ref_field = primary field of ref
        ref_concepts = concepts associated with ref
        if W_field ≠ ref_field:
            shared_concepts = W_concepts ∩ ref_concepts
            if shared_concepts is not empty:
                → Generate CROSS_INSPIRED edge
                → Record: concept migrated from ref_field to W_field
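The pseudocode above could be realized roughly as follows. This is a minimal sketch, not the module's actual code: the function name `detect_cross_inspired`, the input layout (a dict keyed by paper ID), and the shape of the returned records are all assumptions made for illustration. The pseudocode does not say which concept nodes become the source and target of the edge, so the sketch only records one migration per shared concept.

```python
def detect_cross_inspired(papers):
    """Find concepts shared between a paper and a reference from another field.

    papers: dict mapping paper ID -> {"field": str, "concepts": set of str,
            "referenced_works": list of paper IDs}  (assumed layout)
    Returns a list of migration records, one per shared concept.
    """
    migrations = []
    for paper_id, paper in papers.items():
        for ref_id in paper["referenced_works"]:
            ref = papers.get(ref_id)
            if ref is None:
                # The reference is not in the dataset, so its field is unknown.
                continue
            if paper["field"] == ref["field"]:
                # Same field: not a cross-disciplinary citation.
                continue
            shared = paper["concepts"] & ref["concepts"]
            for concept in sorted(shared):
                migrations.append({
                    "type": "CROSS_INSPIRED",
                    "concept": concept,
                    "source_field": ref["field"],
                    "target_field": paper["field"],
                    "citing_work": paper_id,
                    "cited_work": ref_id,
                })
    return migrations


# Hypothetical two-paper dataset for illustration only.
papers = {
    "W_physics": {
        "field": "Physics",
        "concepts": {"Statistical Mechanics"},
        "referenced_works": [],
    },
    "W_ml": {
        "field": "Machine Learning",
        "concepts": {"Statistical Mechanics", "Boltzmann Machine"},
        "referenced_works": ["W_physics", "W_missing"],
    },
}

migrations = detect_cross_inspired(papers)
print(migrations)
```

References that are missing from the dataset are skipped rather than treated as a different field, which avoids generating migrations from incomplete data.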
Output Files
| File | Format | Description |
|---|---|---|
| `knowledge_graph.json` | JSON | Full graph with nodes and edges |
| `nodes.json` | JSON | Nodes only |
| `edges.json` | JSON | Edges only |
| `knowledge_graph.graphml` | GraphML | For Gephi, Cytoscape, etc. |
| `concept_graph_simplified.json` | JSON | Simplified concept-only graph |
| `concept_graph_simplified_nodes.json` | JSON | Simplified graph nodes |
| `concept_graph_simplified_edges.json` | JSON | Simplified graph edges |
| `concept_graph_simplified.graphml` | GraphML | Simplified graph (complex attrs serialized as JSON strings) |
Simplified Concept Graph Schema (New)
- Contains only Concept nodes (deduplicated across papers)
- Node attributes include:
  - `citation_by_year`: `{year: citation_count}`
  - `paper_count`: number of linked papers
  - `total_citations`: sum over all years
- Edge type is `CONCEPT_CITES`, with:
  - `citation_by_year`: `{year: citation_count}`
  - `total_citations`: sum over all years

Note: `citation_by_year` is a dictionary. GraphML cannot store nested structures directly, so the `.graphml` export serializes it as a JSON string, while the JSON exports keep dictionary values.
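The serialization step described in the note might look like the sketch below. The helper name `flatten_for_graphml` and the attribute values are hypothetical; only the idea of turning nested values into JSON strings comes from the text above.

```python
import json


def flatten_for_graphml(attrs):
    """Return a copy of attrs with dict and list values serialized as JSON strings.

    Scalar values (str, int, float, bool) are passed through unchanged.
    """
    flat = {}
    for key, value in attrs.items():
        if isinstance(value, (dict, list)):
            flat[key] = json.dumps(value, sort_keys=True)
        else:
            flat[key] = value
    return flat


# Hypothetical concept node; the numbers are made up for illustration.
node = {
    "name": "Ubiquitin-Proteasome Pathway",
    "paper_count": 3,
    "citation_by_year": {2004: 120, 2005: 95},
    "total_citations": 215,
}

flat = flatten_for_graphml(node)
restored = json.loads(flat["citation_by_year"])
print(flat["citation_by_year"])
print(restored)
```

One consequence worth knowing when reading the GraphML back: JSON object keys are always strings, so integer years come back as `"2004"` rather than `2004` and need converting if they are used numerically.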
JSON Structure
{
"nodes": [
{
"id": "laureate_779",
"type": "Laureate",
"name": "Aaron Ciechanover",
"nationality": "Israeli",
"birth_year": 1947
},
...
],
"edges": [
{
"source": "laureate_779",
"target": "award_2004_3_779",
"type": "WON_AWARD",
"year": 2004
},
...
]
}
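A consumer of this file can tally node and edge types and check that every edge points at a known node. The sketch below assumes the structure shown above; the `summarize` helper is a hypothetical name, and the Award node in the sample is an assumed record (only its ID appears in the example above). To run it against a real export, replace the inline sample with `json.load` on the exported file.

```python
import json
from collections import Counter

# Inline sample mirroring the structure above; the Award node is assumed.
SAMPLE = """
{
  "nodes": [
    {"id": "laureate_779", "type": "Laureate", "name": "Aaron Ciechanover"},
    {"id": "award_2004_3_779", "type": "Award", "year": 2004}
  ],
  "edges": [
    {"source": "laureate_779", "target": "award_2004_3_779",
     "type": "WON_AWARD", "year": 2004}
  ]
}
"""


def summarize(graph):
    """Count nodes and edges by type and list edges with unknown endpoints."""
    node_ids = {node["id"] for node in graph["nodes"]}
    node_counts = Counter(node["type"] for node in graph["nodes"])
    edge_counts = Counter(edge["type"] for edge in graph["edges"])
    dangling = [
        edge for edge in graph["edges"]
        if edge["source"] not in node_ids or edge["target"] not in node_ids
    ]
    return node_counts, edge_counts, dangling


graph = json.loads(SAMPLE)
node_counts, edge_counts, dangling = summarize(graph)
print(dict(node_counts), dict(edge_counts), len(dangling))
```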
Graph Statistics (Sample Mode)
| Metric | Value |
|---|---|
| Total Nodes | ~97 |
| Total Edges | ~181 |
| Laureates | 5 |
| Works | 25 |
| Concepts | 51 |
| Fields | 11 |
| Cross-disciplinary migrations | 11 |
Running
# Build full graph only
uv run python main.py --phase 3 --graph-mode full
# Build concept graph only
uv run python main.py --phase 3 --graph-mode concept
# Build both (default)
uv run python main.py --phase 3
# Standalone
uv run python -m src.graph_builder
Streaming for Large Datasets
The graph builder uses chunked Parquet reading and streaming JSON output to keep memory usage constant regardless of dataset size. The chunk size is configurable via graph.chunk_size in config/settings.yaml (default: 50,000 rows).
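The general pattern behind chunked input and streaming output can be sketched with the standard library alone. This is an illustration under assumptions, not the builder's code: `chunked` and `write_graph_streaming` are hypothetical names, and a plain generator stands in for the Parquet reader (in practice the chunks would come from a Parquet library, for example pyarrow's `ParquetFile.iter_batches`).

```python
import io
import json


def chunked(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def _write_array(out, records):
    """Write records as a JSON array, one record at a time."""
    out.write("[")
    for index, record in enumerate(records):
        if index:
            out.write(", ")
        out.write(json.dumps(record))
    out.write("]")


def write_graph_streaming(out, nodes, edges):
    """Write {"nodes": [...], "edges": [...]} without building the full document."""
    out.write('{"nodes": ')
    _write_array(out, nodes)
    out.write(', "edges": ')
    _write_array(out, edges)
    out.write("}")


# A generator stands in for rows read from a Parquet file.
rows = ({"id": f"concept_{i}", "type": "Concept"} for i in range(5))
nodes = (row for chunk in chunked(rows, chunk_size=2) for row in chunk)

buffer = io.StringIO()
write_graph_streaming(buffer, nodes, edges=[])
parsed = json.loads(buffer.getvalue())
print(len(parsed["nodes"]))
```

Because both the reader and the writer are iterators, only one chunk of rows and one serialized record are held at a time on the output path.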
Concept Graph Schema
The Concept Graph is a simplified representation of the knowledge graph, focusing on concepts and their relationships. It is designed to highlight the flow of ideas and their connections across disciplines.
Schema Details
- Nodes:
  - `id`: Unique identifier for the concept.
  - `name`: Human-readable name of the concept.
  - `paper_count`: Number of papers associated with the concept.
  - `total_citations`: Total citations received by papers linked to the concept.
- Edges:
  - `source`: Source concept ID.
  - `target`: Target concept ID.
  - `type`: Relationship type (e.g., `CONCEPT_CITES`).
  - `total_citations`: Total citations between the connected concepts.
Construction Process
- Extract concepts from papers.
- Deduplicate concepts across papers.
- Establish relationships based on citations and shared concepts.
- Export the graph in JSON and GraphML formats.
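The steps above can be sketched as a small aggregation over paper-level data. Several details here are assumptions, since the text does not define them: the function name `build_concept_graph`, the input layout, the rule that each paper-to-paper citation adds one count to every (citing concept, cited concept) pair, the edge direction (citing concept as source), and the decision to skip self-loops.

```python
from collections import defaultdict


def build_concept_graph(paper_concepts, citations):
    """Aggregate paper-level citations into concept-level CONCEPT_CITES edges.

    paper_concepts: dict mapping paper ID -> set of concept names
    citations: iterable of (citing_paper, cited_paper, year) tuples
    """
    # Deduplicate concepts across papers by counting each concept once per paper.
    paper_count = defaultdict(int)
    for concepts in paper_concepts.values():
        for concept in concepts:
            paper_count[concept] += 1

    # Count citations per (source concept, target concept) pair and year.
    by_year = defaultdict(lambda: defaultdict(int))
    for citing, cited, year in citations:
        for src in paper_concepts.get(citing, ()):
            for dst in paper_concepts.get(cited, ()):
                if src != dst:  # assumption: no self-loops
                    by_year[(src, dst)][year] += 1

    nodes = [
        {"id": concept, "name": concept, "paper_count": count}
        for concept, count in sorted(paper_count.items())
    ]
    edges = [
        {
            "source": src,
            "target": dst,
            "type": "CONCEPT_CITES",
            "citation_by_year": dict(years),
            "total_citations": sum(years.values()),
        }
        for (src, dst), years in sorted(by_year.items())
    ]
    return nodes, edges


# Hypothetical input for illustration only.
paper_concepts = {
    "p1": {"Transformer"},
    "p2": {"AlphaFold"},
    "p3": {"AlphaFold"},
}
citations = [("p2", "p1", 2018), ("p3", "p1", 2019)]

nodes, edges = build_concept_graph(paper_concepts, citations)
print(nodes)
print(edges)
```

The resulting node and edge dictionaries match the field names in the schema, so they could be passed on to the JSON and GraphML export step.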