Phase 3+4: Knowledge Graph Construction
Module: src/graph_builder.py
Estimated Time: ~2 days
Objective
Combine all processed data (laureates, awards, publications, concepts) into a NetworkX directed graph with 5 node types and 9 edge types, then export it to multiple formats.
Node Types
flowchart LR
L[Laureate] --- A[Award]
L --- W[Work]
W --- C[Concept]
C --- F[Field]
| Node Type | Example | Key Attributes |
|---|---|---|
| Laureate | "Aaron Ciechanover" | name, nationality, birth_year, gender |
| Award | "Chemistry 2004" | year, category, motivation, prize_amount |
| Work | Paper on ubiquitin | title, year, abstract, keywords, citation_count, doi |
| Concept | "Ubiquitin-Proteasome Pathway" | name, description, field, subfield, first_appeared_year |
| Field | "Biology" | name, parent_field, description |
Edge Types
| Edge Type | Source → Target | Attributes | Semantics |
|---|---|---|---|
| WON_AWARD | Laureate → Award | year, portion | Laureate received the award |
| AUTHORED | Laureate → Work | position | Laureate authored the paper |
| CITES | Work → Work | - | Paper citation |
| INTRODUCES | Work → Concept | confidence | Paper introduced/proposed the concept |
| APPLIES | Work → Concept | confidence | Paper applied/used the concept |
| BELONGS_TO | Concept → Field | - | Concept belongs to a scientific field |
| DERIVED_FROM | Concept → Concept | year, description | Concept evolved from another (same field) |
| CROSS_INSPIRED | Concept → Concept | year, source_field, target_field, description | Cross-disciplinary migration |
| AWARDED_FOR | Award → Concept | - | Award recognized this concept |
The CROSS_INSPIRED Edge
This is the most valuable edge type in the graph, representing cross-disciplinary concept migration:
Optimization Theory --CROSS_INSPIRED--> SGD (Math → AI, ~1960s)
Transformer --CROSS_INSPIRED--> AlphaFold (AI → Structural Biology, 2018)
X-ray Diffraction --CROSS_INSPIRED--> DNA Double Helix (Physics → Molecular Biology, 1953)
Statistical Mechanics --CROSS_INSPIRED--> Boltzmann Machine (Physics → ML, 1985)
Construction Process
flowchart TD
L[Load laureates.parquet] --> G[Initialize Graph]
A[Load awards.parquet] --> G
P[Load publications.parquet] --> G
C[Load concepts.parquet] --> G
G --> N1[Add Laureate Nodes]
N1 --> N2[Add Award Nodes + WON_AWARD Edges]
N2 --> N3[Add Work Nodes + AUTHORED Edges]
N3 --> N4[Add Concept Nodes + INTRODUCES/APPLIES Edges]
N4 --> N5[Add Field Nodes + BELONGS_TO Edges]
N5 --> E1[Build CROSS_INSPIRED Edges]
E1 --> E2[Build CITES Edges]
E2 --> E3[Build AWARDED_FOR Edges]
E3 --> EX[Export to JSON + GraphML]
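The first steps of this pipeline could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of src/graph_builder.py: the helper name build_graph, the record keys (laureate_id, award_id, portion and so on), and the inline sample rows are all hypothetical stand-ins for the Parquet inputs.

```python
import networkx as nx


def build_graph(laureates, awards):
    """Add Laureate and Award nodes plus WON_AWARD edges to a directed graph.

    `laureates` and `awards` are lists of dicts. Their keys are assumed
    for this sketch and may differ from the real Parquet columns.
    """
    graph = nx.DiGraph()

    for row in laureates:
        graph.add_node(
            row["laureate_id"],
            type="Laureate",
            name=row["name"],
            nationality=row["nationality"],
            birth_year=row["birth_year"],
        )

    for row in awards:
        graph.add_node(
            row["award_id"],
            type="Award",
            year=row["year"],
            category=row["category"],
        )
        # Edge direction follows the schema: Laureate -> Award.
        graph.add_edge(
            row["laureate_id"],
            row["award_id"],
            type="WON_AWARD",
            year=row["year"],
            portion=row["portion"],
        )

    return graph


# Hypothetical sample rows; the IDs mimic the pattern used in the JSON output.
laureates = [{"laureate_id": "laureate_779", "name": "Aaron Ciechanover",
              "nationality": "Israeli", "birth_year": 1947}]
awards = [{"award_id": "award_2004_3_779", "laureate_id": "laureate_779",
           "year": 2004, "category": "Chemistry", "portion": "1/3"}]

graph = build_graph(laureates, awards)
```

One design point to keep in mind: nx.DiGraph stores at most one edge per ordered node pair. If a Work can both introduce and apply the same Concept, nx.MultiDiGraph would be needed to keep both edges. Which class the module actually uses is not stated here.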
Cross-Discipline Detection Algorithm
for each paper W:
    W_concepts = concepts associated with W
    W_field = primary field of W
    for ref in W.referenced_works:
        ref_concepts = concepts associated with ref
        ref_field = primary field of ref
        if W_field ≠ ref_field:
            shared_concepts = W_concepts ∩ ref_concepts
            if shared_concepts is not empty:
                → Generate CROSS_INSPIRED edge
                → Record: concept migrated from ref_field to W_field
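A minimal Python rendering of this loop might look like the sketch below. The function name find_cross_inspired, the shape of the papers mapping, and the record keys are assumptions made for illustration. How each record is mapped onto a Concept-to-Concept edge is not specified here, so the sketch only returns migration records.

```python
def find_cross_inspired(papers):
    """Return one migration record per concept shared across fields.

    `papers` maps a paper ID to a dict with assumed keys:
    "field" (str), "concepts" (set of str), "year" (int),
    and "referenced_works" (list of paper IDs).
    """
    migrations = []
    for paper_id, paper in papers.items():
        for ref_id in paper["referenced_works"]:
            ref = papers.get(ref_id)
            if ref is None:
                continue  # cited paper is not in the dataset
            if paper["field"] == ref["field"]:
                continue  # same field, so not a cross-disciplinary link
            shared = paper["concepts"] & ref["concepts"]
            for concept in sorted(shared):
                migrations.append({
                    "type": "CROSS_INSPIRED",
                    "concept": concept,
                    "source_field": ref["field"],    # where the idea came from
                    "target_field": paper["field"],  # where it was reused
                    "year": paper["year"],
                    "evidence": (ref_id, paper_id),
                })
    return migrations


# Hypothetical two-paper dataset.
papers = {
    "w_phys": {"field": "Physics", "concepts": {"Statistical Mechanics"},
               "year": 1975, "referenced_works": []},
    "w_ml": {"field": "Machine Learning",
             "concepts": {"Statistical Mechanics", "Boltzmann Machine"},
             "year": 1985, "referenced_works": ["w_phys", "w_missing"]},
}
migrations = find_cross_inspired(papers)
```

Here the only record produced describes "Statistical Mechanics" moving from Physics to Machine Learning, because the reference to w_missing is skipped and w_phys cites nothing.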
Output Files
| File | Format | Description |
|---|---|---|
| knowledge_graph.json | JSON | Full graph with nodes and edges |
| nodes.json | JSON | Nodes only |
| edges.json | JSON | Edges only |
| knowledge_graph.graphml | GraphML | For Gephi, Cytoscape, etc. |
| concept_graph_simplified.json | JSON | Simplified concept-only graph |
| concept_graph_simplified_nodes.json | JSON | Simplified graph nodes |
| concept_graph_simplified_edges.json | JSON | Simplified graph edges |
| concept_graph_simplified.graphml | GraphML | Simplified graph (complex attrs serialized as JSON strings) |
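As a rough sketch, the three JSON files for the full graph could be written with the standard library alone. The helper name export_json is hypothetical, and whether nodes.json and edges.json hold bare lists (as assumed here) or wrapped objects is not stated in this document.

```python
import json
import tempfile
from pathlib import Path


def export_json(nodes, edges, out_dir):
    """Write the full graph plus nodes-only and edges-only JSON files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    payloads = {
        "knowledge_graph.json": {"nodes": nodes, "edges": edges},
        "nodes.json": nodes,  # assumed layout: a bare list
        "edges.json": edges,  # assumed layout: a bare list
    }
    for filename, payload in payloads.items():
        with open(out_dir / filename, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, ensure_ascii=False, indent=2)
    return sorted(payloads)


nodes = [{"id": "laureate_779", "type": "Laureate"}]
edges = [{"source": "laureate_779", "target": "award_2004_3_779",
          "type": "WON_AWARD", "year": 2004}]

# Write into a temporary directory and read the full graph back.
with tempfile.TemporaryDirectory() as tmp:
    written = export_json(nodes, edges, tmp)
    full_path = Path(tmp) / "knowledge_graph.json"
    reloaded = json.loads(full_path.read_text(encoding="utf-8"))
```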
Simplified Concept Graph Schema (New)
- Contains only Concept nodes (deduplicated across papers)
- Node attributes include:
  - citation_by_year: {year: citation_count}
  - paper_count: number of linked papers
  - total_citations: sum over all years
- Edge type is CONCEPT_CITES, with:
  - citation_by_year: {year: citation_count}
  - total_citations: sum over all years

Note: citation_by_year is a dictionary. GraphML cannot store nested structures directly, so the .graphml export serializes them as JSON strings, while the JSON exports keep dictionary values.
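The serialization step described in the note could be handled by a small helper along these lines. The name flatten_for_graphml and the sample numbers are illustrative assumptions, not taken from the module.

```python
import json


def flatten_for_graphml(attrs):
    """Return a copy of `attrs` with nested values encoded as JSON strings."""
    flat = {}
    for key, value in attrs.items():
        if isinstance(value, (dict, list)):
            # sort_keys gives a stable string for identical dictionaries.
            flat[key] = json.dumps(value, sort_keys=True)
        else:
            flat[key] = value
    return flat


node_attrs = {
    "name": "Ubiquitin-Proteasome Pathway",
    "paper_count": 2,
    "citation_by_year": {2004: 10, 2005: 15},  # illustrative numbers
    "total_citations": 25,
}
flat_attrs = flatten_for_graphml(node_attrs)

# Reading the value back requires an explicit json.loads.
restored = json.loads(flat_attrs["citation_by_year"])
```

JSON object keys are always strings, so after a round trip the years come back as "2004" and "2005" rather than integers. Any consumer of the GraphML file needs to convert them back if it relies on numeric years.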
JSON Structure
{
  "nodes": [
    {
      "id": "laureate_779",
      "type": "Laureate",
      "name": "Aaron Ciechanover",
      "nationality": "Israeli",
      "birth_year": 1947
    },
    ...
  ],
  "edges": [
    {
      "source": "laureate_779",
      "target": "award_2004_3_779",
      "type": "WON_AWARD",
      "year": 2004
    },
    ...
  ]
}
Graph Statistics (Sample Mode)
| Metric | Value |
|---|---|
| Total Nodes | ~97 |
| Total Edges | ~181 |
| Laureates | 5 |
| Works | 25 |
| Concepts | 51 |
| Fields | 11 |
| Cross-disciplinary migrations | 11 |
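Counts like these can be recomputed from the exported JSON by tallying the "type" field of each node and edge. The sketch below assumes the nodes/edges layout shown under JSON Structure; the function name graph_statistics and the tiny sample are hypothetical.

```python
from collections import Counter


def graph_statistics(graph_data):
    """Count nodes and edges overall and by their "type" field."""
    node_types = Counter(node["type"] for node in graph_data["nodes"])
    edge_types = Counter(edge["type"] for edge in graph_data["edges"])
    return {
        "total_nodes": len(graph_data["nodes"]),
        "total_edges": len(graph_data["edges"]),
        "nodes_by_type": dict(node_types),
        "edges_by_type": dict(edge_types),
    }


# Tiny in-memory stand-in for the contents of knowledge_graph.json.
sample = {
    "nodes": [
        {"id": "laureate_779", "type": "Laureate"},
        {"id": "award_2004_3_779", "type": "Award"},
    ],
    "edges": [
        {"source": "laureate_779", "target": "award_2004_3_779",
         "type": "WON_AWARD", "year": 2004},
    ],
}
stats = graph_statistics(sample)
```

For a real run, the same function could be applied to the result of json.load on knowledge_graph.json. The "Cross-disciplinary migrations" row would then presumably correspond to the CROSS_INSPIRED entry in edges_by_type.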
Running
Concept Graph Schema
The Concept Graph is a simplified representation of the knowledge graph, focusing on concepts and their relationships. It is designed to highlight the flow of ideas and their connections across disciplines.
Schema Details
- Nodes:
  - id: Unique identifier for the concept.
  - name: Human-readable name of the concept.
  - paper_count: Number of papers associated with the concept.
  - total_citations: Total citations received by papers linked to the concept.
- Edges:
  - source: Source concept ID.
  - target: Target concept ID.
  - type: Relationship type (e.g., CONCEPT_CITES).
  - total_citations: Total citations between the connected concepts.
Construction Process
- Extract concepts from papers.
- Deduplicate concepts across papers.
- Establish relationships based on citations and shared concepts.
- Export the graph in JSON and GraphML formats.
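The deduplication and aggregation steps might be sketched as below. Several details are assumptions made for illustration: the helper name build_concept_graph, the paper record keys, using the concept name as its ID, pointing each CONCEPT_CITES edge from the citing concept to the cited one, and counting one citation per paper-to-paper reference in the citing paper's year.

```python
from collections import defaultdict


def build_concept_graph(papers):
    """Aggregate paper-level data into concept nodes and CONCEPT_CITES edges.

    `papers` maps paper ID -> dict with assumed keys: "concepts" (list of
    str), "citation_by_year" ({year: count}), "year" (int), and
    "referenced_works" (list of paper IDs).
    """
    nodes = {}
    edge_years = defaultdict(lambda: defaultdict(int))

    for paper in papers.values():
        concepts = set(paper["concepts"])  # dedupe within a paper
        for concept in concepts:
            node = nodes.setdefault(concept, {
                "id": concept,
                "name": concept,
                "paper_count": 0,
                "citation_by_year": defaultdict(int),
            })
            node["paper_count"] += 1
            for year, count in paper["citation_by_year"].items():
                node["citation_by_year"][year] += count

        for ref_id in paper["referenced_works"]:
            ref = papers.get(ref_id)
            if ref is None:
                continue  # cited paper is not in the dataset
            for src in concepts:
                for dst in set(ref["concepts"]):
                    if src != dst:  # skip self-loops on a shared concept
                        edge_years[(src, dst)][paper["year"]] += 1

    for node in nodes.values():
        node["citation_by_year"] = dict(node["citation_by_year"])
        node["total_citations"] = sum(node["citation_by_year"].values())

    edges = [
        {"source": src, "target": dst, "type": "CONCEPT_CITES",
         "citation_by_year": dict(by_year),
         "total_citations": sum(by_year.values())}
        for (src, dst), by_year in edge_years.items()
    ]
    return list(nodes.values()), edges


# Hypothetical papers with illustrative citation counts.
papers = {
    "p1": {"concepts": ["Transformer"], "year": 2017,
           "citation_by_year": {2018: 5, 2019: 7}, "referenced_works": []},
    "p2": {"concepts": ["Transformer", "AlphaFold"], "year": 2018,
           "citation_by_year": {2019: 3}, "referenced_works": ["p1"]},
}
concept_nodes, concept_edges = build_concept_graph(papers)
```

With this input, "Transformer" is linked to two papers and accumulates their yearly citations, while the single reference from p2 to p1 yields one CONCEPT_CITES edge from "AlphaFold" to "Transformer".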