Phase 3+4: Knowledge Graph Construction

Module: src/graph_builder.py
Estimated Time: ~2 days

Objective

Combine all processed data (laureates, awards, publications, concepts) into a NetworkX directed graph with 5 node types and 9 edge types, then export it to JSON and GraphML.

Node Types

flowchart LR
    L[🏅 Laureate] --- A[🏆 Award]
    L --- W[📄 Work]
    W --- C[💡 Concept]
    C --- F[🔬 Field]

Node Type	Example	Key Attributes
Laureate	"Aaron Ciechanover"	name, nationality, birth_year, gender
Award	"Chemistry 2004"	year, category, motivation, prize_amount
Work	Paper on ubiquitin	title, year, abstract, keywords, citation_count, doi
Concept	"Ubiquitin-Proteasome Pathway"	name, description, field, subfield, first_appeared_year
Field	"Biology"	name, parent_field, description

Edge Types

Edge Type	Source → Target	Attributes	Semantics
`WON_AWARD`	Laureate → Award	year, portion	Laureate received the award
`AUTHORED`	Laureate → Work	position	Laureate authored the paper
`CITES`	Work → Work	—	Paper citation
`INTRODUCES`	Work → Concept	confidence	Paper introduced/proposed the concept
`APPLIES`	Work → Concept	confidence	Paper applied/used the concept
`BELONGS_TO`	Concept → Field	—	Concept belongs to a scientific field
`DERIVED_FROM`	Concept → Concept	year, description	Concept evolved from another (same field)
`CROSS_INSPIRED`	Concept → Concept	year, source_field, target_field, description	Cross-disciplinary migration
`AWARDED_FOR`	Award → Concept	—	Award recognized this concept
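
The edge table above doubles as a schema: each edge type fixes the node type at both ends. A minimal sketch of checking an edge against it follows. `EDGE_SCHEMA` and `edge_is_valid` are hypothetical names used for illustration and are not taken from `src/graph_builder.py`.

```python
# Allowed (source node type, target node type) per edge type,
# transcribed from the Edge Types table. Names here are illustrative.
EDGE_SCHEMA = {
    "WON_AWARD": ("Laureate", "Award"),
    "AUTHORED": ("Laureate", "Work"),
    "CITES": ("Work", "Work"),
    "INTRODUCES": ("Work", "Concept"),
    "APPLIES": ("Work", "Concept"),
    "BELONGS_TO": ("Concept", "Field"),
    "DERIVED_FROM": ("Concept", "Concept"),
    "CROSS_INSPIRED": ("Concept", "Concept"),
    "AWARDED_FOR": ("Award", "Concept"),
}


def edge_is_valid(edge_type, source_type, target_type):
    """Return True if the edge type may connect these two node types."""
    # Unknown edge types map to None, which never equals a tuple.
    return EDGE_SCHEMA.get(edge_type) == (source_type, target_type)


print(edge_is_valid("WON_AWARD", "Laureate", "Award"))  # True
print(edge_is_valid("WON_AWARD", "Award", "Laureate"))  # False (wrong direction)
```

A check like this could run before each edge is added, so that a reversed or mistyped edge fails early instead of surfacing in the exported files.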

The CROSS_INSPIRED Edge

This is the most valuable edge type in the graph. It records cross-disciplinary concept migration, where a concept that originated in one field is taken up in another:

Optimization Theory --CROSS_INSPIRED--> SGD           (Math → AI, ~1960s)
Transformer        --CROSS_INSPIRED--> AlphaFold 2    (AI → Structural Biology, 2020)
X-ray Diffraction  --CROSS_INSPIRED--> DNA Double Helix (Physics → Molecular Biology, 1953)
Statistical Mechanics --CROSS_INSPIRED--> Boltzmann Machine (Physics → ML, 1985)

Construction Process

flowchart TD
    L[Load laureates.parquet] --> G[Initialize Graph]
    A[Load awards.parquet] --> G
    P[Load publications.parquet] --> G
    C[Load concepts.parquet] --> G

    G --> N1[Add Laureate Nodes]
    N1 --> N2[Add Award Nodes + WON_AWARD Edges]
    N2 --> N3[Add Work Nodes + AUTHORED Edges]
    N3 --> N4[Add Concept Nodes + INTRODUCES/APPLIES Edges]
    N4 --> N5[Add Field Nodes + BELONGS_TO Edges]
    N5 --> E1[Build CROSS_INSPIRED Edges]
    E1 --> E2[Build CITES Edges]
    E2 --> E3[Build AWARDED_FOR Edges]
    E3 --> EX[Export to JSON + GraphML]

Cross-Discipline Detection Algorithm

for each paper W:
    W_concepts = concepts associated with W
    W_field = primary field of W
    for ref in W.referenced_works:
        ref_field = primary field of ref
        if W_field ≠ ref_field:
            ref_concepts = concepts associated with ref
            shared_concepts = W_concepts ∩ ref_concepts
            if shared_concepts is not empty:
                → Generate CROSS_INSPIRED edge
                → Record: concept migrated from ref_field to W_field
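
The detection loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in `src/graph_builder.py`: the input shape (a dict of papers with `field`, `year`, `concepts`, and `referenced_works` keys), the function name `find_cross_inspired`, and the choice to date a migration by the citing paper's year are all hypothetical. The sample papers are made up.

```python
def find_cross_inspired(papers):
    """Yield (concept, source_field, target_field, year) migration records.

    `papers` maps a paper id to a dict with hypothetical keys:
    "field" (str), "year" (int), "concepts" (set of str),
    and "referenced_works" (list of paper ids).
    """
    for paper in papers.values():
        for ref_id in paper["referenced_works"]:
            ref = papers.get(ref_id)
            if ref is None:
                continue  # the reference falls outside the dataset
            if paper["field"] == ref["field"]:
                continue  # same field: not a cross-disciplinary link
            # Concepts shared by the citing and the cited paper
            for concept in sorted(paper["concepts"] & ref["concepts"]):
                # The concept migrated from the cited paper's field
                # to the citing paper's field (assumed year: citing paper's).
                yield (concept, ref["field"], paper["field"], paper["year"])


# Made-up sample data for illustration only.
papers = {
    "w1": {"field": "Physics", "year": 1985,
           "concepts": {"Energy-Based Model"},
           "referenced_works": []},
    "w2": {"field": "Computer Science", "year": 1990,
           "concepts": {"Energy-Based Model", "Backpropagation"},
           "referenced_works": ["w1", "w_missing"]},
}
migrations = list(find_cross_inspired(papers))
print(migrations)
# [('Energy-Based Model', 'Physics', 'Computer Science', 1990)]
```

Emitting plain records keeps the detection step separate from graph mutation, so the records can be inspected or filtered before any CROSS_INSPIRED edge is written.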

Output Files

File	Format	Description
`knowledge_graph.json`	JSON	Full graph with nodes and edges
`nodes.json`	JSON	Nodes only
`edges.json`	JSON	Edges only
`knowledge_graph.graphml`	GraphML	For Gephi, Cytoscape, etc.
`concept_graph_simplified.json`	JSON	Simplified concept-only graph
`concept_graph_simplified_nodes.json`	JSON	Simplified graph nodes
`concept_graph_simplified_edges.json`	JSON	Simplified graph edges
`concept_graph_simplified.graphml`	GraphML	Simplified graph (complex attrs serialized as JSON strings)

Simplified Concept Graph Schema (New)

Contains only Concept nodes (deduplicated across papers).
Node attributes include:
    citation_by_year: {year: citation_count}
    paper_count: number of linked papers
    total_citations: sum of citation_by_year over all years
Edge type is CONCEPT_CITES, with:
    citation_by_year: {year: citation_count}
    total_citations: sum of citation_by_year over all years

Note: citation_by_year is a dictionary. GraphML cannot store nested structures directly, so the .graphml export serializes such values as JSON strings, while the JSON exports keep them as dictionaries.
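
One way to do the serialization described in the note is to convert nested values to JSON strings just before the GraphML export. The sketch below assumes a helper named `graphml_safe`, which is hypothetical, and uses made-up citation counts.

```python
import json


def graphml_safe(attrs):
    """Return a copy of `attrs` with dict/list values encoded as JSON strings."""
    safe = {}
    for key, value in attrs.items():
        if isinstance(value, (dict, list)):
            # sort_keys keeps the serialized form stable across runs
            safe[key] = json.dumps(value, sort_keys=True)
        else:
            safe[key] = value
    return safe


# Made-up attribute values for illustration only.
node = {
    "name": "Ubiquitin-Proteasome Pathway",
    "citation_by_year": {2004: 12, 2005: 30},
    "total_citations": 42,
}
flat = graphml_safe(node)
print(flat["citation_by_year"])  # {"2004": 12, "2005": 30}

# Reading the value back: JSON object keys are always strings,
# so the years must be converted to int again.
restored = {int(y): n for y, n in json.loads(flat["citation_by_year"]).items()}
```

Note the round-trip detail: integer year keys come back as strings after `json.loads`, so any consumer of the GraphML file has to convert them explicitly.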

JSON Structure

{
  "nodes": [
    {
      "id": "laureate_779",
      "type": "Laureate",
      "name": "Aaron Ciechanover",
      "nationality": "Israeli",
      "birth_year": 1947
    },
    ...
  ],
  "edges": [
    {
      "source": "laureate_779",
      "target": "award_2004_3_779",
      "type": "WON_AWARD",
      "year": 2004
    },
    ...
  ]
}
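
A consumer of this file might index nodes by `id`, count nodes and edges by `type`, and check that every edge endpoint exists. The sketch below is one possible way to do that. It parses an inline string that stands in for `knowledge_graph.json`, trimmed to a single node and edge, so the award node is deliberately absent.

```python
import json
from collections import Counter

# Inline stand-in for knowledge_graph.json, trimmed to one node and one edge.
raw = """
{
  "nodes": [
    {"id": "laureate_779", "type": "Laureate", "name": "Aaron Ciechanover"}
  ],
  "edges": [
    {"source": "laureate_779", "target": "award_2004_3_779",
     "type": "WON_AWARD", "year": 2004}
  ]
}
"""
graph = json.loads(raw)

nodes_by_id = {node["id"]: node for node in graph["nodes"]}
node_counts = Counter(node["type"] for node in graph["nodes"])
edge_counts = Counter(edge["type"] for edge in graph["edges"])

# Edges whose source or target is missing from the node list
dangling = [
    edge for edge in graph["edges"]
    if edge["source"] not in nodes_by_id or edge["target"] not in nodes_by_id
]

print(dict(node_counts))  # {'Laureate': 1}
print(dict(edge_counts))  # {'WON_AWARD': 1}
print(len(dangling))      # 1, because the award node was trimmed from this sample
```

On a full export the dangling list should be empty. A non-empty result would point to an edge that was written before its endpoint node.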

Graph Statistics (Sample Mode)

Metric	Value
Total Nodes	~97
Total Edges	~181
Laureates	5
Works	25
Concepts	51
Fields	11
Cross-disciplinary migrations	11

Running

# Via pipeline
uv run python main.py --phase 3

# Standalone
uv run python -m src.graph_builder

Concept Graph Schema

The Concept Graph is a simplified representation of the knowledge graph, focusing on concepts and their relationships. It is designed to highlight the flow of ideas and their connections across disciplines.

Schema Details

Nodes:
    id: Unique identifier for the concept.
    name: Human-readable name of the concept.
    paper_count: Number of papers associated with the concept.
    total_citations: Total citations received by papers linked to the concept.
Edges:
    source: Source concept ID.
    target: Target concept ID.
    type: Relationship type (e.g., CONCEPT_CITES).
    total_citations: Total citations between the connected concepts.

Construction Process

1. Extract concepts from papers.
2. Deduplicate concepts across papers.
3. Establish relationships based on citations and shared concepts.
4. Export the graph in JSON and GraphML formats.
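
The relationship step can be sketched as an aggregation of paper-level citations into CONCEPT_CITES edges. The counting rule used here is an assumption, not confirmed by the source: each paper-to-paper citation adds one count, in its year, to every (citing concept, cited concept) pair, and self-loops are skipped. The function name `build_concept_edges` and both input shapes are hypothetical, and the sample data is made up.

```python
from collections import defaultdict


def build_concept_edges(paper_concepts, citations):
    """Aggregate paper-level citations into CONCEPT_CITES edges.

    paper_concepts: {paper_id: set of concept ids}      (hypothetical shape)
    citations: iterable of (citing_id, cited_id, year)  (hypothetical shape)
    """
    by_year = defaultdict(lambda: defaultdict(int))
    for citing, cited, year in citations:
        for src in paper_concepts.get(citing, ()):
            for dst in paper_concepts.get(cited, ()):
                if src != dst:  # assumption: skip concept self-loops
                    by_year[(src, dst)][year] += 1
    return [
        {
            "source": src,
            "target": dst,
            "type": "CONCEPT_CITES",
            "citation_by_year": dict(years),
            "total_citations": sum(years.values()),
        }
        for (src, dst), years in sorted(by_year.items())
    ]


# Made-up sample: three papers on concept_1 cite one paper on concept_2.
paper_concepts = {
    "p1": {"concept_1"},
    "p2": {"concept_2"},
    "p3": {"concept_1"},
    "p4": {"concept_1"},
}
citations = [("p1", "p2", 2001), ("p3", "p2", 2001), ("p4", "p2", 2003)]
edges = build_concept_edges(paper_concepts, citations)
print(edges[0]["citation_by_year"], edges[0]["total_citations"])
# {2001: 2, 2003: 1} 3
```

Because concepts are keyed by id, papers that share a concept collapse onto one node, which covers the deduplication step as a side effect of the aggregation.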

Example JSON Structure

{
  "nodes": [
    {
      "id": "concept_1",
      "name": "Quantum Mechanics",
      "paper_count": 120,
      "total_citations": 4500
    }
  ],
  "edges": [
    {
      "source": "concept_1",
      "target": "concept_2",
      "type": "CONCEPT_CITES",
      "total_citations": 300
    }
  ]
}