
Phase 3+4: Knowledge Graph Construction

Module: src/graph_builder.py
Estimated Time: ~2 days

Objective

Combine all processed data (laureates, awards, publications, concepts) into a NetworkX directed graph with 5 node types and 9 edge types, then export it to JSON and GraphML.

Node Types

flowchart LR
    L[🏅 Laureate] --- A[🏆 Award]
    L --- W[📄 Work]
    W --- C[💡 Concept]
    C --- F[🔬 Field]

| Node Type | Example | Key Attributes |
|-----------|---------|----------------|
| Laureate | "Aaron Ciechanover" | name, nationality, birth_year, gender |
| Award | "Chemistry 2004" | year, category, motivation, prize_amount |
| Work | Paper on ubiquitin | title, year, abstract, keywords, citation_count, doi |
| Concept | "Ubiquitin-Proteasome Pathway" | name, description, field, subfield, first_appeared_year |
| Field | "Biology" | name, parent_field, description |

Edge Types

| Edge Type | Source → Target | Attributes | Semantics |
|-----------|-----------------|------------|-----------|
| WON_AWARD | Laureate → Award | year, portion | Laureate received the award |
| AUTHORED | Laureate → Work | position | Laureate authored the paper |
| CITES | Work → Work | none | Paper citation |
| INTRODUCES | Work → Concept | confidence | Paper introduced/proposed the concept |
| APPLIES | Work → Concept | confidence | Paper applied/used the concept |
| BELONGS_TO | Concept → Field | none | Concept belongs to a scientific field |
| DERIVED_FROM | Concept → Concept | year, description | Concept evolved from another (same field) |
| CROSS_INSPIRED | Concept → Concept | year, source_field, target_field, description | Cross-disciplinary migration |
| AWARDED_FOR | Award → Concept | none | Award recognized this concept |
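
As a minimal sketch of how the node and edge types above could be represented, the snippet below stores the type in a `type` attribute on each NetworkX node and edge, mirroring the JSON export. The IDs `concept_upp` and `field_biology`, the concept-to-field link, and the helper `edges_of_type` are illustrative assumptions, not names taken from `src/graph_builder.py`.

```python
import networkx as nx

G = nx.DiGraph()

# Nodes carry their type as a plain attribute.
G.add_node("laureate_779", type="Laureate", name="Aaron Ciechanover",
           nationality="Israeli", birth_year=1947)
G.add_node("award_2004_3_779", type="Award", year=2004, category="Chemistry")
G.add_node("concept_upp", type="Concept",            # hypothetical ID
           name="Ubiquitin-Proteasome Pathway")
G.add_node("field_biology", type="Field", name="Biology")  # hypothetical ID

# Edges carry their type and any type-specific attributes.
G.add_edge("laureate_779", "award_2004_3_779", type="WON_AWARD", year=2004)
G.add_edge("award_2004_3_779", "concept_upp", type="AWARDED_FOR")
G.add_edge("concept_upp", "field_biology", type="BELONGS_TO")  # illustrative link


def edges_of_type(graph, edge_type):
    """Return (source, target) pairs for every edge of the given type."""
    return [(u, v) for u, v, data in graph.edges(data=True)
            if data.get("type") == edge_type]


print(edges_of_type(G, "WON_AWARD"))
```

One design point to keep in mind: a plain `nx.DiGraph` holds at most one edge per ordered node pair, so adding both INTRODUCES and APPLIES between the same Work and Concept would leave only the later attributes. `nx.MultiDiGraph` is one option if parallel edges are needed.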

The CROSS_INSPIRED Edge

This is the most valuable edge type in the graph, representing cross-disciplinary concept migration:

Optimization Theory   --CROSS_INSPIRED--> SGD               (Math → AI, ~1960s)
Transformer           --CROSS_INSPIRED--> AlphaFold         (AI → Structural Biology, 2018)
X-ray Diffraction     --CROSS_INSPIRED--> DNA Double Helix  (Physics → Molecular Biology, 1953)
Statistical Mechanics --CROSS_INSPIRED--> Boltzmann Machine (Physics → ML, 1985)

Construction Process

flowchart TD
    L[Load laureates.parquet] --> G[Initialize Graph]
    A[Load awards.parquet] --> G
    P[Load publications.parquet] --> G
    C[Load concepts.parquet] --> G

    G --> N1[Add Laureate Nodes]
    N1 --> N2[Add Award Nodes + WON_AWARD Edges]
    N2 --> N3[Add Work Nodes + AUTHORED Edges]
    N3 --> N4[Add Concept Nodes + INTRODUCES/APPLIES Edges]
    N4 --> N5[Add Field Nodes + BELONGS_TO Edges]
    N5 --> E1[Build CROSS_INSPIRED Edges]
    E1 --> E2[Build CITES Edges]
    E2 --> E3[Build AWARDED_FOR Edges]
    E3 --> EX[Export to JSON + GraphML]

Cross-Discipline Detection Algorithm

for each paper W:
    W_concepts = concepts associated with W
    W_field = primary field of W
    for each ref in W.referenced_works:
        ref_concepts = concepts associated with ref
        ref_field = primary field of ref
        if W_field ≠ ref_field:
            shared_concepts = W_concepts ∩ ref_concepts
            if shared_concepts is not empty:
                → Generate CROSS_INSPIRED edge
                → Record: concept migrated from ref_field to W_field
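
The pseudocode above can be sketched in plain Python as follows. The input layout (a dict of paper records with `field`, `concepts`, `referenced_works`, and `year` keys), the function name `detect_cross_inspired`, and the sample papers are assumptions made for illustration. The sketch only records each migration; how a record is mapped onto the Concept → Concept endpoints of the final edge is not specified here.

```python
def detect_cross_inspired(papers):
    """Return one migration record per shared concept on a cross-field citation.

    papers: paper_id -> {"field": str, "concepts": set, "referenced_works": list, "year": int}
    """
    records = []
    for paper in papers.values():
        for ref_id in paper["referenced_works"]:
            ref = papers.get(ref_id)
            if ref is None:
                continue  # cited work is outside the dataset
            if paper["field"] == ref["field"]:
                continue  # same field: not a cross-disciplinary citation
            for concept in sorted(paper["concepts"] & ref["concepts"]):
                records.append({
                    "type": "CROSS_INSPIRED",
                    "concept": concept,
                    "source_field": ref["field"],    # where the concept came from
                    "target_field": paper["field"],  # where it was picked up
                    "year": paper["year"],
                })
    return records


# Hypothetical sample data, not taken from the real dataset.
papers = {
    "w1": {"field": "Physics", "concepts": {"Statistical Mechanics"},
           "referenced_works": [], "year": 1970},
    "w2": {"field": "ML", "concepts": {"Statistical Mechanics", "Boltzmann Machine"},
           "referenced_works": ["w1", "w_missing"], "year": 1985},
}
migrations = detect_cross_inspired(papers)
print(migrations)
```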

Output Files

| File | Format | Description |
|------|--------|-------------|
| knowledge_graph.json | JSON | Full graph with nodes and edges |
| nodes.json | JSON | Nodes only |
| edges.json | JSON | Edges only |
| knowledge_graph.graphml | GraphML | For Gephi, Cytoscape, etc. |
| concept_graph_simplified.json | JSON | Simplified concept-only graph |
| concept_graph_simplified_nodes.json | JSON | Simplified graph nodes |
| concept_graph_simplified_edges.json | JSON | Simplified graph edges |
| concept_graph_simplified.graphml | GraphML | Simplified graph (complex attrs serialized as JSON strings) |

Simplified Concept Graph Schema (New)

  • Contains only Concept nodes (deduplicated across papers)
  • Node attributes include:
      • citation_by_year: {year: citation_count}
      • paper_count: number of linked papers
      • total_citations: sum over all years
  • Edge type is CONCEPT_CITES, with:
      • citation_by_year: {year: citation_count}
      • total_citations: sum over all years

Note: citation_by_year is a dictionary. GraphML cannot store nested structures directly, so the .graphml export serializes such values as JSON strings, while the JSON exports keep them as dictionaries.
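
The serialization described in the note could look roughly like this. It is a sketch using only the standard `json` module; the helper names `flatten_for_graphml` and `restore_citation_by_year` are hypothetical. Note that JSON object keys are always strings, so year keys need converting back to integers when reading.

```python
import json


def flatten_for_graphml(attrs):
    """Return a copy of attrs with dict/list values encoded as JSON strings."""
    return {
        key: json.dumps(value, sort_keys=True) if isinstance(value, (dict, list)) else value
        for key, value in attrs.items()
    }


def restore_citation_by_year(encoded):
    """Decode a JSON string and turn its year keys back into integers."""
    return {int(year): count for year, count in json.loads(encoded).items()}


node = {
    "name": "Quantum Mechanics",
    "paper_count": 2,
    "citation_by_year": {2019: 3, 2020: 5},
}
flat = flatten_for_graphml(node)
print(flat["citation_by_year"])
print(restore_citation_by_year(flat["citation_by_year"]))
```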

JSON Structure

{
  "nodes": [
    {
      "id": "laureate_779",
      "type": "Laureate",
      "name": "Aaron Ciechanover",
      "nationality": "Israeli",
      "birth_year": 1947
    },
    ...
  ],
  "edges": [
    {
      "source": "laureate_779",
      "target": "award_2004_3_779",
      "type": "WON_AWARD",
      "year": 2004
    },
    ...
  ]
}
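
A quick way to sanity-check an export with this structure is to count nodes and edges by type and look for edges whose endpoints are missing from the node list. The sketch below assumes only the `nodes`/`edges` layout shown above; `summarize_graph` is a hypothetical helper and the inline sample stands in for the contents of knowledge_graph.json.

```python
import json
from collections import Counter


def summarize_graph(doc):
    """Count nodes and edges by type and list edges with unknown endpoints."""
    node_ids = {node["id"] for node in doc["nodes"]}
    node_counts = Counter(node["type"] for node in doc["nodes"])
    edge_counts = Counter(edge["type"] for edge in doc["edges"])
    dangling = [
        edge for edge in doc["edges"]
        if edge["source"] not in node_ids or edge["target"] not in node_ids
    ]
    return node_counts, edge_counts, dangling


# Inline sample in the same shape as the export.
sample = json.loads("""
{
  "nodes": [
    {"id": "laureate_779", "type": "Laureate", "name": "Aaron Ciechanover"},
    {"id": "award_2004_3_779", "type": "Award", "year": 2004}
  ],
  "edges": [
    {"source": "laureate_779", "target": "award_2004_3_779",
     "type": "WON_AWARD", "year": 2004}
  ]
}
""")
node_counts, edge_counts, dangling = summarize_graph(sample)
print(dict(node_counts), dict(edge_counts), dangling)
```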

Graph Statistics (Sample Mode)

| Metric | Value |
|--------|-------|
| Total Nodes | ~97 |
| Total Edges | ~181 |
| Laureates | 5 |
| Works | 25 |
| Concepts | 51 |
| Fields | 11 |
| Cross-disciplinary migrations | 11 |

Running

# Via pipeline
uv run python main.py --phase 3

# Standalone
uv run python -m src.graph_builder

Concept Graph Schema

The Concept Graph is a simplified representation of the knowledge graph, focusing on concepts and their relationships. It is designed to highlight the flow of ideas and their connections across disciplines.

Schema Details

  • Nodes:
      • id: Unique identifier for the concept.
      • name: Human-readable name of the concept.
      • paper_count: Number of papers associated with the concept.
      • total_citations: Total citations received by papers linked to the concept.
  • Edges:
      • source: Source concept ID.
      • target: Target concept ID.
      • type: Relationship type (e.g., CONCEPT_CITES).
      • total_citations: Total citations between the connected concepts.

Construction Process

  1. Extract concepts from papers.
  2. Deduplicate concepts across papers.
  3. Establish relationships based on citations and shared concepts.
  4. Export the graph in JSON and GraphML formats.
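
The steps above might be sketched as follows, using plain dictionaries. Several things here are assumptions rather than documented behavior: the input layout, using the concept name as the node ID, and the rule that each paper-to-paper citation adds one CONCEPT_CITES count (keyed by the citing paper's year) from every concept of the citing paper to every different concept of the cited paper.

```python
from collections import defaultdict


def build_concept_graph(papers):
    """Aggregate paper-level data into concept nodes and CONCEPT_CITES edges.

    papers: paper_id -> {"concepts": set, "citation_by_year": {year: count},
                         "referenced_works": list, "year": int}
    """
    # Steps 1-2: collect concepts and deduplicate them across papers.
    nodes = {}
    for paper in papers.values():
        for concept in paper["concepts"]:
            node = nodes.setdefault(
                concept, {"id": concept, "paper_count": 0, "citation_by_year": {}})
            node["paper_count"] += 1
            for year, count in paper["citation_by_year"].items():
                node["citation_by_year"][year] = node["citation_by_year"].get(year, 0) + count
    for node in nodes.values():
        node["total_citations"] = sum(node["citation_by_year"].values())

    # Step 3: derive concept-to-concept edges from paper citations (assumed rule).
    counts = defaultdict(lambda: defaultdict(int))
    for paper in papers.values():
        for ref_id in paper["referenced_works"]:
            ref = papers.get(ref_id)
            if ref is None:
                continue  # cited work is outside the dataset
            for src in paper["concepts"]:
                for dst in ref["concepts"]:
                    if src != dst:
                        counts[(src, dst)][paper["year"]] += 1
    edges = [
        {"source": src, "target": dst, "type": "CONCEPT_CITES",
         "citation_by_year": dict(by_year), "total_citations": sum(by_year.values())}
        for (src, dst), by_year in sorted(counts.items())
    ]
    return nodes, edges


# Hypothetical sample data.
papers = {
    "p1": {"concepts": {"A"}, "citation_by_year": {2020: 2, 2021: 1},
           "referenced_works": [], "year": 2019},
    "p2": {"concepts": {"A", "B"}, "citation_by_year": {2021: 4},
           "referenced_works": ["p1"], "year": 2021},
}
nodes, edges = build_concept_graph(papers)
print(nodes["A"], edges)
```

Step 4 (export) is omitted; the resulting `nodes` and `edges` are plain dictionaries and lists, so they can be written with `json.dump` as they are.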

Example JSON Structure

{
  "nodes": [
    {
      "id": "concept_1",
      "name": "Quantum Mechanics",
      "paper_count": 120,
      "total_citations": 4500
    }
  ],
  "edges": [
    {
      "source": "concept_1",
      "target": "concept_2",
      "type": "CONCEPT_CITES",
      "total_citations": 300
    }
  ]
}