Configuration

The project uses a YAML configuration file (config/settings.yaml) combined with environment variables for secrets.

Environment Variables

Create a .env file in the project root:

# Required for Phase 2b (LLM concept extraction)
OPENAI_API_KEY=sk-your-api-key-here

# Optional: custom OpenAI-compatible endpoint
OPENAI_BASE_URL=https://api.openai.com/v1
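
As a rough illustration of how these variables could be picked up at startup, the sketch below parses a .env file using only the standard library. The helper names `load_env_file` and `apply_env` are hypothetical and not taken from the project, which may rely on a dedicated library instead.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file (hypothetical helper)."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: dict[str, str]) -> None:
    """Export parsed values without overriding variables that are already set."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Using `setdefault` means a variable exported in the shell takes precedence over the file, which is a common convention but an assumption here.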

Configuration File

The main configuration file is config/settings.yaml. Each of its sections is described below.

LLM Configuration

llm:
  provider: openai                  # openai | anthropic | ollama
  model: gpt-4o-mini                # Concept extraction (cost-efficient)
  model_heavy: gpt-4o               # Relation validation / insight generation
  temperature: 0.2                  # Low temperature for consistency
  max_tokens: 4096

| Parameter | Type | Default | Description |
|---|---|---|---|
| provider | string | openai | LLM provider |
| model | string | gpt-4o-mini | Model for concept extraction |
| model_heavy | string | gpt-4o | Model for complex reasoning tasks |
| temperature | float | 0.2 | Sampling temperature |
| max_tokens | int | 4096 | Maximum response tokens |
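
One way to make these parameters harder to misconfigure is a small typed wrapper. The following is a minimal sketch, assuming the `llm` section has already been parsed into a dictionary. `LLMConfig` and `llm_config_from_dict` are hypothetical names, and the validation rules are illustrative rather than the project's own.

```python
from dataclasses import dataclass

# Provider names taken from the comment in the YAML snippet above.
KNOWN_PROVIDERS = {"openai", "anthropic", "ollama"}


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical typed view of the `llm` section, using the documented defaults."""

    provider: str = "openai"
    model: str = "gpt-4o-mini"
    model_heavy: str = "gpt-4o"
    temperature: float = 0.2
    max_tokens: int = 4096

    def __post_init__(self) -> None:
        # Illustrative sanity checks; the real project may validate differently.
        if self.provider not in KNOWN_PROVIDERS:
            raise ValueError(f"unknown provider: {self.provider!r}")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")


def llm_config_from_dict(section: dict) -> LLMConfig:
    """Build an LLMConfig from a parsed mapping; missing keys fall back to defaults."""
    return LLMConfig(**section)
```

An unexpected key in the mapping raises a `TypeError` from the dataclass constructor, which surfaces typos in the YAML early.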

OpenAlex API Configuration

openalex:
  base_url: https://api.openalex.org
  email: user@example.com           # Contact email for polite pool access
  rate_limit: 10                    # Requests per second
  cache_dir: output/openalex_cache  # API response cache directory

| Parameter | Type | Default | Description |
|---|---|---|---|
| base_url | string | https://api.openalex.org | OpenAlex API base URL |
| email | string | | Email for polite pool access (10 req/s) |
| rate_limit | int | 10 | Max requests per second |
| cache_dir | string | output/openalex_cache | Directory for cached API responses |
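
To show how `base_url`, `email`, and `rate_limit` might fit together, here is a minimal sketch using only the standard library. It assumes the email is passed through the `mailto` query parameter that OpenAlex documents for the polite pool. `build_openalex_url` and `Throttle` are hypothetical names, not part of the project.

```python
import time
from urllib.parse import urlencode


def build_openalex_url(base_url: str, endpoint: str, email: str = "", **params: str) -> str:
    """Build a request URL, adding `mailto` when an email is configured."""
    query = dict(params)
    if email:
        query["mailto"] = email
    url = f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"
    return f"{url}?{urlencode(query)}" if query else url


class Throttle:
    """Space calls so that at most `rate_limit` of them start per second."""

    def __init__(self, rate_limit: int) -> None:
        self.min_interval = 1.0 / rate_limit
        self._last_call = None  # No call has been made yet.

    def wait(self) -> None:
        # Sleep just long enough to keep the configured spacing between calls.
        if self._last_call is not None:
            elapsed = time.monotonic() - self._last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()
```

Calling `Throttle(10).wait()` before each request keeps consecutive requests at least 0.1 seconds apart, which corresponds to the default `rate_limit` of 10.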

Data Paths

data:
  raw_dir: data/26963326/db_data
  json_dir: data/26963326/json
  publication_records: data/26963326/json/publication_records.json
  laureate_info: data/26963326/json/laureate_info.json
  award_details: data/26963326/json/award_details.json

All paths are relative to the project root and are automatically resolved to absolute paths at runtime.
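
The resolution step could look roughly like the sketch below, which uses `pathlib` from the standard library. The function name `resolve_paths` and the way the project root is supplied are assumptions made for illustration.

```python
from pathlib import Path


def resolve_paths(section: dict[str, str], project_root: Path) -> dict[str, Path]:
    """Turn each relative path in a config section into an absolute Path."""
    root = project_root.resolve()
    resolved: dict[str, Path] = {}
    for key, value in section.items():
        path = Path(value)
        # Already-absolute paths are kept; relative ones are anchored at the root.
        resolved[key] = path if path.is_absolute() else (root / path).resolve()
    return resolved
```

For example, passing the `data` section together with the repository directory yields absolute paths such as `<project root>/data/26963326/json`.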

Output Paths

output:
  clean_data: output/clean_data
  concepts: output/concepts
  graph: output/graph
  viz: output/viz
  reports: output/reports

Graph Storage

graph_store:
  format: json                     # json | neo4j (optional)
  nodes_file: output/graph/nodes.json
  edges_file: output/graph/edges.json
  full_graph_file: output/graph/knowledge_graph.json

Concept Extraction Strategy

concept_extraction:
  tier1_max_papers: 1000           # Max papers for LLM extraction
  tier1_min_citations: 500         # Tier 1 citation threshold
  tier2_min_citations: 50          # Tier 2 citation threshold
  batch_size: 20                   # LLM batch processing size
  confidence_threshold: 0.7        # Concept confidence filter

| Parameter | Type | Default | Description |
|---|---|---|---|
| tier1_max_papers | int | 1000 | Maximum papers for deep LLM extraction |
| tier1_min_citations | int | 500 | Minimum citation count for Tier 1 |
| tier2_min_citations | int | 50 | Minimum citation count for Tier 2 |
| batch_size | int | 20 | Papers processed per LLM batch |
| confidence_threshold | float | 0.7 | Minimum confidence for concept inclusion |
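
The sketch below shows one plausible reading of how these thresholds interact. It is not the project's implementation. It assumes each paper is a dictionary with a `citations` field, each concept has a `confidence` field, and papers that exceed the Tier 1 cap fall through to Tier 2. All of these, along with the function names, are hypothetical.

```python
from typing import Iterator


def assign_tiers(
    papers: list[dict],
    tier1_max_papers: int = 1000,
    tier1_min_citations: int = 500,
    tier2_min_citations: int = 50,
) -> dict[str, list[dict]]:
    """Split papers into tiers by citation count, most-cited first."""
    ranked = sorted(papers, key=lambda p: p["citations"], reverse=True)
    tiers: dict[str, list[dict]] = {"tier1": [], "tier2": [], "rest": []}
    for paper in ranked:
        cites = paper["citations"]
        if cites >= tier1_min_citations and len(tiers["tier1"]) < tier1_max_papers:
            tiers["tier1"].append(paper)
        elif cites >= tier2_min_citations:
            # Assumption: papers beyond the Tier 1 cap land here as well.
            tiers["tier2"].append(paper)
        else:
            tiers["rest"].append(paper)
    return tiers


def batched(items: list, batch_size: int = 20) -> Iterator[list]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def keep_confident(concepts: list[dict], confidence_threshold: float = 0.7) -> list[dict]:
    """Drop extracted concepts whose confidence is below the threshold."""
    return [c for c in concepts if c["confidence"] >= confidence_threshold]
```

With the defaults, Tier 1 papers would be sent to the LLM in batches of 20, and only concepts with confidence of at least 0.7 would be kept.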

Cross-Discipline Detection

cross_discipline:
  min_confidence: 0.6              # Minimum confidence for cross-field migration
  llm_verify: true                 # Use LLM to verify candidate migrations
  verify_sample_size: 200          # Number of candidates for LLM verification
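
# A hedged sketch of how these three settings might combine (illustrative only):

```python
import random


def select_for_verification(
    candidates: list[dict],
    min_confidence: float = 0.6,
    llm_verify: bool = True,
    verify_sample_size: int = 200,
    seed: int = 0,
) -> tuple[list[dict], list[dict]]:
    """Return (kept, to_verify) for candidate cross-field migrations.

    Assumptions: each candidate is a dict with a `confidence` field, and the
    verification subset is drawn uniformly at random. The project may select
    candidates differently; this function name is hypothetical.
    """
    kept = [c for c in candidates if c["confidence"] >= min_confidence]
    if not llm_verify:
        return kept, []
    sample_size = min(verify_sample_size, len(kept))
    # A fixed seed keeps the sample reproducible between runs.
    to_verify = random.Random(seed).sample(kept, sample_size)
    return kept, to_verify
```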

Full Configuration Example

Complete config/settings.yaml
llm:
  provider: openai
  model: gpt-4o-mini
  model_heavy: gpt-4o
  temperature: 0.2
  max_tokens: 4096

openalex:
  base_url: https://api.openalex.org
  email: user@example.com
  rate_limit: 10
  cache_dir: output/openalex_cache

data:
  raw_dir: data/26963326/db_data
  json_dir: data/26963326/json
  publication_records: data/26963326/json/publication_records.json
  laureate_info: data/26963326/json/laureate_info.json
  award_details: data/26963326/json/award_details.json

output:
  clean_data: output/clean_data
  concepts: output/concepts
  graph: output/graph
  viz: output/viz
  reports: output/reports

graph_store:
  format: json
  nodes_file: output/graph/nodes.json
  edges_file: output/graph/edges.json
  full_graph_file: output/graph/knowledge_graph.json

concept_extraction:
  tier1_max_papers: 1000
  tier1_min_citations: 500
  tier2_min_citations: 50
  batch_size: 20
  confidence_threshold: 0.7

cross_discipline:
  min_confidence: 0.6
  llm_verify: true
  verify_sample_size: 200
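
Once the file has been parsed into a dictionary (for instance with a YAML library such as PyYAML and its `yaml.safe_load`, which is an assumption about tooling), a quick completeness check can catch missing settings early. The sketch below treats every key in the example above as required, which the project itself may not do. `EXPECTED_KEYS` and `find_missing_keys` are hypothetical names.

```python
# Keys copied from the example configuration above; treating all of them as
# required is an assumption made for this sketch.
EXPECTED_KEYS = {
    "llm": {"provider", "model", "model_heavy", "temperature", "max_tokens"},
    "openalex": {"base_url", "email", "rate_limit", "cache_dir"},
    "data": {"raw_dir", "json_dir", "publication_records", "laureate_info", "award_details"},
    "output": {"clean_data", "concepts", "graph", "viz", "reports"},
    "graph_store": {"format", "nodes_file", "edges_file", "full_graph_file"},
    "concept_extraction": {
        "tier1_max_papers", "tier1_min_citations", "tier2_min_citations",
        "batch_size", "confidence_threshold",
    },
    "cross_discipline": {"min_confidence", "llm_verify", "verify_sample_size"},
}


def find_missing_keys(config: dict) -> list[str]:
    """Return dotted names such as 'llm.model' for every expected key that is absent."""
    missing: list[str] = []
    for section, keys in EXPECTED_KEYS.items():
        present = config.get(section) or {}
        for key in sorted(keys - set(present)):
            missing.append(f"{section}.{key}")
    return missing
```

An empty result means every section and key from the example is present. Otherwise the returned names point directly at what to add.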