# Configuration

The project uses a YAML configuration file (`config/settings.yaml`) combined with environment variables for secrets.
## Environment Variables

Create a `.env` file in the project root:

```bash
# Required for Phase 2b (LLM concept extraction)
OPENAI_API_KEY=sk-your-api-key-here

# Optional: custom OpenAI-compatible endpoint
OPENAI_BASE_URL=https://api.openai.com/v1
```
## Configuration File

The main configuration file is `config/settings.yaml`:
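At startup the two sources are merged: the YAML file supplies non-secret settings, while API keys come from the environment. A minimal sketch, assuming PyYAML and a hypothetical `load_settings` helper (the project's actual loader may differ; the inline YAML string stands in for `config/settings.yaml`):

```python
import os
import yaml  # PyYAML, assumed available


def load_settings(yaml_text: str) -> dict:
    """Parse settings YAML and overlay secrets from the environment.

    Hypothetical helper: secrets such as OPENAI_API_KEY stay out of the
    YAML file and are read from the environment (e.g. via a .env file).
    """
    cfg = yaml.safe_load(yaml_text)
    llm = cfg.setdefault("llm", {})
    llm["api_key"] = os.environ.get("OPENAI_API_KEY")
    llm["base_url"] = os.environ.get("OPENAI_BASE_URL",
                                     "https://api.openai.com/v1")
    return cfg


# Inline stand-in for reading config/settings.yaml from disk.
cfg = load_settings("llm:\n  provider: openai\n  model: gpt-4o-mini\n")
```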
### LLM Configuration

```yaml
llm:
  provider: openai     # openai | anthropic | ollama
  model: gpt-4o-mini   # Concept extraction (cost-efficient)
  model_heavy: gpt-4o  # Relation validation / insight generation
  temperature: 0.2     # Low temperature for consistency
  max_tokens: 4096
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `provider` | string | `openai` | LLM provider |
| `model` | string | `gpt-4o-mini` | Model for concept extraction |
| `model_heavy` | string | `gpt-4o` | Model for complex reasoning tasks |
| `temperature` | float | `0.2` | Sampling temperature |
| `max_tokens` | int | `4096` | Maximum response tokens |
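In code, the two-model split typically reduces to choosing `model` or `model_heavy` per request. A sketch with a hypothetical `llm_params` helper (not part of the documented API):

```python
def llm_params(llm_cfg: dict, heavy: bool = False) -> dict:
    """Build LLM request parameters from the llm: config section.

    Hypothetical helper: heavy=True selects model_heavy for relation
    validation / insight generation; the default model handles the
    cheaper concept-extraction calls.
    """
    return {
        "model": llm_cfg["model_heavy"] if heavy else llm_cfg["model"],
        "temperature": llm_cfg["temperature"],
        "max_tokens": llm_cfg["max_tokens"],
    }


llm_cfg = {"model": "gpt-4o-mini", "model_heavy": "gpt-4o",
           "temperature": 0.2, "max_tokens": 4096}
```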
### OpenAlex API Configuration

```yaml
openalex:
  base_url: https://api.openalex.org
  email: user@example.com           # Polite pool email (faster rate limits)
  rate_limit: 10                    # Requests per second
  cache_dir: output/openalex_cache  # API response cache directory
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `base_url` | string | `https://api.openalex.org` | OpenAlex API base URL |
| `email` | string | — | Email for polite pool access (10 req/s) |
| `rate_limit` | int | `10` | Max requests per second |
| `cache_dir` | string | `output/openalex_cache` | Directory for cached API responses |
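OpenAlex routes requests carrying a `mailto` query parameter into its polite pool. A sketch of URL construction plus a simple client-side throttle honoring `rate_limit` (the project's actual HTTP client may differ; `polite_url` and `Throttle` are illustrative names):

```python
import time
from urllib.parse import urlencode


def polite_url(base_url: str, path: str, email: str, **params) -> str:
    """Append the mailto parameter so requests land in the polite pool."""
    params["mailto"] = email
    return f"{base_url}{path}?{urlencode(params)}"


class Throttle:
    """Space calls at least 1/rate_limit seconds apart (simple sketch)."""

    def __init__(self, rate_limit: int):
        self.min_interval = 1.0 / rate_limit
        self._last = 0.0

    def wait(self) -> None:
        delay = self._last + self.min_interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()


url = polite_url("https://api.openalex.org", "/works",
                 "user@example.com", search="nobel")
```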
### Data Paths

```yaml
data:
  raw_dir: data/26963326/db_data
  json_dir: data/26963326/json
  publication_records: data/26963326/json/publication_records.json
  laureate_info: data/26963326/json/laureate_info.json
  award_details: data/26963326/json/award_details.json
```
All paths are relative to the project root and are automatically resolved to absolute paths at runtime.
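The resolution step can be sketched with `pathlib`; `resolve_paths` is a hypothetical helper, not necessarily the project's actual implementation:

```python
from pathlib import Path


def resolve_paths(section: dict, project_root: Path) -> dict:
    """Resolve a config section of relative paths against the project root."""
    return {key: (project_root / value).resolve()
            for key, value in section.items()}


data_cfg = {
    "json_dir": "data/26963326/json",
    "laureate_info": "data/26963326/json/laureate_info.json",
}
# Illustrative project root; in practice this would be detected at runtime.
resolved = resolve_paths(data_cfg, Path("/srv/project"))
```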
### Output Paths

```yaml
output:
  clean_data: output/clean_data
  concepts: output/concepts
  graph: output/graph
  viz: output/viz
  reports: output/reports
```
### Graph Storage

```yaml
graph_store:
  format: json  # json | neo4j (optional)
  nodes_file: output/graph/nodes.json
  edges_file: output/graph/edges.json
  full_graph_file: output/graph/knowledge_graph.json
```
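With the JSON format, persisting the graph amounts to writing the node and edge lists to the configured files. A sketch, assuming a hypothetical `save_graph` helper and an illustrative node schema (the project's actual schema may differ):

```python
import json
import tempfile
from pathlib import Path


def save_graph(nodes: list, edges: list, graph_cfg: dict) -> None:
    """Write node and edge lists to the configured JSON files (sketch)."""
    for key, payload in (("nodes_file", nodes), ("edges_file", edges)):
        path = Path(graph_cfg[key])
        path.parent.mkdir(parents=True, exist_ok=True)  # e.g. output/graph/
        path.write_text(json.dumps(payload, indent=2))


# Illustrative usage with a temporary directory standing in for output/.
root = Path(tempfile.mkdtemp())
cfg = {"nodes_file": root / "graph/nodes.json",
       "edges_file": root / "graph/edges.json"}
save_graph([{"id": "c1", "label": "quantum tunneling"}], [], cfg)
```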
### Concept Extraction Strategy

```yaml
concept_extraction:
  tier1_max_papers: 1000     # Max papers for LLM extraction
  tier1_min_citations: 500   # Tier 1 citation threshold
  tier2_min_citations: 50    # Tier 2 citation threshold
  batch_size: 20             # LLM batch processing size
  confidence_threshold: 0.7  # Concept confidence filter
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `tier1_max_papers` | int | `1000` | Maximum papers for deep LLM extraction |
| `tier1_min_citations` | int | `500` | Minimum citation count for Tier 1 |
| `tier2_min_citations` | int | `50` | Minimum citation count for Tier 2 |
| `batch_size` | int | `20` | Papers processed per LLM batch |
| `confidence_threshold` | float | `0.7` | Minimum confidence for concept inclusion |
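The tiering rule can be sketched as follows. This is a minimal illustration: the `citations` key, and the assumption that Tier 1 overflow beyond `tier1_max_papers` falls back to Tier 2, are guesses about the pipeline, not documented behavior:

```python
def split_tiers(papers: list, cfg: dict) -> tuple:
    """Assign papers to extraction tiers by citation count (sketch).

    Tier 1 (deep LLM extraction) takes the most-cited papers above
    tier1_min_citations, capped at tier1_max_papers; remaining papers
    above tier2_min_citations fall into Tier 2.
    """
    by_cit = sorted(papers, key=lambda p: p["citations"], reverse=True)
    n_tier1 = min(cfg["tier1_max_papers"],
                  sum(1 for p in by_cit
                      if p["citations"] >= cfg["tier1_min_citations"]))
    tier1 = by_cit[:n_tier1]
    tier2 = [p for p in by_cit[n_tier1:]
             if p["citations"] >= cfg["tier2_min_citations"]]
    return tier1, tier2


# Small illustrative config (real defaults: 1000 / 500 / 50).
cfg = {"tier1_max_papers": 2, "tier1_min_citations": 500,
       "tier2_min_citations": 50}
papers = [{"citations": c} for c in (100, 900, 10, 700, 600)]
tier1, tier2 = split_tiers(papers, cfg)
```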
### Cross-Discipline Detection

```yaml
cross_discipline:
  min_confidence: 0.6      # Minimum confidence for cross-field migration
  llm_verify: true         # Use LLM to verify candidate migrations
  verify_sample_size: 200  # Number of candidates for LLM verification
```
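These three settings interact: candidates are first filtered by `min_confidence`, then (if `llm_verify` is enabled) at most `verify_sample_size` of them are sampled for LLM verification. A sketch with a hypothetical helper; the `confidence` key is an assumption about the candidate schema:

```python
import random


def candidates_for_llm_verify(candidates: list, cfg: dict,
                              seed: int = 0) -> list:
    """Select cross-discipline candidates to send to the LLM (sketch)."""
    if not cfg["llm_verify"]:
        return []  # verification disabled: nothing goes to the LLM
    eligible = [c for c in candidates
                if c["confidence"] >= cfg["min_confidence"]]
    if len(eligible) <= cfg["verify_sample_size"]:
        return eligible
    # Deterministic sample for reproducible runs.
    return random.Random(seed).sample(eligible, cfg["verify_sample_size"])


# Small illustrative config (real defaults: 0.6 / true / 200).
cfg = {"min_confidence": 0.6, "llm_verify": True, "verify_sample_size": 2}
cands = [{"confidence": c} for c in (0.9, 0.7, 0.5, 0.65)]
picked = candidates_for_llm_verify(cands, cfg)
```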
## Full Configuration Example

Complete `config/settings.yaml`:

```yaml
llm:
  provider: openai
  model: gpt-4o-mini
  model_heavy: gpt-4o
  temperature: 0.2
  max_tokens: 4096

openalex:
  base_url: https://api.openalex.org
  email: user@example.com
  rate_limit: 10
  cache_dir: output/openalex_cache

data:
  raw_dir: data/26963326/db_data
  json_dir: data/26963326/json
  publication_records: data/26963326/json/publication_records.json
  laureate_info: data/26963326/json/laureate_info.json
  award_details: data/26963326/json/award_details.json

output:
  clean_data: output/clean_data
  concepts: output/concepts
  graph: output/graph
  viz: output/viz
  reports: output/reports

graph_store:
  format: json
  nodes_file: output/graph/nodes.json
  edges_file: output/graph/edges.json
  full_graph_file: output/graph/knowledge_graph.json

concept_extraction:
  tier1_max_papers: 1000
  tier1_min_citations: 500
  tier2_min_citations: 50
  batch_size: 20
  confidence_threshold: 0.7

cross_discipline:
  min_confidence: 0.6
  llm_verify: true
  verify_sample_size: 200
```