# Configuration Guide

The project uses a YAML configuration file (`config/settings.yaml`) together with environment variables for managing secrets.

## Environment Variables

Create a `.env` file in the project root:
```bash
# Required for Stage 2b (LLM concept extraction)
OPENAI_API_KEY=sk-your-api-key-here

# Optional: custom OpenAI-compatible endpoint
OPENAI_BASE_URL=https://api.openai.com/v1
```
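Loading the `.env` file can be sketched with the standard library alone (`load_env_file` is an illustrative helper, not part of the project; a real setup would more likely use python-dotenv, which also handles quoting and interpolation):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file and export them.

    Minimal stdlib sketch: skips blank lines and comments, and does not
    overwrite variables that are already set in the environment.
    """
    loaded = {}
    env_file = Path(path)
    if not env_file.exists():
        return loaded
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip()
        os.environ.setdefault(key.strip(), value.strip())
    return loaded
```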
## Configuration File

The main configuration file is `config/settings.yaml`:

### LLM Configuration
```yaml
llm:
  provider: openai      # openai | anthropic | ollama
  model: gpt-4o-mini    # concept extraction (cost-optimized)
  model_heavy: gpt-4o   # relation verification / insight generation
  temperature: 0.2      # low temperature for consistency
  max_tokens: 4096
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `provider` | string | `openai` | LLM provider |
| `model` | string | `gpt-4o-mini` | Concept-extraction model |
| `model_heavy` | string | `gpt-4o` | Model for complex reasoning tasks |
| `temperature` | float | `0.2` | Sampling temperature |
| `max_tokens` | int | `4096` | Maximum response tokens |
### OpenAlex API Configuration
```yaml
openalex:
  base_url: https://api.openalex.org
  email: user@example.com           # polite-pool email (better rate limits)
  rate_limit: 10                    # requests per second
  cache_dir: output/openalex_cache  # cache directory for API responses
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `base_url` | string | `https://api.openalex.org` | OpenAlex API base URL |
| `email` | string | — | Email for polite-pool access (10 requests/s) |
| `rate_limit` | int | `10` | Maximum requests per second |
| `cache_dir` | string | `output/openalex_cache` | Directory for cached API responses |
### Data Paths
```yaml
data:
  raw_dir: data/26963326/db_data
  json_dir: data/26963326/json
  publication_records: data/26963326/json/publication_records.json
  laureate_info: data/26963326/json/laureate_info.json
  award_details: data/26963326/json/award_details.json
```
All paths are relative to the project root and are resolved to absolute paths at runtime.
### Output Paths
```yaml
output:
  clean_data: output/clean_data
  concepts: output/concepts
  graph: output/graph
  viz: output/viz
  reports: output/reports
```
### Graph Storage
```yaml
graph_store:
  format: json   # json | neo4j (optional)
  nodes_file: output/graph/nodes.json
  edges_file: output/graph/edges.json
  full_graph_file: output/graph/knowledge_graph.json
```
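Reading the JSON store back can be sketched with the standard library. This assumes nodes carry an `id` field and edges a `source`/`target` pair; the actual schema is defined by the pipeline's export step:

```python
import json
from pathlib import Path

def load_graph(nodes_file: str, edges_file: str):
    """Load the JSON graph store into a node index and adjacency map.

    Illustrative sketch under the assumed schema above; returns
    (nodes by id, outgoing-neighbor lists by id).
    """
    nodes = {n["id"]: n for n in json.loads(Path(nodes_file).read_text())}
    adjacency = {node_id: [] for node_id in nodes}
    for edge in json.loads(Path(edges_file).read_text()):
        adjacency[edge["source"]].append(edge["target"])
    return nodes, adjacency
```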
### Concept Extraction Strategy
```yaml
concept_extraction:
  tier1_max_papers: 1000     # cap on papers for LLM extraction
  tier1_min_citations: 500   # Tier 1 citation threshold
  tier2_min_citations: 50    # Tier 2 citation threshold
  batch_size: 20             # LLM batch size
  confidence_threshold: 0.7  # confidence filter for extracted concepts
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `tier1_max_papers` | int | `1000` | Maximum papers for deep LLM extraction |
| `tier1_min_citations` | int | `500` | Minimum citations for Tier 1 |
| `tier2_min_citations` | int | `50` | Minimum citations for Tier 2 |
| `batch_size` | int | `20` | Papers processed per batch |
| `confidence_threshold` | float | `0.7` | Minimum confidence for including a concept |
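How the two citation thresholds interact can be sketched as a simple tier assignment (the exact tiering logic lives in the pipeline; `assign_tier` is a hypothetical helper showing the intent of the settings above):

```python
def assign_tier(citations: int, cfg: dict) -> int:
    """Map a paper's citation count to an extraction tier.

    Tier 1 (>= tier1_min_citations) gets full LLM extraction, capped
    overall by tier1_max_papers; Tier 2 (>= tier2_min_citations) gets a
    lighter pass; everything else falls through to Tier 3.
    """
    if citations >= cfg["tier1_min_citations"]:
        return 1
    if citations >= cfg["tier2_min_citations"]:
        return 2
    return 3
```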
### Cross-Discipline Detection
```yaml
cross_discipline:
  min_confidence: 0.6      # minimum confidence for cross-domain transfers
  llm_verify: true         # verify candidate transfers with an LLM
  verify_sample_size: 200  # number of candidates sampled for LLM verification
```
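The three settings combine into a filter-then-sample step, which can be sketched as follows (`select_for_llm_verification` is a hypothetical helper; it assumes candidates are dicts with a `confidence` field):

```python
import random

def select_for_llm_verification(candidates: list, cfg: dict, seed: int = 0) -> list:
    """Filter transfer candidates by confidence, then sample for LLM verification.

    Candidates below min_confidence are dropped; if llm_verify is on,
    at most verify_sample_size survivors are sampled for the LLM pass.
    """
    eligible = [c for c in candidates if c["confidence"] >= cfg["min_confidence"]]
    if not cfg["llm_verify"]:
        return eligible
    k = min(cfg["verify_sample_size"], len(eligible))
    rng = random.Random(seed)  # seeded for reproducible sampling
    return rng.sample(eligible, k)
```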
## Full Configuration Example

Complete `config/settings.yaml`:
```yaml
llm:
  provider: openai
  model: gpt-4o-mini
  model_heavy: gpt-4o
  temperature: 0.2
  max_tokens: 4096

openalex:
  base_url: https://api.openalex.org
  email: user@example.com
  rate_limit: 10
  cache_dir: output/openalex_cache

data:
  raw_dir: data/26963326/db_data
  json_dir: data/26963326/json
  publication_records: data/26963326/json/publication_records.json
  laureate_info: data/26963326/json/laureate_info.json
  award_details: data/26963326/json/award_details.json

output:
  clean_data: output/clean_data
  concepts: output/concepts
  graph: output/graph
  viz: output/viz
  reports: output/reports

graph_store:
  format: json
  nodes_file: output/graph/nodes.json
  edges_file: output/graph/edges.json
  full_graph_file: output/graph/knowledge_graph.json

concept_extraction:
  tier1_max_papers: 1000
  tier1_min_citations: 500
  tier2_min_citations: 50
  batch_size: 20
  confidence_threshold: 0.7

cross_discipline:
  min_confidence: 0.6
  llm_verify: true
  verify_sample_size: 200
```