# Configuration Guide

The project uses a YAML configuration file (`config/settings.yaml`) together with environment variables for managing secrets.

## Environment Variables

Create a `.env` file in the project root:
```bash
# Required for Stage 2b (LLM concept extraction)
OPENAI_API_KEY=sk-your-api-key-here

# Optional: custom OpenAI-compatible endpoint
OPENAI_BASE_URL=https://api.openai.com/v1
```
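Loading the `.env` file can be sketched with the standard library alone (`load_env_file` is an illustrative helper, not part of the project; a real setup would more likely use python-dotenv, which also handles quoting and interpolation):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file and export them.

    Minimal stdlib sketch: skips blank lines and comments, and does not
    overwrite variables that are already set in the environment.
    """
    loaded = {}
    env_file = Path(path)
    if not env_file.exists():
        return loaded
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip()
        os.environ.setdefault(key.strip(), value.strip())
    return loaded
```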
## Configuration File

The main configuration file is `config/settings.yaml`:

### LLM Configuration
```yaml
llm:
  provider: openai      # openai | anthropic | ollama
  model: gpt-4o-mini    # concept extraction (cost-optimized)
  model_heavy: gpt-4o   # relation verification / insight generation
  temperature: 0.2      # low temperature for consistency
  max_tokens: 4096
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `provider` | string | `openai` | LLM provider |
| `model` | string | `gpt-4o-mini` | Concept-extraction model |
| `model_heavy` | string | `gpt-4o` | Model for complex reasoning tasks |
| `temperature` | float | `0.2` | Sampling temperature |
| `max_tokens` | int | `4096` | Maximum response tokens |
### OpenAlex API Configuration
```yaml
openalex:
  base_url: https://api.openalex.org
  email: user@example.com           # polite-pool email (better rate limits)
  rate_limit: 10                    # requests per second
  cache_dir: output/openalex_cache  # cache directory for API responses
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `base_url` | string | `https://api.openalex.org` | OpenAlex API base URL |
| `email` | string | — | Email for polite-pool access (10 requests/s) |
| `rate_limit` | int | `10` | Maximum requests per second |
| `cache_dir` | string | `output/openalex_cache` | Directory for cached API responses |
### Data Paths
```yaml
data:
  raw_dir: data/26963326/db_data
  json_dir: data/26963326/json
  publication_records: data/26963326/json/publication_records.json
  laureate_info: data/26963326/json/laureate_info.json
  award_details: data/26963326/json/award_details.json
```
All paths are relative to the project root and are resolved to absolute paths at runtime.
### Output Paths
```yaml
output:
  clean_data: output/clean_data
  concepts: output/concepts
  graph: output/graph
  viz: output/viz
  reports: output/reports
```
### Graph Storage
```yaml
graph_store:
  format: json   # json | neo4j (optional)
  nodes_file: output/graph/nodes.json
  edges_file: output/graph/edges.json
  full_graph_file: output/graph/knowledge_graph.json
```
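Reading the JSON store back can be sketched with the standard library. This assumes nodes carry an `id` field and edges a `source`/`target` pair; the actual schema is defined by the pipeline's export step:

```python
import json
from pathlib import Path

def load_graph(nodes_file: str, edges_file: str):
    """Load the JSON graph store into a node index and adjacency map.

    Illustrative sketch under the assumed schema above; returns
    (nodes by id, outgoing-neighbor lists by id).
    """
    nodes = {n["id"]: n for n in json.loads(Path(nodes_file).read_text())}
    adjacency = {node_id: [] for node_id in nodes}
    for edge in json.loads(Path(edges_file).read_text()):
        adjacency[edge["source"]].append(edge["target"])
    return nodes, adjacency
```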
### Concept Extraction Strategy
```yaml
concept_extraction:
  tier1_max_papers: 1000     # cap on papers for LLM extraction
  tier1_min_citations: 500   # Tier 1 citation threshold
  tier2_min_citations: 50    # Tier 2 citation threshold
  batch_size: 20             # LLM batch size
  confidence_threshold: 0.7  # confidence filter for extracted concepts
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `tier1_max_papers` | int | `1000` | Maximum papers for deep LLM extraction |
| `tier1_min_citations` | int | `500` | Minimum citations for Tier 1 |
| `tier2_min_citations` | int | `50` | Minimum citations for Tier 2 |
| `batch_size` | int | `20` | Papers processed per batch |
| `confidence_threshold` | float | `0.7` | Minimum confidence for including a concept |
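How the two citation thresholds interact can be sketched as a simple tier assignment (the exact tiering logic lives in the pipeline; `assign_tier` is a hypothetical helper showing the intent of the settings above):

```python
def assign_tier(citations: int, cfg: dict) -> int:
    """Map a paper's citation count to an extraction tier.

    Tier 1 (>= tier1_min_citations) gets full LLM extraction, capped
    overall by tier1_max_papers; Tier 2 (>= tier2_min_citations) gets a
    lighter pass; everything else falls through to Tier 3.
    """
    if citations >= cfg["tier1_min_citations"]:
        return 1
    if citations >= cfg["tier2_min_citations"]:
        return 2
    return 3
```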
### Cross-Discipline Detection
```yaml
cross_discipline:
  min_confidence: 0.6      # minimum confidence for cross-domain transfers
  llm_verify: true         # verify candidate transfers with an LLM
  verify_sample_size: 200  # number of candidates sampled for LLM verification
```
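The three settings combine into a filter-then-sample step, which can be sketched as follows (`select_for_llm_verification` is a hypothetical helper; it assumes candidates are dicts with a `confidence` field):

```python
import random

def select_for_llm_verification(candidates: list, cfg: dict, seed: int = 0) -> list:
    """Filter transfer candidates by confidence, then sample for LLM verification.

    Candidates below min_confidence are dropped; if llm_verify is on,
    at most verify_sample_size survivors are sampled for the LLM pass.
    """
    eligible = [c for c in candidates if c["confidence"] >= cfg["min_confidence"]]
    if not cfg["llm_verify"]:
        return eligible
    k = min(cfg["verify_sample_size"], len(eligible))
    rng = random.Random(seed)  # seeded for reproducible sampling
    return rng.sample(eligible, k)
```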
## Full Configuration Example

Complete `config/settings.yaml`:
```yaml
llm:
  provider: openai
  model: gpt-4o-mini
  model_heavy: gpt-4o
  temperature: 0.2
  max_tokens: 4096

openalex:
  base_url: https://api.openalex.org
  email: user@example.com
  rate_limit: 10
  cache_dir: output/openalex_cache

data:
  raw_dir: data/26963326/db_data
  json_dir: data/26963326/json
  publication_records: data/26963326/json/publication_records.json
  laureate_info: data/26963326/json/laureate_info.json
  award_details: data/26963326/json/award_details.json

output:
  clean_data: output/clean_data
  concepts: output/concepts
  graph: output/graph
  viz: output/viz
  reports: output/reports

graph_store:
  format: json
  nodes_file: output/graph/nodes.json
  edges_file: output/graph/edges.json
  full_graph_file: output/graph/knowledge_graph.json

concept_extraction:
  tier1_max_papers: 1000
  tier1_min_citations: 500
  tier2_min_citations: 50
  batch_size: 20
  confidence_threshold: 0.7

cross_discipline:
  min_confidence: 0.6
  llm_verify: true
  verify_sample_size: 200
```