I.1 Overview: The Importance of Databases in Bioinformatics

I.1.1 Understanding the Database Ecosystem

In bioinformatics research, diverse databases interoperate to form a comprehensive infrastructure of biological information.

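Figure I-1: The database ecosystem in bioinformatics. It summarizes the roles of primary, secondary, specialized, and integrated databases and of analysis platforms, along with the data flow among them.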

I.1.2 A Strategic Approach to Database Selection

A framework for choosing databases effectively according to the research objective:

Step 1: Classify the research question

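This fragment is an educational decision table that illustrates one way to map a research question to candidate databases to consult. It has not been checked against each database's current API, terms of use, release, or rate limits, nor against its suitability for specific disease or clinical domains, so it cannot be used as-is for an automated recommendation system or for clinical decisions.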

🧪 Conceptual example (not runnable as a real system: a Python fragment that omits the DB selection logic, API connections, and release checks)

def classify_research_question(question_type, data_scope, analysis_depth):
    """
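    Recommend databases for a research question (illustrative lookup only).

    Args:
        question_type: "functional", "structural", "evolutionary", "clinical"
        data_scope: "single_gene", "pathway", "genome_wide", "multi_omics"
        analysis_depth: "descriptive", "comparative", "predictive", "causal"

    Returns:
        dict: recommended databases and access strategy, or {"error": ...}
        when the table has no entry for the combination (for example, any
        "evolutionary" question, which the table below does not cover).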
    """
    
    recommendations = {
        "functional": {
            "single_gene": {
                "descriptive": ["UniProt", "GO", "InterPro"],
                "comparative": ["UniProt", "GO", "OMA"],
                "predictive": ["STRING", "GO", "KEGG"],
                "causal": ["GO", "KEGG", "Reactome"]
            },
            "pathway": {
                "descriptive": ["KEGG", "Reactome", "BioCyc"],
                "comparative": ["KEGG", "STRING", "GO"],
                "predictive": ["KEGG", "STRING", "MetaCyc"],
                "causal": ["Reactome", "KEGG", "SIGNOR"]
            }
        },
        "structural": {
            "single_gene": {
                "descriptive": ["PDB", "UniProt", "Pfam"],
                "comparative": ["PDB", "CATH", "SCOP"],
                "predictive": ["AlphaFold", "ModBase", "I-TASSER"],
                "causal": ["PDB", "CASTp", "ConCavity"]
            }
        },
        "clinical": {
            "single_gene": {
                "descriptive": ["ClinVar", "OMIM", "PharmGKB"],
                "comparative": ["ClinVar", "COSMIC", "gnomAD"],
                "predictive": ["ClinVar", "PharmGKB", "DGIdb"],
                "causal": ["ClinVar", "OMIM", "DisGeNET"]
            },
            "genome_wide": {
                "descriptive": ["GWAS Catalog", "UK Biobank", "GTEx"],
                "comparative": ["GWAS Catalog", "PhenoScanner", "Open Targets"],
                "predictive": ["PRS Catalog", "GWAS Catalog", "UK Biobank"],
                "causal": ["Open Targets", "DisGeNET", "STRING"]
            }
        }
    }
    
    try:
        return {
            "primary_databases": recommendations[question_type][data_scope][analysis_depth],
            "access_strategy": generate_access_strategy(question_type, data_scope),
            "integration_approach": suggest_integration_methods(data_scope, analysis_depth)
        }
    except KeyError:
        return {"error": "Invalid combination of parameters"}

def generate_access_strategy(question_type, data_scope):
    """Generate a data access strategy."""
    if data_scope in ["genome_wide", "multi_omics"]:
        return {
            "method": "bulk_download",
            "tools": ["FTP", "API", "rsync"],
            "preprocessing": "required",
            "storage": "local_database_recommended"
        }
    else:
        return {
            "method": "query_based",
            "tools": ["REST_API", "web_interface"],
            "preprocessing": "minimal",
            "storage": "cache_sufficient"
        }

def suggest_integration_methods(data_scope, analysis_depth):
    """Suggest data integration methods."""
    integration_matrix = {
        ("single_gene", "descriptive"): ["manual_curation", "simple_joins"],
        ("single_gene", "comparative"): ["orthology_mapping", "sequence_alignment"],
        ("pathway", "predictive"): ["network_analysis", "enrichment_analysis"],
        ("genome_wide", "causal"): ["mendelian_randomization", "colocalization"],
        ("multi_omics", "predictive"): ["multi_modal_ML", "network_integration"]
    }
    
    return integration_matrix.get((data_scope, analysis_depth), ["custom_integration"])

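# Usage example
recommendation = classify_research_question("clinical", "single_gene", "predictive")
if "error" in recommendation:
    # Combinations missing from the table return an error dict instead of results
    print(f"Error: {recommendation['error']}")
else:
    print(f"Recommended databases: {recommendation['primary_databases']}")
    print(f"Access strategy: {recommendation['access_strategy']['method']}")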

I.1.3 Where ExAC and gnomAD Fit

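ExAC (Exome Aggregation Consortium) is a historical population-frequency resource, released in 2014, that aggregated large-scale exome data. Its work was extended into the successor gnomAD (Genome Aggregation Database), whose name and scope were broadened in 2016. When checking variant frequencies today, treat ExAC as a resource for reading earlier studies and papers in context; the usual practice is to consult gnomAD and record the release, genome build, exome/genome data type, subset, and filter conditions used.¹²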

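Item to check	What to record	Reason
DB and release	e.g., gnomAD v4.1.1 (checked: 2026-05-12)	Allele frequencies depend on the release.
Genome build	GRCh37, GRCh38, etc.	Affects VCF coordinates, annotation, and cross-referencing with ClinVar and other resources.
Data type	exome / genome, subset, filter conditions	The regions observed, coverage, and exclusion criteria differ.
Use of population frequency	AF, AN, AC, popmax, etc.	Useful for checking rarity, but cannot by itself determine pathogenicity or clinical decisions.
Provenance	Source URL, retrieval date, checksum, terms of use	Required for reproducibility and auditing of analysis results.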
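
One way to keep the items in the table above together is a small provenance record stored alongside each analysis. The following is a minimal sketch, not a standard schema: the class name `FrequencyLookupRecord`, its field names, the "PASS only" filter, and the placeholder URL and terms strings are all hypothetical, and the checksum is computed over stand-in bytes rather than a real download.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FrequencyLookupRecord:
    """Provenance for one population-frequency lookup (hypothetical field names)."""
    database: str
    release: str
    checked_on: str      # ISO date on which the release was confirmed
    genome_build: str    # e.g. "GRCh37" or "GRCh38"
    data_type: str       # "exome" or "genome"
    subset: str
    filters: str
    source_url: str
    sha256: str          # checksum of the file that was actually used
    terms_of_use: str


def sha256_of_bytes(payload: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(payload).hexdigest()


# Stand-in for the contents of a downloaded file (assumption for illustration).
payload = b"placeholder bytes standing in for a downloaded file"

record = FrequencyLookupRecord(
    database="gnomAD",
    release="v4.1.1",
    checked_on="2026-05-12",
    genome_build="GRCh38",
    data_type="exome",
    subset="(not specified)",
    filters="PASS only",  # assumed filter, shown only as an example
    source_url="(record the actual download URL here)",
    sha256=sha256_of_bytes(payload),
    terms_of_use="(record the applicable terms here)",
)

# Serialize the record so it can be saved next to the analysis output.
print(json.dumps(asdict(record), indent=2))
```

Making the record immutable (`frozen=True`) is a design choice here: a provenance entry describes what was used at one point in time, so a later lookup should create a new record rather than modify an old one.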

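Population-frequency databases such as gnomAD are the foundation for checking how often a variant is observed in the general population. They are useful for annotation in research and teaching, but judging clinical significance additionally requires curated information such as ClinVar, the target disease, phenotype, family history, expert review, and the institution's quality control.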
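
The arithmetic behind a rarity check can be sketched as follows, assuming per-population allele counts (AC) and allele numbers (AN) are already in hand. The function names, the population labels, the counts, and the 0.001 threshold are all made up for illustration. The maximum taken here is a plain maximum over populations with data, which is a simplification and not gnomAD's own popmax definition. The result says only whether a variant looks rare under these assumptions, not whether it is pathogenic.

```python
from typing import Dict, Optional, Tuple

# (AC, AN) per population; labels and counts are made up for illustration.
PopulationCounts = Dict[str, Tuple[int, int]]


def allele_frequency(ac: int, an: int) -> Optional[float]:
    """Return AF = AC / AN, or None when no alleles were called (AN <= 0)."""
    if an <= 0:
        return None
    return ac / an


def max_population_af(counts: PopulationCounts) -> Optional[float]:
    """Largest per-population AF, skipping populations without data."""
    afs = []
    for ac, an in counts.values():
        af = allele_frequency(ac, an)
        if af is not None:
            afs.append(af)
    return max(afs) if afs else None


def is_rare(counts: PopulationCounts, threshold: float = 0.001) -> Optional[bool]:
    """True if the highest population AF is below the threshold; None if no data."""
    highest = max_population_af(counts)
    if highest is None:
        return None
    return highest < threshold


example_counts = {"pop_a": (3, 10_000), "pop_b": (0, 5_000), "pop_c": (0, 0)}
print(max_population_af(example_counts))
print(is_rare(example_counts))
```

Returning `None` when AN is zero keeps "not observed because nothing was sequenced there" distinct from "observed zero times", which matters when exome and genome data cover different regions.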

¹ MacArthur Lab, Genomic Data Aggregation (accessed 2026-05-12)
² gnomAD, gnomAD v4.1.1 (accessed 2026-05-12)