Rで解析：Entrezの操作が楽々です！「rentrez」パッケージ

Entrezは論文のデータベースであるPubmedでおなじみのNational Center for Biotechnology Information (NCBI)が開発した検索システムです。あまりにも有名なのでEntrezの説明はしませんが、論文だけでなく、SNPも含めた遺伝子情報のデータベースもあり、J-STAGE、JDreamⅢともに重宝しています。

Entrezもそうですが、多くのウェブ上のデータベースはソフトウェアから操作するためのAPIを公開しています。APIを利用してデータを取得すると作業効率が格段に上昇します。

本パッケージはEntrezのAPIを操作してデータを取得するパッケージです。一瞬で、必要なデータを取得できます。

ただし、大量のデータを取得する行為は迷惑となりますので十分に注意してください。

パッケージのバージョンは1.2.3。実行コマンドはR version 4.2.2で確認しています。

パッケージのインストール

下記コマンドを実行してください。

#パッケージのインストール
install.packages("rentrez")

実行コマンド

詳細はコメント、パッケージヘルプを確認してください。

#パッケージの読み込み
library("rentrez")

#データベースの一覧を取得:entrez_absコマンド
entrez_dbs()
[1] "pubmed"          "protein"         "nuccore"         "nucleotide"      "nucgss"
[6] "nucest"          "structure"       "genome"          "gpipe"           "annotinfo"
[11] "assembly"        "bioproject"      "biosample"       "blastdbinfo"     "books"
[16] "cdd"             "clinvar"         "clone"           "gap"             "gapplus"
[21] "grasp"           "dbvar"           "epigenomics"     "gene"            "gds"
[26] "geoprofiles"     "homologene"      "medgen"          "mesh"            "ncbisearch"
[31] "nlmcatalog"      "omim"            "orgtrack"        "pmc"             "popset"
[36] "probe"           "proteinclusters" "pcassay"         "biosystems"      "pccompound"
[41] "pcsubstance"     "pubmedhealth"    "seqannot"        "snp"             "sra"
[46] "taxonomy"        "unigene"         "gencoll"         "gtr" 

#データベース要約を表示:entrez_db_summaryコマンド
entrez_db_summary("gene")

DbName: gene
MenuName: Gene
Description: Gene database
DbBuild: Build230112-2235m.1
Count: 63762982
LastUpdate: 2023/01/13 17:52 

#検索で使用できるオプションを表示:entrez_db_searchableコマンド
entrez_db_searchable("gene")

Searchable fields for database 'gene'
ALL 	 All terms from all searchable fields 
UID 	 Unique number assigned to a gene record 
FILT 	 Limits the records 
TITL 	 gene or protein name 
WORD 	 Free text associated with record 
ORGN 	 scientific and common names of organism 
MDAT 	 The last date on which the record was updated 
CHR 	 Chromosome number or numbers; also 'mitochondrial', 'unknown' properties 
MV 	 Chromosomal map location as displayed in MapViewer 
GENE 	 Symbol or symbols of the gene 
ECNO 	 EC number for enzyme or CAS registry number 
MIM 	 MIM number from OMIM 
DIS 	 Name(s) of diseases associated with this gene. When available, OMIM name will be used 
ACCN 	 Nucleotide or protein accession(s) associated with this gene 
UGEN 	 UniGene cluster number for this gene 
PROP 	 Properties of Gene record 
CDAT 	 The date on which this record first appeared 
NCAC 	 nucleotide accessions of sequences 
NUID 	 nucleotide uids of sequences 
PACC 	 protein accessions 
PUID 	 protein uids 
PMID 	 PubMed ids of accessions linked to the record 
TID 	 taxonomy id 
GO 	 Gene Ontology 
DOM 	 Domain Name 
DDAT 	 The date on which the record was discontinued 
CPOS 	 Chromosome base position 
GFN 	 Gene full name 
PFN 	 Protein full name 
GL 	 Gene length 
XC 	 Exon count 
GRP 	 Relationships for this gene 
PREF 	 Preferred symbol of the gene 
AACC 	 Assembly accession 
ASM 	 Assembly name 
EXPR 	 Gene expression 

#キーワードで検索:entrez_searchコマンド
#termはEntrezの形式がそのまま使用できます
#データベースの指定:dbオプション
#結果取得数の指定:retmaxオプション
r_search <- entrez_search(db = "gene", term = "MTHFR AND Homo sapiens[ORGN]", retmax = NULL)
#確認
r_search

Entrez search result with 196 hits (object contains 20 IDs and no web_history object)
Search term (as translated):  MTHFR[All Fields] AND "Homo sapiens"[Organism] 

#データ構造の確認
str(r_search)
List of 5
$ ids             : chr [1:20] "7157" "348" "7124" "7422" ...
$ count           : int 196
$ retmax          : int 20
$ QueryTranslation: chr "MTHFR[All Fields] AND \"Homo sapiens\"[Organism]"
$ file            :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr> 
  - attr(*, "class")= chr [1:2] "esearch" "list"

#検索結果の要約を取得:entrez_summarコマンド
#idは取得結果とデータが格納されているリスト名の組み合わせ
GeneResult <- entrez_summary(db = "gene", id = r_search$ids)

#確認
$`7157`
esummary result with 20 items:
[1] uid                name               description        status             currentid         
[6] chromosome         geneticsource      maplocation        otheraliases       otherdesignations 
[11] nomenclaturesymbol nomenclaturename   nomenclaturestatus mim                genomicinfo       
[16] geneweight         summary            chrsort            chrstart           organism    

#以下略
#例えば一番目結果のdescriptionを表示
sapply(GeneResult, "[[", "description")[1]
　　　　　　　7157 
"tumor protein p53" 

###パッケージ公式ページより#####
#特定キーワードの年代別論文数を取得コマンドを一部改変
#serch_yearコマンドの作成
search_year <- function(year, term){
  query <- paste(term, "AND (", year, "[PDAT])")
  entrez_search(db = "pubmed", term = query, retmax = 0)$count
}
#年範囲を指定
year <- 2008:2022
papers <- sapply(year, search_year,
                 term = "MTHFR AND Homo sapiens[ORGN]", 
                 USE.NAMES = FALSE)
plot(year, papers, type = 'b', main = "MTHFR AND Homo sapiens[ORGN]:PAPERS",
     ylim = c(0, type.convert(max(papers))*1.2))

#特定キーワードの論文を取得し情報をまとめる
SearchResult <- entrez_search(db = "pubmed", retmax = 10,
                              term = "folic acid AND MTHFR")
#ジャーナル名の取得
GetSummary <- entrez_summary(db="pubmed", id=SearchResult$ids)
#タイトルを取得
sapply(GetSummary, "[[", "title")[1]
36637428 
#"Methylenetetrahydrofolate reductase deficiency and high dose FA
#supplementation disrupt embryonic development of energy balance and metabolic
#homeostasis in zebrafish." 

#作業フォルダにマークダウンしてHTMLテーブルで出力
#体裁は整えていません
#entrez_summaryの結果はリストで返されます
#通常のリストに対する操作でデータをいじれます
cat(knitr::kable(sapply(GetSummary, "[[", "title"),
                 row.names = TRUE, caption = "TEST table",
                 col.names = c("TiTle"), format = "html"),
    file = "TEST.html", sep = "\n")