Extract country information for a sequence record on NCBI using rentrez
3.4 years ago
kelvinfrog75 ▴ 10

I want to extract the country of origin for a set of NCBI sequences using rentrez. I have their accession numbers, but I am having trouble getting the country information. For example, for accession number MH939154 I need to extract "Romania" using rentrez.

 source          1..10976
                 /organism="West Nile virus"
                 /mol_type="genomic RNA"
                 /strain="DD84c"
                 /host="Culex pipiens s.l."
                 /db_xref="taxon:11082"
                 /country="Romania"
                 /collection_date="2014"
                 /note="lineage 2"

I have tried the code below, but it seems to extract only the countries associated with publications. Is there a way to get the country listed under the source feature?

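library(rentrez)
library(xml2)

id <- "MH939154.1"
# This queries PubMed, so any Country nodes come from publication records
db <- entrez_fetch(db = "pubmed", id = id, rettype = "xml")
xml <- read_xml(db)
recs <- xml_find_all(xml, "//Country")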
3.4 years ago

I don't know R's XML tooling, so here is an XPath expression run from the command line:

$ wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MH939154&rettype=gb&retmode=xml"  | xmllint  --xpath '//GBQualifier[GBQualifier_name="country"]/GBQualifier_value/text()' - && echo

Romania

This works fine, and I can run the command from inside R. How did you work out this link "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MH939154&rettype=gb&retmode=xml"? Thanks.


Just wonder how do you get this link

https://www.ncbi.nlm.nih.gov/books/NBK25500/
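
For reference, the link in the wget command is an EFetch request: a base endpoint followed by db, id, rettype and retmode query parameters. Here is a minimal sketch of assembling it in the shell, assuming the same values used in this thread. The variable names are illustrative only, and nothing is downloaded.

```shell
# Base endpoint for EFetch, as used in the wget command above.
base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Query parameters: database, record ID, and output format.
db="nuccore"      # nucleotide database (the question queried pubmed instead)
id="MH939154"     # accession number from the question
rettype="gb"      # GenBank record
retmode="xml"     # returned as XML

# Assemble the URL; this only builds the string and makes no network request.
url="${base}?db=${db}&id=${id}&rettype=${rettype}&retmode=${retmode}"
echo "$url"
```

Swapping the value of id should be enough to reuse the same pattern for another accession.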


Great. I am able to integrate the command script and get the country info. Thanks!

3.4 years ago
GenoMax 141k

This can also be obtained by using Entrez Direct:

$ esearch -db nuccore -query "MH939154" | esummary | xtract -pattern DocumentSummary -element SubName
DD84c|Culex pipiens s.l.|WNV|Romania|2014|lineage 2

The 4th field is the country. I will leave extracting it to you.
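
One way to pull out that field is cut with "|" as the delimiter. This is a minimal sketch that uses the sample line above as input so it runs offline. The field position is an assumption based on this one record; other records may carry a different set or order of qualifiers, so check before relying on a fixed index.

```shell
# Sample line copied from the xtract output above.
subname='DD84c|Culex pipiens s.l.|WNV|Romania|2014|lineage 2'

# Split on "|" and keep field 4 (assumed to be the country for this record).
country=$(printf '%s\n' "$subname" | cut -d'|' -f4)
echo "$country"
```

In a live pipeline the same cut -d'|' -f4 step could be appended after the xtract command.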
