Question: How to retrieve information on bacterial source from NCBI?
11 weeks ago, saadleeshehreen wrote:

Hi,

I want to get the isolation source (clinical/environmental) information for all RefSeq Pseudomonas aeruginosa genomes. As far as I know, around 2000 sequenced Pseudomonas aeruginosa genomes are available in NCBI. Sometimes the isolation source is mentioned in the BioSample record, e.g. https://www.ncbi.nlm.nih.gov/biosample/SAMN02732279/ . Since I want this information for all 2000 genomes at once, how can I retrieve it using bash? Is there a known script for this purpose?

Cheers

11 weeks ago, vkkodali (United States) wrote:

You can use Entrez Direct for this. As you know, not all of the BioSample entries have all of the information you want and even when they do, it is not always under the same attribute. You may want to look at the XML output of esummary and come up with a suitable xtract command that will fetch all of the fields you want. As an example, you can use the following query to fetch the name, Biosample accession and the isolation source in a three column tab-delimited format:

## WARNING: returns >3000 results; only first five are shown here
esearch -db assembly -query '"Pseudomonas aeruginosa"[Organism] AND latest_refseq[filter]' \
    | elink -db assembly -target biosample -name assembly_biosample \
    | esummary \
    | xtract -pattern DocumentSummary -first Title -element Accession \
        -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute

Pseudomonas aeruginosa CLJ1     SAMN07372049    lungs (tracheal aspirate)
Pseudomonas aeruginosa CLJ3     SAMN07372048    lungs (tracheal aspirate)
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa        SAMN10374626    skin
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa        SAMN10374625    Bronchial aspirate
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa        SAMN10374624    Biopsy

Thanks for that. It works for me, but I got only 1000 results. I have a table with the assembly IDs (e.g. GCF_000006765) of all Pseudomonas aeruginosa genomes, so I need to map the results back to this table. How can I map an assembly ID to its BioSample accession?

(reply written 11 weeks ago by saadleeshehreen)

"but I got only 1000 results"

Could this be because a large number of the Biosample entries lack isolation_source information? If you run the command as shown above, you should see >3000 rows in the results but the cases lacking isolation source information will only have two columns of data instead of three. You can pick out a few of those and go digging around in the Biosample DocumentSummary XML for other attributes that may be of use to you.
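The two-column rows can be picked out with a short awk filter. This is only a minimal sketch, assuming the xtract output above was saved as a tab-delimited file; the function name and the file name isolation_source.tsv are hypothetical, not part of Entrez Direct.

```shell
# Print the BioSample accession (column 2) of every row that has no
# isolation source: fewer than three tab-separated fields, or an
# empty third field.
missing_isolation_source() {
    awk -F'\t' 'NF < 3 || $3 == "" { print $2 }' "$1"
}

# Usage (isolation_source.tsv is a hypothetical file holding the xtract output):
# missing_isolation_source isolation_source.tsv > missing_biosamples.txt
```

The resulting accession list can then be inspected a few entries at a time to see which other attributes are populated.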

"How can I map back assembly id with biosample accession?"

You can use Entrez Direct for this as shown below. Once you have this table for all of your data, you can join it to the one with isolation source results on column 2.

esearch -db assembly -query 'GCF_000006765' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,BioSampleAccn
GCF_000006765.1 SAMN02603714
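The join on column 2 could be sketched with awk, which avoids having to sort both files first. This assumes two tab-delimited files: one with the title, BioSample accession and isolation source (the first query's output), and one with the assembly accession and BioSample accession (the output of the command above). The function name and both file names are hypothetical.

```shell
# Join the isolation-source table with the assembly-to-BioSample map
# on the BioSample accession (column 2 in both files).
# Output columns: assembly accession, BioSample accession, isolation source.
# "NA" is printed when the source is missing or the BioSample is absent
# from the isolation-source table.
join_on_biosample() {
    awk -F'\t' -v OFS='\t' '
        # First file: remember the isolation source for each BioSample.
        NR == FNR { src[$2] = (NF >= 3 && $3 != "") ? $3 : "NA"; next }
        # Second file: look up each BioSample and print the joined row.
        { print $1, $2, (($2 in src) ? src[$2] : "NA") }
    ' "$1" "$2"
}

# Usage (hypothetical file names):
# join_on_biosample isolation_source.tsv assembly_biosample.tsv > joined.tsv
```

Writing "NA" for missing values keeps every row at three columns, which makes the joined table easier to load elsewhere.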
(reply written 11 weeks ago by vkkodali)

Hi vkkodali! Could you please post a tutorial on how to annotate a bacterial assembly using NCBI eutils? If possible, cover both online and offline annotation. This would help many visitors here.

(reply written 11 weeks ago by cpad0112)

One solution I have just found:

esearch -db assembly -query GCF_000647595.2 | elink -related -cmd neighbor -name assembly_biosample | xml2 | grep "/eLinkResult/LinkSet/LinkSetDb/Link/Id=" | awk 'BEGIN{FS="="} {print $2}'

You just need to install xml2 for this to work.

(reply written 11 weeks ago by saadleeshehreen)