How to retrieve information on bacterial source from NCBI?
3.0 years ago

Hi,

I want to get the isolation source (clinical/environmental) information for all RefSeq Pseudomonas aeruginosa genomes. As far as I know, roughly 2000 sequenced Pseudomonas aeruginosa genomes are available in NCBI. Sometimes the isolation source is mentioned in the BioSample record, e.g. https://www.ncbi.nlm.nih.gov/biosample/SAMN02732279/ . Since I want to get this information for all 2000 genomes at once, how can I retrieve it using bash? Is there a known script for this purpose?

Cheers

3.0 years ago
vkkodali ★ 2.8k

You can use Entrez Direct for this. As you know, not all BioSample entries have all of the information you want, and even when they do, it is not always under the same attribute. You may want to look at the XML output of esummary and come up with a suitable xtract command that fetches all of the fields you want. As an example, you can use the following query to fetch the name, BioSample accession and isolation source in a three-column tab-delimited format:

## WARNING: returns >3000 results; only first five are shown here
esearch -db assembly -query '"Pseudomonas aeruginosa"[Organism] AND latest_refseq[filter]' \
| elink -db assembly -target biosample -name assembly_biosample \
| esummary \
| xtract -pattern DocumentSummary -first Title -element Accession \
-group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute

Pseudomonas aeruginosa CLJ1     SAMN07372049    lungs (tracheal aspirate)
Pseudomonas aeruginosa CLJ3     SAMN07372048    lungs (tracheal aspirate)
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa        SAMN10374626    skin
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa        SAMN10374625    Bronchial aspirate
Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa        SAMN10374624    Biopsy


Thanks for that. It works for me, but I got only 1000 results. I have a table with all the assembly IDs (e.g. GCF_000006765) of Pseudomonas aeruginosa, so I need to map the results back to this table. How can I map back assembly id with biosample accession?


but I got only 1000 results

Could this be because a large number of the Biosample entries lack isolation_source information? If you run the command as shown above, you should see >3000 rows in the results but the cases lacking isolation source information will only have two columns of data instead of three. You can pick out a few of those and go digging around in the Biosample DocumentSummary XML for other attributes that may be of use to you.
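To see how many entries are affected, you can split the tab-delimited output by field count. The snippet below is only a sketch: the file names are made up for illustration, the first two rows are copied from the example output above, and the last row (SAMN00000000) is an invented placeholder standing in for an entry with no isolation_source attribute.

```shell
# Small sample of the three-column xtract output (tab-delimited).
# The SAMN00000000 row is a made-up placeholder with only two columns.
printf 'Pseudomonas aeruginosa CLJ1\tSAMN07372049\tlungs (tracheal aspirate)\n'  > biosample_source.tsv
printf 'Pseudomonas aeruginosa CLJ3\tSAMN07372048\tlungs (tracheal aspirate)\n' >> biosample_source.tsv
printf 'Placeholder entry without isolation source\tSAMN00000000\n'             >> biosample_source.tsv

# Rows with fewer than three tab-separated fields lack an isolation source;
# keep only their BioSample accession for later inspection.
awk -F'\t' 'NF < 3 { print $2 }' biosample_source.tsv > missing_source.txt

# Rows with all three fields go to a separate table.
awk -F'\t' 'NF >= 3' biosample_source.tsv > with_source.tsv

# Quick count of each group.
wc -l missing_source.txt with_source.tsv
```

The accessions in missing_source.txt are the ones worth checking by hand in the BioSample XML for alternative attributes.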

How can I map back assembly id with biosample accession?

You can use Entrez Direct for this as shown below. Once you have this table for all of your data, you can join it to the one with isolation source results on column 2.

esearch -db assembly -query 'GCF_000006765' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,BioSampleAccn
GCF_000006765.1 SAMN02603714
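For the join step itself, here is a rough sketch using awk, which avoids having to sort both files first as the join command would require. The file names are hypothetical, GCF_000000001.1 is an invented placeholder accession (I have not looked up the real assembly for that BioSample), and "NA" is just my choice of filler for entries with no isolation source.

```shell
# Table 1: assembly accession -> BioSample accession (tab-delimited).
# The first row is the real pair shown above; the second assembly
# accession is a made-up placeholder for illustration only.
printf 'GCF_000006765.1\tSAMN02603714\n'  > assembly_biosample.tsv
printf 'GCF_000000001.1\tSAMN07372049\n' >> assembly_biosample.tsv

# Table 2: name, BioSample accession, isolation source (rows taken from
# the example xtract output earlier in this thread).
printf 'Pseudomonas aeruginosa CLJ1\tSAMN07372049\tlungs (tracheal aspirate)\n'  > biosample_source.tsv
printf 'Pathogen: clinical or host-associated sample from Pseudomonas aeruginosa\tSAMN10374626\tskin\n' >> biosample_source.tsv

# Pass 1 (NR == FNR): remember the isolation source for each BioSample.
# Pass 2: print assembly, BioSample and isolation source, or NA if unknown.
awk -F'\t' -v OFS='\t' '
    NR == FNR { src[$2] = (NF >= 3 ? $3 : "NA"); next }
    { print $1, $2, (($2 in src) ? src[$2] : "NA") }
' biosample_source.tsv assembly_biosample.tsv > assembly_source.tsv

cat assembly_source.tsv
```

Both tables are matched on the BioSample accession in column 2, so every assembly in your table gets exactly one output row even when no isolation source was found.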


Hi vkkodali! Could you please post a tutorial on how to annotate a bacterial assembly using NCBI eutils? If possible, covering both online and offline annotation. This would help many visitors here.


Here is one solution I have just found:

esearch -db assembly -query GCF_000647595.2 | elink -related -cmd neighbor -name assembly_biosample | xml2 | grep "/eLinkResult/LinkSet/LinkSetDb/Link/Id=" | awk 'BEGIN{FS="="} {print $2}'


You just need to install xml2 for this to work.