Retrieve reference genome for a taxonomic id from NCBI automaticaly
2
2
Entering edit mode
4.8 years ago
john ▴ 110

First consider the naive non-automatic approach to this question.

Lets use Escherichia coli str. K-12 substr. MG1655 as an example it has the taxonomic ID 511145. In the gnome DB of NCBI I can find this entry:

https://www.ncbi.nlm.nih.gov/genome/?term=txid511145[Organism:exp]

The first line says -> Reference genome: Escherichia coli str. K-12 substr. MG1655. Which links to:

https://www.ncbi.nlm.nih.gov/genome/167?genome_assembly_id=161521

Which is what I want. I can no use the RefSeq Accession to get the genome.

Now lets try this automatically:

esearch -db genome -query "txid511145[orgn]" | esummary


The result is the following:

<DocumentSummary><Id>167</Id>
<Organism_Name>Escherichia coli</Organism_Name>
<Organism_Kingdom>Bacteria</Organism_Kingdom>
<Organism_Group></Organism_Group>
<Organism_Subgroup>Gammaproteobacteria</Organism_Subgroup>
<Defline>A well-studied enteric bacterium</Defline>
<ProjectID>12319</ProjectID>
<Project_Accession>PRJNA12319</Project_Accession>
<Status>Complete</Status>
<Number_of_Chromosomes>3</Number_of_Chromosomes>
<Number_of_Plasmids>10</Number_of_Plasmids>
<Number_of_Organelles>0</Number_of_Organelles>
<Assembly_Name>ASM2632v1</Assembly_Name>
<Assembly_Accession>GCA_000026325.2</Assembly_Accession>
<AssemblyID>1181661</AssemblyID>
<Create_Date>1998/10/13 00:00</Create_Date>
<Options></Options>
<Weight>900</Weight>
<Chromosome_assemblies>484</Chromosome_assemblies>
<Scaffold_assemblies>9772</Scaffold_assemblies>
<SRA_genomes>0</SRA_genomes>
<TaxId>562</TaxId>
</DocumentSummary>


The first problem is that this querry brings me to species level (see TaxId 562) and not to the far more specific level of my query (strain).

The specific question in this case would be how can you link the TaxId 511145 to the genome_assembly_id 161521 using an esearch querry. Obvious the connection is there as I can make the link on the website.

NCBI esearch • 3.0k views
0
Entering edit mode

Are you trying to link E. coli taxonomy ID txid511145 to assembly ID 161521 which is a Mycobacterium tuberculosis assembly?

0
Entering edit mode

Hm not sure whats wrong here but consider my second url

https://www.ncbi.nlm.nih.gov/genome/167?genome_assembly_id=161521

This links to Ecoli. Maybe NCBI has different assembly ids for different DBs?

EDIT: You mean the ID in the NCBI database assembly. I mean the assembly id in the genome database.

0
Entering edit mode

While esearch -db genome -query "167" | esummary is linking to the right genome as you say the first line on the Ecoli page is linking to genome_assembly_id you provide a link for. Since there isn't a database for assemblies I wonder if you can't replicate this search via eutils (https://www.ncbi.nlm.nih.gov/genome/167?genome_assembly_id=161521 ) .

0
Entering edit mode
1. The query you provide (as my own) does not link to the right genome. It basicly links to all genomes belonging to the specie of ecoli. I want the best reference genome. In this case the genome of the strain K12. If there is no reference genome for a specific strain, than the best genome would be indeed the reference genome of the specie.
2. Thats the question how do I replicate this search with eutils.
1
Entering edit mode

It looks like there is an assembly database that can be queried. With this query you can still get multiple genomes that equate to E coli of your choice. You probably need to pick one. I don't think there is a best.

esearch -db assembly -query "Escherichia coli  str. K-12 substr. MG1655[ORGN]" | efetch -format docsum | xtract -pattern DocumentSummary -element Id -element Organism


As of today it generates:

1608811 Escherichia coli str. K-12 substr. MG1655 (E. coli)
1492261 Escherichia coli str. K-12 substr. MG1655 (E. coli)
644471  Escherichia coli str. K-12 substr. MG1655 (E. coli)
633811  Escherichia coli str. K-12 substr. MG1655 (E. coli)
630658  Escherichia coli str. K-12 substr. MG1655 (E. coli)
563298  Escherichia coli str. K-12 substr. MG1655 (E. coli)
547058  Escherichia coli str. K-12 substr. MG1655 (E. coli)
542158  Escherichia coli str. K-12 substr. MG1655star (E. coli)
499361  Escherichia coli str. K-12 substr. MG1655 (E. coli)
388638  Escherichia coli str. K-12 substr. MG1655 (E. coli)
247231  synthetic Escherichia coli C321.deltaA (E. coli)
247221  synthetic Escherichia coli C321.deltaA (E. coli)
234141  Escherichia coli str. K-12 substr. MG1655 (E. coli)
91231   Escherichia coli str. K-12 substr. MG1655 (E. coli)
79781   Escherichia coli str. K-12 substr. MG1655 (E. coli)
71571   Escherichia coli str. K-12 substr. MG1655 (E. coli)
67201   synthetic Escherichia coli C321.deltaA (E. coli)
6498    Escherichia coli str. K-12 substr. MG1655 (E. coli)

0
Entering edit mode

Uh NCBI is a mess. "Best" is the wrong word, but there is an "official" one. At least one genome is linked with the taxid by the way I described. But this is also probably a little arbitrary and nobody is really checking which assembly should be the representative for the specie. With the way you showed I could simply always take the first entry (which is the newest).

But testing this with other bacteria this query gives not really satisfying results. Example 525338 aka Lactobacillus plantarum subsp. plantarum ATCC. Your query returns three entries for which non links to the entry I would find in the genome db by hand. Which is a far better assembly. The one I get from the assembly db with your query is still in multiple scaffolds.

https://www.ncbi.nlm.nih.gov/nuccore/AZEJ00000000.1/

vs

https://www.ncbi.nlm.nih.gov/nuccore/NC_004567.2

So it seems like I would prefer the genome db instead of the assembly db.

1
Entering edit mode

I wouldn't say that NCBI is a mess, but their reasons and criteria for selecting a "reference genome" for each "species" may not be in line with your reasons for wanting a reference genome, and the concept of "species" varies a lot between different types of organisms. Bacteria and viruses don't usually have "species boundaries" that are as easy to define as they are for some eukaryotes such as vertebrates or plants.

Even within the bacteria, the idea of what we name as one "species" in the enterobacteria such as Escherichia coli, is far different than in other types of bacteria such as the Clostridia. There are many Clostridia that got the name "Clostridium botulinum" despite the fact that they are far more diverse than the distance between Escherichia, Salmonella and Shigella.

The NCBI "reference genomes" are duplicates of author-provided genomes. My understanding is that they created these duplicates with NC_nnn accession numbers, so that they (NCBI) could control the annotation whereas only the authors of every other record are in charge of the annotations.

0
Entering edit mode

Probably there is a reason that NCBI is what it is and much like in biology just makes sense in the light of history/ evolution.

But still NCBI made the effort to provide the public for every txid of a species or strain (and I don't want to go into the discussion what these two terms mean. For me it is just important that these things simply exist, hence can be grown in a petri dish and are not just a node in a tree) with a reference genome. And thats the only thing I want. I know how to do tha by hand but cant figure out a way how to do it automatically.

If there would be a workaround with the assembly db I would use that but it lacks what the genome db offers, a preselection for a "good" genome. In the case of Lacotbacillus ATCC there is now "good" assembly hence they linked a "better" one from close strain/ specie.

0
Entering edit mode

I don't think we are going to solve this dilemma here. There will always be more than one genome for a TaxID. Much as we want things to work automatically NCBI has finite resources and may not have the expertise to decide the "best" representative.

It is easy to gripe about NCBI but remember much of this information may be locked behind paywalls if NCBI did not exist.

0
Entering edit mode

But they do it! They provide a reference genome per taxid! I showed that in the question!

You simply take the URL

https://www.ncbi.nlm.nih.gov/genome/?term=txidFOOBAR[Organism:exp]

Than go one the right side to genome. Hit the link in the first line:

Reference genome: FOOBAR

And voila you are there. And the question is how do I do that automatically? I mean I can write a scrip that greps the right link form the HTMLs. But that is ridiculous, there should be a better way.

0
Entering edit mode

See if you find @5heikki's solution usable. You can always send a ticket in to NCBI's help desk asking the question above. There is always knowledge that may not have been documented anywhere that the owners know of.

3
Entering edit mode
4.8 years ago
5heikki 11k
#!/bin/bash
TAXID="$1" if [[ -z "$TAXID" ]]; then
echo "Usage: retrieveReference.sh NcbiTaxId"
exit
else
if [[ ! -e assembly_summary_genbank.txt ]]; then
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
fi
awk -v TAXID="$TAXID" \ 'BEGIN{FS="\t"}{if($5=="reference genome" && $6==TAXID){print$20}}' \
assembly_summary_genbank.txt \
| awk 'BEGIN{OFS=FS="/"}{print $0,$NF"_genomic.fna.gz"}' \
| xargs -n1 wget
fi

0
Entering edit mode

Hey nice job. Almost perfect, the script just fails in a few cases when there is no good assembly so NCBI did not give it the status as a official reference genome. In this case they seem to provide the last assembly in the genome db. So I extend your script to do that.

#!/bin/bash
TAXID="$1" if [[ -z "$TAXID" ]]; then
echo "Usage: retrieveReference.sh NcbiTaxId"
exit
else
if [[ ! -e assembly_summary_genbank.txt ]]; then
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
fi
reference=$(awk -v TAXID="$TAXID" \
'BEGIN{FS="\t"}{if($5=="reference genome" &&$6==TAXID){print $20}}' \ assembly_summary_genbank.txt) if [ -z "$reference" ]; then
reference=$(awk -v TAXID="$TAXID" \
'BEGIN{FS="\t"}{if($6==TAXID){print$20; exit}}' \
assembly_summary_genbank.txt)
fi
echo $reference | awk 'BEGIN{OFS=FS="/"}{print$0,$NF"_genomic.fna.gz"}' | xargs -n1 wget fi  EDIT: Such cases like Taxid: 411490, 1121114 ADD REPLY 0 Entering edit mode In those two cases the first match has "representative genome" in the fifth column. I don't know if NCBI has ordered the file that way. Perhaps it would be wiser to filter for that if "reference genome" fails. Then if you still get empty query you can take the first match or alternatively look into the assembly_level field (#12), e.g. presumably "Complete Genome" > "Chromosome" > "Scaffold" > "Contig" ADD REPLY 0 Entering edit mode I think I will simply try reference genome and representative genome. Like so: #!/bin/bash TAXID="$1"
if [[ -z "$TAXID" ]]; then echo "Usage: retrieveReference.sh NcbiTaxId" exit else if [[ ! -e assembly_summary_genbank.txt ]]; then wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt fi awk -v TAXID="$TAXID" \
'BEGIN{FS="\t"}{if(($5=="reference genome" ||$5=="representative genome") && $6==TAXID){print$20}}' \
assembly_summary_genbank.txt |\
awk 'BEGIN{OFS=FS="/"}{print $0,$NF"_genomic.fna.gz"}' | xargs -n1 wget
fi

0
Entering edit mode

But then you might download representative over reference or vice versa randomly (depending how the file is sorted). Of course that might not matter. You could sort the assembly summary file first based on that column. At least then it would be consistent for sure..

0
Entering edit mode

Not if you checked the sixth column for txid there it is either "reference genome" or "representative genome". At least in the current version. I think it will be different if you use the seventh column (txid of the specie). I think that is why they have the term "reference genome" and "representative genome". One is the reference for all strain of a specie and the other is the reference for the strain. But I think I have to still modify it because it is possible that it still can return an empty string. But I will do this next week. We have holidays now in Germany. Happy Eastern!

                  __                   __
| _'-._           _.-'_ |
| '::. '.       .' .::' |
\ '::\  \     /  /::' /
\  ':\  |   |  /:'  /
'._   '---'   _.'
) __     __ (
/ /  \   /  \ \
/  \_0/   \0_/  \
=/       .-.       \=
=| .'     \_/     '. |=
_                =\ '      |      ' /=
\-._              '.__ --'-- __.'
/-_^-'-._         /   \_______/   \
>-"=_=_-~_=.,==,=|     /==,==,=\     |.,_
\~- >_<_"-~-/  /  /\,,,/  /  /   \,,,//  /"=._
<_"- ~-_>-";  ;  _; _ ;  ;  ;  ;  ; _; _:   /  /;=,__,
/=~_->"-_~-;  |_/ \| \|  |  |  |  |/ |/ \_ ;  :  ; _,='
>-"<^-^">_";  / \()()|;  ;  ;  ;  ;|()()/ \ \ _;="
\-_~-_>~-_.=\ \() _  | \  \  \  \  |    ()/="
<,jgs_.-   =\ / \|=========|/ \ /
/_.-'          \\  ||             ||  //
\'-'/             \'-'/
"               "

0
Entering edit mode
4.8 years ago
btf ▴ 10

The NCBI RefSeq Database:

Might be what you want. But searching for https://www.ncbi.nlm.nih.gov/nuccore/?term=Escherichia+coli+str.+K-12+substr.+MG1655++AND+srcdb_refseq%5BPROP%5D pulls up 106 highly similar but unique RefSeq entries for this one strain (Escherichia coli str. K-12 substr. MG1655). Only a few of them are complete genomes.

Searching on just "Escherichia coli" in RefSeq pulls up 2,36 million records of which only a few are complete genomes.

0
Entering edit mode

At first glance, a complete genome such as: LOCUS NZ_CP026028 4639673 bp DNA circular CON 27-JAN-2018 DEFINITION Escherichia coli strain CIT chromosome, complete genome. does not seem to be K-12 substr. MG1655, but down in the features table there is a note stating that the genome is from this substrain: /note="Escherichia coli str. K-12 substr. MG1655 isolated after growth in presence of sub-inhibitory concentrations of citral"

I am not really an expert on this, but I know the long term evolution of E. coli project http://myxo.css.msu.edu/ecoli/ has studied changes in one clone of E. coli over more than 50,000 generations and they provide comparisons to the MG1655 genome.

I find only one complete genome linked to it: https://www.ncbi.nlm.nih.gov/nuccore/NC_002695.1 but this is E. coli strain Sakai and not K-12 substrain 1655

So I have no idea what is going on with all of this.

0
Entering edit mode

Hm maybe I should clarify my problem a bit. I have a list of txid and I want to download for each one of them 1 genome. The list is to big to do it by hand and also might change and must be redone. Hence I want to make it automatically. I dont want to decide which genome to take if there a multiple ones. I know I might missing something in these way but so be it. NCBI by the link between Taxonomy DB and Genome DB provides such a selection. There is always one genome per txid. So preferably I would simply take this one, but I am okay with an simple hack over assembly db or nuccore db.

0
Entering edit mode

Yes, I thought that was what you were after, I am just a little confused that this "one link" from txid 511145 which is stated to be K12 substrain 1655 links to a genome that is not from that substrain at all. This seems odd.