Question: Replace Accesion Headers by full ncbi Lineage Taxnomy
0
gravatar for dllopezr
20 months ago by
dllopezr60
dllopezr60 wrote:

Hi everybody

I have several multifasta files with headers like this:

>NC_014760.1:c365195-363993 Mycoplasma bovis PG45 clone MU clone A2, complete genome
>NC_013511.1:c476217-475021 Mycoplasma hominis ATCC 23114 chromosome complete genome

I would like to replace this headers by ncbi full taxonomy:

>Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae,Bacillus,Bacillus anthracis

Can you help me with that?

This question FASTA file with taxid, want to obtain lineage and put into header is almost the same that I have, but in there the user had ncbi tax ID instead of accesion number-

Thank you

sequence gene • 544 views
ADD COMMENTlink modified 20 months ago by shenwei3565.1k • written 20 months ago by dllopezr60

Hello dllopezr,

could you please guide me where the full taxonomy is from and how it is connected to your current headers?

fin swimmer

ADD REPLYlink modified 20 months ago • written 20 months ago by finswimmer13k

The NCBI taxonomy always works in combination with the tax ID. I use the files merged.dmp and rankedlineage.dmp from here ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip and ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz.

ADD REPLYlink written 20 months ago by gb1.7k
2
gravatar for shenwei356
20 months ago by
shenwei3565.1k
China
shenwei3565.1k wrote:

Dataset and tools:

Here we go:

  1. Few lines of nucl_gb.accession2taxid.gz

    $ zcat nucl_gb.accession2taxid.gz | head -n 3 | column -t
    accession  accession.version  taxid  gi
    A00002     A00002.1           9913   2
    A00003     A00003.1           9913   3
    
  2. Retrieving the accession from fasta file

    $ cat seqs.fa 
    >NC_014760.1:c365195-363993 Mycoplasma bovis PG45 clone MU clone A2, complete genome
    actg
    >NC_013511.1:c476217-475021 Mycoplasma hominis ATCC 23114 chromosome complete genome
    ACTN
    
    $ seqkit seq -n -i --id-regexp '^(.+):' seqs.fa | tee seqs.fa.id
    NC_014760.1
    NC_013511.1
    
  3. Searching the taxids

    $ time csvtk grep -t -f accession.version -P seqs.fa.id nucl_gb.accession2taxid.gz \
        | csvtk cut -t -f accession.version,taxid | tee seqs.fa.id2taxid
    
    NC_013511.1     347256
    NC_014760.1     289397
    
    real    0m50.546s
    user    1m44.892s
    sys     0m4.160s
    
    # faster using shell grep,
    $ time zcat nucl_gb.accession2taxid.gz \
        | grep -w  -f seqs.fa.id | cut -f 2,3 | tee seqs.fa.id2taxid
    NC_013511.1     347256
    NC_014760.1     289397
    
    real    0m19.034s
    user    0m23.868s
    sys     0m1.384s
    
  4. Computing lineage

    $ time cat seqs.fa.id2taxid \
        | taxonkit lineage -i 2 \
        | sed 1d | cut -f 1,3 | tee seqs.fa.id2taxid2lineage
    
    NC_013511.1     cellular organisms;Bacteria;Terrabacteria group;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma hominis;Mycoplasma hominis ATCC 23114
    NC_014760.1     cellular organisms;Bacteria;Terrabacteria group;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovis;Mycoplasma bovis PG45
    
    real    0m3.200s
    user    0m5.392s
    sys     0m0.224s
    
  5. Replacing fasta header

    $ seqkit replace -k seqs.fa.id2taxid2lineage -p '^(.+):.+' -r '{kv}' seqs.fa \
        | seqkit replace -p ';' -r ','
    
    >cellular organisms,Bacteria,Terrabacteria group,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma bovis,Mycoplasma bovis PG45
    actg
    >cellular organisms,Bacteria,Terrabacteria group,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma hominis,Mycoplasma hominis ATCC 23114
    ACTN
    

You want the lineage in format of "kingdom, phylum ... species" instead of full lineage? Fine:

$ time cat seqs.fa.id2taxid \
    | taxonkit lineage -i 2 \
    | taxonkit reformat -i 3 
    | sed 1d | cut -f 1,4 | tee seqs.fa.id2taxid2lineage2

NC_013511.1     Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma hominis
NC_014760.1     Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovis

real    0m5.183s
user    0m13.920s
sys     0m0.764s

$ seqkit replace -k seqs.fa.id2taxid2lineage2 -p '^(.+):.+' -r '{kv}' seqs.fa \
    | seqkit replace -p ';' -r ','

>Bacteria,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma bovis
actg
>Bacteria,Tenericutes,Mollicutes,Mycoplasmatales,Mycoplasmataceae,Mycoplasma,Mycoplasma hominis
ACTN

It's exact what you want.

ADD COMMENTlink modified 20 months ago • written 20 months ago by shenwei3565.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2283 users visited in the last hour