Question

Fasta Taxonomy Annotation

3.8 years ago

zach ▴ 10

Hi everyone!

I am looking to taxonomically annotate a fasta sequence file and receive a fasta output with annotation. The original pacbio_otu.fasta has the id lines:

> consensus=Uniq2;size=24;seqs=2
GTTACCTTGTTACGACTTCACCCCAATCATCTATCCCACCTTAGGCGGCTGGCTCCAAAAGGTTACCTCACCGACTTCGG

To annotate pacbio_otu.fasta, the taxonomy database rdp_16s_v16_sp.fa has the id lines:

> EF599163_S000871589;tax=d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:"Vibrionales",f:Vibrionaceae
GTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGAAACGACACTAACAATCCTTC

If possible, I would like to have taxonomy annotation (from rdp_16s_v16_sp.fa) on my pacbio_otu.fasta file to build my own taxonomy database in fasta format with the id lines like:

> consensus=Uniq2;size=24;seqs=2;tax=d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:"Vibrionales",f:Vibrionaceae
GTTACCTTGTTACGACTTCACCCCAATCATCTATCCCACCTTAGGCGGCTGGCTCCAAAAGGTTACCTCACCGACTTCGG

Eventually, with this taxonomy database in fasta format, I would like to run usearch 'sintax' with other fasta data against it.

For my situation, are there any ways or scripts to produce my own taxonomy database in fasta format?

Many thanks, Zach

fasta annotation taxonomy • 1.8k views

ADD COMMENT • link updated 3.7 years ago by h.mon 35k • written 3.8 years ago by zach ▴ 10

Hi Zach,

In a fasta file, each record has one header line, which starts with the > sign, followed by the sequence (DNA, RNA, or protein), such as:

>OTU_1
ATCGATGCTAGCTACGATCGATCAGCTAGCTGATCGATCGATGCATCGATC

Therefore, the two-header file that you're requesting is not in fasta format, because you would have: 1st line, header; 2nd line, taxonomy; 3rd line, sequence. Even if you create that format, usearch will probably complain and throw errors saying that your data is not in fasta format.

You have two options here: (1) stick with the file annotated like

>EF599163_S000871589;tax=d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:"Vibrionales",f:Vibrionaceae
GTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGAAACGACACTAACAATCCTTC

Or (2) keep the fasta file untouched and store the taxonomy in a two-column text file that maps each header to its taxonomy.
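If the two-column file from option (2) exists, folding it back into the headers is a small scripting job. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes the table is tab-separated (header label, then taxonomy string), that labels match the fasta headers exactly, and that the output should follow the `;tax=` pattern shown in the question. The function names `load_taxonomy` and `annotate_fasta` are made up for illustration.

```python
import io


def load_taxonomy(handle):
    """Read 'label<TAB>taxonomy' lines into a dict (assumed table layout)."""
    mapping = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        label, taxonomy = line.split("\t", 1)
        mapping[label.strip()] = taxonomy.strip()
    return mapping


def annotate_fasta(fasta_in, taxonomy, fasta_out):
    """Copy a fasta stream, appending ';tax=...' to headers found in the table."""
    for line in fasta_in:
        if line.startswith(">"):
            # Drop '>' and surrounding whitespace so '> label' and '>label' match.
            label = line[1:].strip()
            tax = taxonomy.get(label)
            if tax:
                line = f">{label.rstrip(';')};tax={tax}\n"
            else:
                # Headers without a table entry are written without annotation.
                line = f">{label}\n"
        fasta_out.write(line)


# Small in-memory demonstration using the header from the question.
table = io.StringIO('consensus=Uniq2;size=24;seqs=2\td:Bacteria,p:"Proteobacteria"\n')
fasta = io.StringIO("> consensus=Uniq2;size=24;seqs=2\nGTTACCTTGTTACGACTTCACC\n")
result = io.StringIO()
annotate_fasta(fasta, load_taxonomy(table), result)
print(result.getvalue(), end="")
```

For real files, the `io.StringIO` objects would be replaced by `open(...)` handles. Sequences whose label is missing from the table pass through unannotated, so it is worth checking how many headers actually gained a `tax=` field.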

I hope this helps,

António

ADD REPLY • link 3.8 years ago by antonioggsousa 3.2k

Do you know what EF599163_S000871589 means or where it comes from?

My guess is that you should have a file from usearch matching EF599163_S000871589 with Uniq2, but I'm not sure. I haven't used usearch in a long time.

António

ADD REPLY • link 3.8 years ago by antonioggsousa 3.2k

Hi Antonio,

Thanks for your response. I have previously done usearch-sintax with other fasta files on rdp_16s_v16_sp.fa as a database, without any problems.

What I want to do is annotate a PacBio fasta file of mine (pacbio_otu.fasta) to get a new taxonomy-annotated fasta file with lines like this:

> consensus=Uniq2;size=24;seqs=2;tax=d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:"Vibrionales",f:Vibrionaceae
GTTACCTTGTTACGACTTCACCCCAATCATCTATCCCACCTTAGGCGGCTGGCTCCAAAAGGTTACCTCACCGACTTCGG

I do not have an annotated fasta file like the above, and that is what I am trying to produce.

'EF599163_S000871589' should represent a particular OTU. The RDP taxonomy database (rdp_16s_v16_sp.fa) was obtained from https://drive5.com/usearch/manual/sintax_downloads.html

Cheers!

ADD REPLY • link 3.8 years ago by zach ▴ 10

score 0 · Answer 1 · 2020-08-06

For my situation, are there any ways or scripts to produce my own taxonomy database in fasta format?

Although you can use any taxonomic classification pipeline to "annotate" your own fasta file (I would use DADA2+phyloseq for this, but many other combinations of tools would do the job), your annotated fasta will be no better than the original RDP database you used to classify your sequences, and it may even introduce wrong classifications. So my advice is to just use the RDP fasta, or any other curated database you deem appropriate.
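If you go the classification route anyway, the classifier output usually has to be reduced to a header/taxonomy table before it can be merged into the fasta headers. The Python sketch below shows one way to do that for usearch sintax, since that tool is already in use in this thread. It assumes the `-tabbedout` file has four tab-separated columns (query label, prediction with confidence values, strand, prediction after the cutoff); please check this against your own output. The function name `sintax_to_mapping` is hypothetical.

```python
import io


def sintax_to_mapping(tabbed_in, mapping_out):
    """Write 'label<TAB>taxonomy' for every query with a prediction above cutoff.

    Assumed column layout (verify on your own file):
    0 = query label, 1 = prediction with confidences, 2 = strand,
    3 = prediction after the confidence cutoff.
    """
    written = 0
    for line in tabbed_in:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or not fields[3]:
            continue  # nothing passed the cutoff, so leave this query out
        mapping_out.write(f"{fields[0]}\t{fields[3]}\n")
        written += 1
    return written


# In-memory demonstration with made-up confidence values.
tabbed = io.StringIO(
    "consensus=Uniq2;size=24;seqs=2\td:Bacteria(1.00),p:Proteobacteria(0.98)\t+\td:Bacteria,p:Proteobacteria\n"
    "consensus=Uniq3;size=5;seqs=1\td:Bacteria(0.30)\t+\t\n"
)
mapping = io.StringIO()
count = sintax_to_mapping(tabbed, mapping)
print(count)
print(mapping.getvalue(), end="")
```

Queries with no prediction above the cutoff are dropped here rather than given an empty annotation. The caveat in the answer still applies: every label in the resulting table is only as reliable as the RDP-based prediction it came from.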