Question: Add taxonomic info to FASTA headers from csv file
gravatar for mollysil
21 months ago by
mollysil0 wrote:

I want to change my FASTA headers (in file named VT.fasta) to reflect taxonomic information. Right now they have accession information only, like this:





I want the headers to include taxonomic information (Genus species) as well. This information is stored in a separate comma delineated csv file (VTtaxonomy.csv) like this:

GenBank code,VT,DNA,Class,Order,Family,Genus,Species



The fasta file and csv file are in the exact same order and there are 352 total header IDs that need to have the taxonomic information added on. I want the new fasta file (NewVT.fasta) to look like this:

gb|AJ854100_VTX00001| Paraglomus sp.


gb|AF202299_VTX00002| Paraglomus sp.


I’m using Linux. This is probably a simple bash command (I hope!) but I just can’t figure it out. Thank you for your help!

ADD COMMENTlink modified 21 months ago by Bastien Hervé4.5k • written 21 months ago by mollysil0

Any Perl solutions perhaps?

ADD REPLYlink written 21 months ago by mollysil0
gravatar for Bastien Hervé
21 months ago by
Bastien Hervé4.5k
Limoges, CBRS, France
Bastien Hervé4.5k wrote:

Python solution :

from Bio import SeqIO
import csv


with open('VTtaxonomy.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        genus_species_array.append(row[6]+" "+row[7])


for record in SeqIO.parse('VT.fasta', 'fasta'): = +"| "+ genus_species_array[count]
    record.description = ""
    with open("NewVT.fasta", "a") as output_handle:
        SeqIO.write(record, output_handle, 'fasta')
ADD COMMENTlink modified 21 months ago • written 21 months ago by Bastien Hervé4.5k

Thank you for such a detailed response. I assume there is a way to do this with a simple perl or python script? Or do you think I could use the join command maybe (can this be used with a FASTA and CSV file, as opposed to a TXT file)? I can't get access to Biopython on the server I'm using to run my data.

ADD REPLYlink written 20 months ago by mollysil0

I don't really get whats you really need, a perl script or the same python script without Biopython ?

In the first way, I'm not good enought in perl to help you. If you know how to Perl, I assume you can do the translation of the script above.

In the other way, you can try to modify the second part of my script and not using Biopython. You need to open the VT.fasta.

1) Loop over lines

2) foreach line starting with ">" you keep the name of the sequence

3) Then, according your fasta architecture (fasta is a wild animal : A: Is There A Precise Specification For Fasta Files? ) (sequence in one single line, sequence splited per 60 bases, 70 bases, even 80) you try to retrieve the complete sequence.

4) Then, same technique : you modify the name of the sequence, like I did here : +"| "+ genus_species_array[count]

5) You save name of your sequence and sequence in a new file.

6) Thereafter, go to step 2, etc...

There is maybe a way to do it in shell, good luck with that because the command will be huge and not easy to understand. I think that a script is way more understandable, but it's all about your convenience.

What do you mean by "as opposed to a TXT file" ?

ADD REPLYlink modified 20 months ago • written 20 months ago by Bastien Hervé4.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2389 users visited in the last hour