Question

Add taxonomic info to FASTA headers from csv file

0

Entering edit mode

6.2 years ago

mollysil ▴ 40

I want to change my FASTA headers (in file named VT.fasta) to reflect taxonomic information. Right now they have accession information only, like this:

gb|AJ854100_VTX00001

AGCAGCCGCGGTAATTCCAGCTCCAATAGCGTAGGCGAGCGACTGCCG

gb|AF202299_VTX00002

GTAACGGGGAATTAGGGTTCCAATCCCGACACGGGGAGGTAGTGACAAT

I want the headers to include taxonomic information (Genus species) as well. This information is stored in a separate comma delineated csv file (VTtaxonomy.csv) like this:

GenBank code,VT,DNA,Class,Order,Family,Genus,Species

AJ854100,VTX00001,AGCAGCCGCGGTAATTCCAGCTCCAATAGCGTAGGCGAGCGACTGCCG,Paraglomeromycetes,Paraglomerales,Paraglomeraceae,Paraglomus,sp.

AF202299,VTX00002,GTAACGGGGAATTAGGGTTCCAATCCCGACACGGGGAGGTAGTGACAAT,Paraglomeromycetes,Paraglomerales,Paraglomeraceae,Paraglomus,sp.

The fasta file and csv file are in the exact same order and there are 352 total header IDs that need to have the taxonomic information added on. I want the new fasta file (NewVT.fasta) to look like this:

gb|AJ854100_VTX00001| Paraglomus sp.

AGCAGCCGCGGTAATTCCAGCTCCAATAGCGTAGGCGAGCGACTGCCG

gb|AF202299_VTX00002| Paraglomus sp.

GTAACGGGGAATTAGGGTTCCAATCCCGACACGGGGAGGTAGTGACAAT

I’m using Linux. This is probably a simple bash command (I hope!) but I just can’t figure it out. Thank you for your help!

FASTA headers taxonomic information sequence • 2.6k views

ADD COMMENT • link updated 6.2 years ago by Bastien Hervé 5.3k • written 6.2 years ago by mollysil ▴ 40

0

Entering edit mode

Any Perl solutions perhaps?

ADD REPLY • link 6.2 years ago by mollysil ▴ 40

score 1 · Answer 1 · 2018-02-28

1

Entering edit mode

6.2 years ago

Bastien Hervé 5.3k

Python solution :

from Bio import SeqIO
import csv

genus_species_array=[]

with open('VTtaxonomy.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    next(spamreader)
    for row in spamreader:
        genus_species_array.append(row[6]+" "+row[7])

count=0

for record in SeqIO.parse('VT.fasta', 'fasta'):
    record.id = record.id +"| "+ genus_species_array[count]
    record.description = ""
    with open("NewVT.fasta", "a") as output_handle:
        SeqIO.write(record, output_handle, 'fasta')
    count+=1

ADD COMMENT • link 6.2 years ago by Bastien Hervé 5.3k

0

Entering edit mode

Thank you for such a detailed response. I assume there is a way to do this with a simple perl or python script? Or do you think I could use the join command maybe (can this be used with a FASTA and CSV file, as opposed to a TXT file)? I can't get access to Biopython on the server I'm using to run my data.

ADD REPLY • link 6.1 years ago by mollysil ▴ 40

0

Entering edit mode

I don't really get whats you really need, a perl script or the same python script without Biopython ?

In the first way, I'm not good enought in perl to help you. If you know how to Perl, I assume you can do the translation of the script above.

In the other way, you can try to modify the second part of my script and not using Biopython. You need to open the VT.fasta.

1) Loop over lines

2) foreach line starting with ">" you keep the name of the sequence

3) Then, according your fasta architecture (fasta is a wild animal : A: Is There A Precise Specification For Fasta Files? ) (sequence in one single line, sequence splited per 60 bases, 70 bases, even 80) you try to retrieve the complete sequence.

4) Then, same technique : you modify the name of the sequence, like I did here :

record.id +"| "+ genus_species_array[count]

5) You save name of your sequence and sequence in a new file.

6) Thereafter, go to step 2, etc...

There is maybe a way to do it in shell, good luck with that because the command will be huge and not easy to understand. I think that a script is way more understandable, but it's all about your convenience.

What do you mean by "as opposed to a TXT file" ?

ADD REPLY • link 6.1 years ago by Bastien Hervé 5.3k