Question: How to rename headers in fasta files keeping some fractions and adding a name?
2.4 years ago by
mirza80
India
mirza80 wrote:

Hi,

I have several FASTA files. I want to keep part of each header and add a name to simplify the downstream analysis. The IDs in the files are not consecutive, so simply renumbering them in series with awk won't help. Some of my FASTA headers look like this (AUGUSTUS output file):

>g1134t1 geneg1134

I want to keep the header and just add the species_genus name after >

or better like this

>Species_genus gene1134

Similarly, for file with headers like this,

>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]

I want to keep >AG1IA_00006

P.S. My OS is Ubuntu 16.04.

p.p.s. I couldn't find a suitable command in the other similar posts and I also asked there but couldn't get any help. It's a bit urgent.

Thanks in advance.

fasta files renaming headers • 1.7k views
modified 2.4 years ago • written 2.4 years ago by mirza80

On Ubuntu you can use the sed command to remove everything from the first space to the end of each line.

sed -e 's/ .*//g' test.fa
written 2.4 years ago by Sej Modha4.2k
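A minimal sketch of that command on a scratch file, assuming you want to restrict the substitution to header lines. The `/^>/` address is an addition to the command above, and the sequences and the second record are made up for illustration; only the header format comes from the question.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Made-up records for illustration; only the header format comes from the question.
printf '%s\n' \
  '>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]' \
  'MKTAYIAKQR' \
  '>AG1IA_00007 contig1:5000:6200:- [translate_table: standard]' \
  'MSLLTEVETY' > "$workdir/test.fa"

# On header lines only (lines starting with '>'), delete everything
# from the first space to the end of the line.
trimmed=$(sed -e '/^>/ s/ .*//' "$workdir/test.fa")
printf '%s\n' "$trimmed"
```

Sequence lines normally contain no spaces, so the unrestricted command gives the same result on a clean file; the address only guards against stray spaces in the sequence.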

For future reference you can use this book to learn basic Unix and Perl.

http://korflab.ucdavis.edu/Unix_and_Perl/current.pdf

written 2.4 years ago by Sej Modha4.2k

@Sej

Thank you very much for the document and the answer. Let me try the command.

written 2.4 years ago by mirza80

I want to keep the header and just add the species_genus name after > or better like this Species_genus gene1134

That is not necessarily a good idea: a lot of tools need a unique sequence identifier. By the way, where do you get the species name from?

written 2.4 years ago by Michael Dondrup46k

Well, we sequenced and assembled a few genomes, so for ease of identification I want to add the respective species_genus name. Right now I want to name the sequences this way for OrthoFinder and related analyses. It will be easier to visualize the orthologs/paralogs. I am keeping the original files for other analyses and tools.

written 2.4 years ago by mirza80

I would try something like sed -e 's/>/>species_name_/g'. The > is not supposed to occur anywhere else in a FASTA file, so that way you get both the species name and a unique ID.

>blah blubb
>species_name_blah blubb
modified 2.4 years ago • written 2.4 years ago by Michael Dondrup46k
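A small sketch of the prefix approach on one of the AUGUSTUS-style headers from the question. The label `Species_genus`, the file name `augustus.fa`, and the sequence are placeholders, and the `^` anchor is an assumption added here so that only a `>` at the start of a line is touched.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# One header in the format shown in the question; the sequence is made up.
printf '%s\n' '>g1134t1 geneg1134' 'MSLLTEVETY' > "$workdir/augustus.fa"

# Insert the placeholder label right after the leading '>'.
# The original ID (g1134t1) is kept, so the new ID stays unique.
prefixed=$(sed -e 's/^>/>Species_genus_/' "$workdir/augustus.fa")
printf '%s\n' "$prefixed"
```

If the real label contains `/` or `&`, those characters would need escaping in the replacement, since both are special to sed's `s` command.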

Thanks Michael. I'll try tomorrow and let you know.

written 2.4 years ago by mirza80

@Michael

Hi, it did work, thank you. But what if I want to keep only one of the two terms here? For

g1134t1 geneg1134, I want to keep

Species_genus g1134

and for

AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]

I just want to keep >AG1IA_00006

I did search for sed, including in the PDF that Sej sent above, but I could only find that the ‘s’ part of the sed command puts sed in ‘substitute’ mode, where you specify one pattern (between the first two forward slashes) to be replaced by another pattern (specified between the second set of forward slashes). I couldn't find an option to delete some parts selectively. I am a newbie and would be grateful if you could help. Thanks.

modified 2.4 years ago • written 2.4 years ago by mirza80
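One possible way to delete parts selectively with sed is a capture group: the part wrapped in `\( \)` is remembered and written back as `\1`, and everything else that the pattern matches is dropped. The sketch below assumes AUGUSTUS IDs always look like `g<digits>t<digits>`. It joins the label and the gene ID with an underscore instead of a space so that the first word of the header stays unique, which is a deliberate change from the format asked for. `Species_genus`, the file names, the sequences, and the second record are placeholders.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Made-up AUGUSTUS-style records; only the header format comes from the question.
printf '%s\n' \
  '>g1134t1 geneg1134' 'MSLLTEVETY' \
  '>g20t1 geneg20' 'MKTAYIAKQR' > "$workdir/augustus.fa"

# Capture the gene part (g + digits), drop the transcript suffix (t + digits)
# and the rest of the header, then write the placeholder label plus the capture.
renamed=$(sed -e 's/^>\(g[0-9][0-9]*\)t[0-9][0-9]*.*/>Species_genus_\1/' "$workdir/augustus.fa")
printf '%s\n' "$renamed"

# Second header style from the question: keep only the first word.
printf '%s\n' \
  '>AG1IA_00006 contig1:1338:4722:+ [translate_table: standard]' \
  'MSLLTEVETY' > "$workdir/other.fa"
first_word=$(sed -e '/^>/ s/ .*//' "$workdir/other.fa")
printf '%s\n' "$first_word"
```

Dropping the transcript suffix would give duplicate IDs if a gene has more than one transcript (for example t1 and t2), so it may be safer to keep the suffix in that case. Writing the output to a new file rather than editing in place keeps the original files intact for other tools.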
2.4 years ago by
mirza80
India
mirza80 wrote:

I am writing my answer in the hope that it will help newbies like me. I finally used the FASTA manipulation tools in Galaxy: I used the FASTA-to-Tabular function to convert my files to tabular format, opened them in Excel, made the necessary changes, and converted them back to FASTA using the Tabular-to-FASTA function.

written 2.4 years ago by mirza80