Question: Reformatting fasta headers
1
gravatar for jack1120
2.6 years ago by
jack112030
University of Minnesota
jack112030 wrote:

I need to reformat headers in a fasta file with headers such as:

>Agaricus_chiangmaiensis|JF514531|SH174817.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales;f__Agaricaceae;g__Agaricus;s__Agaricus_chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>Acarospora_laqueata|DQ842014|SH191965.07FU|refs|k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora_laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>Ceratobasidiaceae_sp|DQ493566|SH185440.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Cantharellales;f__Ceratobasidiaceae;g__unidentified;s__Ceratobasidiaceae_sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC

So that they look like:

>SH174817.07FU Agaricus chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>SH191965.07FU Acarospora laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>SH185440.07FU Ceratobasidiaceae sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC

Is there relatively simple code that can isolate these specific elements and reorder them? I think I can get the first part with something like:

grep -r -o "SH.*FU" file.fasta

But I am unsure how to isolate and reformat the genus and species names in addition to that.

This is the most-asked question on Biostars; I’d suggest you start with the search box on this site.

My answer in this thread, for example, will do what you want (with a little tweaking, and assuming your FASTA files are linear).

A: Fasta header trimming for multiple delimiters

modified 2.6 years ago • written 2.6 years ago by Joe

That's fair. I understand the frustration and apologize for the poor etiquette. I did search some general programming sites beforehand, but lazily plopped my question here looking for a quick fix after that. I'll be better!

written 2.6 years ago by jack1120

Not really a bioinformatics question, more of a programming one. Using your favorite scripting language, extract the header, split the content on the | separator and output what you need.

written 2.6 years ago by Jean-Karim Heriche
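
As a rough illustration of the "extract the header, split on |, output what you need" approach, here is a minimal Python sketch. It assumes every header follows the layout in the question (name in field 1, SH identifier in field 3); the function names and the script name `reformat_headers.py` are made up for this example, and headers with fewer than three fields are passed through unchanged.

```python
import sys


def reformat_header(line):
    """Turn '>Genus_species|ACC|SHxxx|...' into '>SHxxx Genus species'."""
    header = line.rstrip("\n")
    fields = header[1:].split("|")      # drop the leading '>' and split on '|'
    if len(fields) < 3:                 # unexpected layout: leave it untouched
        return header
    name = fields[0].replace("_", " ")  # Genus_species -> Genus species
    sh_id = fields[2]                   # third field holds the SH identifier
    return f">{sh_id} {name}"


def reformat_fasta(lines):
    """Yield lines with headers rewritten and sequence lines unchanged."""
    for line in lines:
        if line.startswith(">"):
            yield reformat_header(line) + "\n"
        else:
            yield line


# Only runs when a FASTA path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(reformat_fasta(handle))
```

It could be run as `python reformat_headers.py in.fa > out.fa`. Because only lines starting with `>` are rewritten, it does not matter whether the sequences are wrapped over several lines.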
Alex Reynolds (Seattle, WA USA) wrote 2.6 years ago:

Given in.fa:

$ more in.fa
>Agaricus_chiangmaiensis|JF514531|SH174817.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales;f__Agaricaceae;g__Agaricus;s__Agaricus_chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>Acarospora_laqueata|DQ842014|SH191965.07FU|refs|k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora_laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>Ceratobasidiaceae_sp|DQ493566|SH185440.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Cantharellales;f__Ceratobasidiaceae;g__unidentified;s__Ceratobasidiaceae_sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC

Here's one way:

$ awk '{ if ($0~/^>/) { n=split($0, a, "|"); gsub(/_/," ", a[1]); printf(">%s %s\n", a[3], substr(a[1], 2)); } else { print $0; } }' in.fa
>SH174817.07FU Agaricus chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>SH191965.07FU Acarospora laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>SH185440.07FU Ceratobasidiaceae sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC

This works perfectly. Thank you, Alex!

written 2.6 years ago by jack1120
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 2.6 years ago:

sed '/^>/s/>\([^|]*\)|[^|]*|\([^|]*\)|.*/>\2 \1/;/^>/s/_/ /g' in.fasta

Your cat walked on your keyboard?

written 2.6 years ago by WouterDeCoster
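
For anyone who finds the sed one-liner above hard to read, the following Python sketch spells out roughly the same pattern with one comment per piece. It is an approximation rather than an exact translation: the names `HEADER` and `sed_like` are invented here, and unlike the sed command it replaces underscores only in the name field, not across the whole header line.

```python
import re

# Same shape as the sed pattern: capture field 1, skip field 2,
# capture field 3, and discard everything after it.
HEADER = re.compile(
    r"""^>
        ([^|]*)\|   # group 1: first field (Genus_species)
        [^|]*\|     # second field (accession), matched but not captured
        ([^|]*)\|   # group 2: third field (SH identifier)
        .*$         # the rest of the header is dropped
    """,
    re.VERBOSE,
)


def sed_like(line):
    """Rewrite a matching header; return any other line unchanged."""
    line = line.rstrip("\n")
    match = HEADER.match(line)
    if match is None:
        return line
    # Equivalent of '>\2 \1', then underscores to spaces in the name.
    return ">{} {}".format(match.group(2), match.group(1).replace("_", " "))
```

Sequence lines never start with `>`, so they fail the match and come back as they went in.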