Question: Fasta header trimming
2.3 years ago by
leo1985.arnab10 wrote:

I have a fasta file with hundreds of sequences and their respective headers. The headers (all of them) are in the format

>ABCD [id_123] (gene_XYZ) [protein_ijk] [protein_id=qqq] [123..899]

.......sequence............

>EFGH [id_999] (gene_PQR) [protein_tre] [protein_id=trs] [573..789]
......sequence............

and so on.....

In each header, the bracketed fields are contiguous and separated by a single space (exactly as written above). All I want is to retain "ABCD" (the first field) in the header of every sequence. I want to loop through all the headers in the file and return something like this:

>ABCD   
.....sequence.....
>EFGH  
.....sequence.......

and so on. Any help is much appreciated. I am working with Bash and Perl.

Thanks and regards!

ADD COMMENTlink modified 2.3 years ago by genomax67k • written 2.3 years ago by leo1985.arnab10

With reformat.sh from the BBMap suite: reformat.sh in=your.fa out=new.fa trd=t

ADD REPLYlink written 2.3 years ago by genomax67k

I don't know why the sequence showed up next to the header when I posted this here! Of course it is a fasta file, so the sequences are directly below the headers.

ADD REPLYlink written 2.3 years ago by leo1985.arnab10

I have reformatted your post to show the correct format of fasta files.

ADD REPLYlink written 2.3 years ago by genomax67k

Okay. Thanks for that...

ADD REPLYlink written 2.3 years ago by leo1985.arnab10
2.3 years ago by
genomax67k
United States
genomax67k wrote:

In bash: cut -d ' ' -f1 your_file.fa > new_file.fa
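
This also leaves the sequences intact: `cut` passes any line that contains no delimiter through unchanged (unless `-s` is given), and sequence lines contain no spaces. A minimal sketch on a two-record sample, assuming the layout shown in the thread; the file names `sample.fa`, `trimmed.fa`, and `trimmed_sed.fa` are placeholders, and the `sed` variant is an alternative that was not part of the original answer:

```shell
# Build a small sample in the format from the question (placeholder file names).
cat > sample.fa <<'EOF'
>ABCD_gene_1 [gene=XYZ] [location=1..2231]
AGAGTGTATTGATGAGTCTCC
ATGGGAATGTGAAATGGCTGA
>PQRS_gene_2 [gene=PQR] [location=1..2618]
CGCGCGGCCCAGGAAGTGCGC
EOF

# Keep only the first space-delimited field of every line.
# Sequence lines contain no space, so cut prints them unchanged.
cut -d ' ' -f1 sample.fa > trimmed.fa

# Alternative: edit header lines only, in case a sequence line ever contains a space.
sed '/^>/ s/ .*//' sample.fa > trimmed_sed.fa

cat trimmed.fa
```

On a well-formed file the two commands should produce identical output; the `sed` form is just more defensive about stray spaces in sequence lines.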

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by genomax67k

Worked well.....thanks!

ADD REPLYlink written 2.3 years ago by leo1985.arnab10
2.3 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

If you want a fast and generic/flexible approach to parsing FASTA, use awk:

$ awk 'BEGIN{RS=">";}NR>1{ split($1,a," "); print ">"a[0]"\n"$2; }' in.fasta > out.fasta
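
As written, this one-liner would not give the intended output: awk's `split` fills arrays starting at index 1, so `a[0]` is empty, and with the default field separator `$2` is the second word of the header rather than the sequence. A sketch of one possible variant, which treats each line of a record as a field; the sample data and the file names `in.fasta` and `out.fasta` are placeholders, and this is not necessarily what the author intended:

```shell
# Small sample in the format posted in the thread (placeholder file names).
cat > in.fasta <<'EOF'
>ABCD_gene_1 [gene=XYZ] [location=1..2231]
AGAGTGTATTGATGAGTCTCC
ATGGGAATGTGAAATGGCTGA

>PQRS_gene_2 [gene=PQR] [location=1..2618]
CGCGCGGCCCAGGAAGTGCGC
EOF

# RS=">" makes each FASTA entry one record; FS="\n" makes each line one field.
# NR>1 skips the empty record that precedes the first ">".
awk 'BEGIN { RS = ">"; FS = "\n" }
NR > 1 {
    split($1, a, " ")           # a[1] is the first word of the header
    print ">" a[1]
    for (i = 2; i <= NF; i++)   # remaining fields are the sequence lines
        if ($i != "") print $i  # skip blank lines and the trailing empty field
}' in.fasta > out.fasta

cat out.fasta
```

Because whole records are parsed, the same skeleton can be extended to filter or rename entries, which is where awk is more flexible than `cut`.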
ADD COMMENTlink written 2.3 years ago by Alex Reynolds28k

Thanks for all your replies! However, I want to retain the actual sequences along with the shortened headers. In fact, I was able to shorten the header to the first word before posting here; the problem was that I could not figure out how to shorten the header and keep the sequences intact. In short, I want to keep the entire sequence intact for every entry and only shorten the header.

Meaning :

ABCD [id_123] (gene_XYZ) [protein_ijk] [protein_id=qqq] [123..899] .......sequence............

will now look like

ABCD .....sequence.....

and I need to do this for the whole fasta file (all headers and sequences).

Thanks again!

ADD REPLYlink written 2.3 years ago by leo1985.arnab10

So you don't have a standard fasta format sequence file where first line is an identifier >some_id and the sequence follows on the second line? If you had said this yesterday then I would have reset the formatting. My apologies.

Your sequence is present on the same line as the identifier and you want to keep it on that line after shortening the header. Is that correct?

If the number of extra fields is always the same in all sequences, then see if this works: cut -d ' ' -f1,7 your.fa > new.fa

I am leaving this post here since child posts will disappear if I delete this. Content is no longer applicable.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax67k

Sorry again, genomax2; the formatting got disrupted again in my post today. The file is exactly as it was posted yesterday (top of my post): the sequence IS on the second line, beneath the header, as it should be in a regular fasta file.

Alex,

Here is a snippet from my actual fasta file:

>ABCD_gene_1 [gene=XYZ] [location=1..2231]
AGAGTGTATTGATGAGTCTCCATGGGAATGTGAAATGGCTGAAATGTTTGAAGAAACTGTTTTAGGGGAT
AGGAGACTTGGGAGTATGTATGGGGTGGTTGGATTTGTCACATTGATCTTGTTGGGGACATGCATGTAAA
...............................

>PQRS_gene_2 [gene=PQR] [location=1..2618]
CGCGCGGCCCAGGAAGTGCGCGCTGTTGCCCCGGAAGTGCACGCGCGGGGTCACCGGAAGCGGCGCGTCG
GGAGGATGCCGCTGCCTGTCCAGGTCTTCAACCTGCAGGGTGCAGTGGAGCCCATGCAGATTGATGTGGA
........

The list goes on for another several hundred sequences. The final output that I am looking for is:

>ABCD_gene_1
AGAGTGTATTGATGAGTCTCCATGGGAATGTGAAATGGCTGAAATGTTTGAAGAAACTGTTTTAGGGGAT
AGGAGACTTGGGAGTATGTATGGGGTGGTTGGATTTGTCACATTGATCTTGTTGGGGACATGCATGTAAA
...............................

>PQRS_gene_2
CGCGCGGCCCAGGAAGTGCGCGCTGTTGCCCCGGAAGTGCACGCGCGGGGTCACCGGAAGCGGCGCGTCG
GGAGGATGCCGCTGCCTGTCCAGGTCTTCAACCTGCAGGGTGCAGTGGAGCCCATGCAGATTGATGTGGA
........

and so on for all the headers and sequences in the rest of the fasta file.

I am hoping the post will come up correctly formatted this time. Otherwise, please know that my file looks just like a regular fasta file (as correctly formatted by genomax2 yesterday). Hope this helps!

Thanks !

ADD REPLYlink modified 2.3 years ago by genomax67k • written 2.3 years ago by leo1985.arnab10

For future reference, it is safer to use the "code" formatting tool (101010 button) when posting things like code or file formats. I have done this for you (and also reset the format of the original question).

So you do have a normal fasta file (since there is no formal spec for fasta this would do).

Edit: @Alex's solution as posted above did not work with the example data posted today (which is not the same as the original post).

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax67k

Thanks genomax2! In fact, the answer you posted yesterday (the cut command) does the job fine. I must have made a mistake somewhere, which gave me a different result last night. I appreciate you pointing out the formatting tool button. I am new to the forum and will take care of this before posting next time.

My issue is solved. Thanks to all who took the time to contribute answers; I learnt several new ways of handling such scripting situations in the future. Thanks everyone!

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by leo1985.arnab10

My awk statement should shorten headers and preserve sequences in FASTA files. I'm unclear what the issue is on your side, but if you want to post a snippet of your file and what results you're getting, I'd be happy to try to help.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Alex Reynolds28k
2.3 years ago by
shenwei3564.6k
China
shenwei3564.6k wrote:

Lightning fast solution using seqkit (usage of seqkit seq), just download the binary file for Linux, decompress and immediately run:

./seqkit seq -i seqs.fa > seqs2.fa
ADD COMMENTlink written 2.3 years ago by shenwei3564.6k