remove fasta file heading
9.6 years ago
ravihansa82 ▴ 130

Dear friends, I need your help removing part of the header lines in a FASTA file.

For example, I need to keep only the gene ID (ENSG00000026652), so everything from the pipe to the end of the header line has to be removed.

Since I have about 4000 sequences, I need to do this programmatically.

>ENSG00000026652|ENST00000437165;ENST00000320285;ENST00000366911;ENST00000436279;ENST00000366905
TTTGACACTTGCATAGCTGTTAGTAATTCTGCAATGTGCTTGGGCTATTTGTGAGCACAT
GTATTTTTCCTTTATAGATTATAAACATCTAAAGAACAAGGTTAACCCAGGAGTCAAGTA
AATAGTTAAATATTATTTTGACAATGGCTGTAATAGTGGACATTTGAAAGGAATACACCT
CAGTATTTTGAAATTGAAATAATTTTCTAGATCCTGGCATTTCTGGACTTTCAACAGCCC

Desired output

>ENSG00000026652
TTTGACACTTGCATAGCTGTTAGTAATTCTGCAATGTGCTTGGGCTATTTGTGAGCACAT
GTATTTTTCCTTTATAGATTATAAACATCTAAAGAACAAGGTTAACCCAGGAGTCAAGTA
AATAGTTAAATATTATTTTGACAATGGCTGTAATAGTGGACATTTGAAAGGAATACACCT
CAGTATTTTGAAATTGAAATAATTTTCTAGATCCTGGCATTTCTGGACTTTCAACAGCCC

I am a bit familiar with Perl, so can somebody help me do this?

sequencing gene • 9.5k views
9.6 years ago
Jorjial ▴ 300

If you know that the part you want to delete starts with "|", you can use sed in the terminal:

sed 's/|.*$//' file.fasta > new_file.fasta

This deletes all the characters from the first "|" to the end of the line. Sequence lines contain no "|", so they pass through unchanged.
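For anyone who would rather script this than use sed, the same truncation can be sketched in plain Python. This is only an illustration, assuming headers start with ">" and the ID is everything before the first "|". The function name `strip_header_after_pipe` is made up for this example. Unlike the sed command, it only looks at header lines.

```python
def strip_header_after_pipe(lines):
    """Yield FASTA lines with each header cut at the first '|'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the text before the first pipe, e.g. the gene ID.
            line = line.split("|", 1)[0]
        yield line


# Small in-memory example; a real run would iterate over an open file instead.
example = [
    ">ENSG00000026652|ENST00000437165;ENST00000320285\n",
    "TTTGACACTTGCATAGCTGTTAGTAATTCTGCAATGTGCTTGGGCTATTTGTGAGCACAT\n",
]
cleaned = list(strip_header_after_pipe(example))
print("\n".join(cleaned))
```

Because the function accepts any iterable of lines, it could be fed an open file handle and its output written line by line to a new file.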

I hope this helps


Thank you, Jorjial, that worked perfectly.


Hello, I wonder how to use this command to delete all the characters from the "|" back to the start of the line? Thanks!


Thanks a lot! It works very well on my multi-FASTA file; I had been looking for this for two days.

9.6 years ago
edrezen ▴ 730

Hello,

Not a Perl answer, but you can do it with the 'gawk' command (or 'awk' if you don't have gawk).

If the separator is only the pipe character, you can do the following:

gawk 'BEGIN{FS="|"}{print $1}' reads.fa

You should get rid of the cat part.


I just did that for edrezen

9.6 years ago
5heikki 11k
cut -f1 -d "|" input > output

Hello, how can I use this command to remove everything to the left of the "|"? Thank you.


You mean keep everything to the right of the pipe?

cut -f2- -d "|" input > output

You should:

man cut

To see:

Numbers or number ranges may be followed by a dash, which selects all fields or columns from the last number to the end of the line.
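To make the field numbering concrete, here is a rough Python analogue of what `cut -d "|"` selects for `-f1` and `-f2-`. It is a sketch for illustration only: `cut_fields` is a hypothetical helper, not part of any library, and it handles just those two kinds of field list (a single number, or a number followed by a dash).

```python
def cut_fields(line, spec, delim="|"):
    """Roughly mimic `cut -d DELIM -f SPEC` for a SPEC like "1" or "2-"."""
    if delim not in line:
        # cut passes lines without the delimiter through unchanged (unless -s is used).
        return line
    fields = line.split(delim)
    if spec.endswith("-"):
        start = int(spec[:-1])  # cut numbers fields from 1
        return delim.join(fields[start - 1:])
    index = int(spec) - 1
    return fields[index] if index < len(fields) else ""


header = ">PacBio|Sequence0001|extra"
print(cut_fields(header, "1"))   # >PacBio
print(cut_fields(header, "2-"))  # Sequence0001|extra
```

Note that the ">" is part of field 1, so selecting fields from 2 onward drops it from the header.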

Thank you, 5heikki. It works, but not quite as I expected. I am sorry I did not make it clear. What I want is the ">" plus everything to the right of the pipe. For example, my header is >PacBio|Sequence0001 and what I want is >Sequence0001

Thanks so much for your response.


Can't do that with cut.

awk 'BEGIN{FS="|"}{if(/^>/){print ">"$2}else{print $0}}' input > output
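The same idea can be sketched in Python for comparison. This is an illustrative version, not a drop-in replacement: `keep_after_pipe` is a made-up name, and it differs from the awk line in two assumed design choices. It keeps everything after the first "|" (the awk command prints only the second field), and it leaves headers without a "|" untouched.

```python
def keep_after_pipe(lines):
    """Yield FASTA lines; headers keep only the text after the first '|'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") and "|" in line:
            # partition splits once into (before, "|", after).
            _, _, rest = line.partition("|")
            line = ">" + rest
        yield line


records = [">PacBio|Sequence0001\n", "ACGTACGT\n"]
print(list(keep_after_pipe(records)))  # ['>Sequence0001', 'ACGTACGT']
```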
9.6 years ago

Because I think Perl is a terrible language, here's how to do it with Biopython:

#!/usr/bin/env python
import re
from Bio import SeqIO

# Copy every record from foo.fa to foobar.fa, shortening headers that contain an Ensembl gene ID.
with open("foobar.fa", "w") as of:
    for record in SeqIO.parse("foo.fa", "fasta"):
        matches = re.search(r"(ENSG\d+)", record.id)
        if matches is not None:
            # Keep only the gene ID and drop the rest of the header.
            record.id = matches.group(1)
            record.description = ""
        SeqIO.write(record, of, "fasta")

You could easily translate that to either Perl or BioPerl. You could also do this with a sed or awk one-liner.


Out of interest, why do you think Perl is a terrible language?


It makes it too easy for people to write opaque unsupportable code.
