7.3 years ago
ravihansa82 ▴ 100

Dear friends, I need your help removing part of the header lines in a FASTA file.

For example, I need to keep only the gene ID (ENSG00000026652), so I need to remove everything from the pipe to the end of the header line.

Since I have about 4000 sequences, I need to do this programmatically.

>ENSG00000026652|ENST00000437165;ENST00000320285;ENST00000366911;ENST00000436279;ENST00000366905
TTTGACACTTGCATAGCTGTTAGTAATTCTGCAATGTGCTTGGGCTATTTGTGAGCACAT
GTATTTTTCCTTTATAGATTATAAACATCTAAAGAACAAGGTTAACCCAGGAGTCAAGTA
AATAGTTAAATATTATTTTGACAATGGCTGTAATAGTGGACATTTGAAAGGAATACACCT
CAGTATTTTGAAATTGAAATAATTTTCTAGATCCTGGCATTTCTGGACTTTCAACAGCCC


Desired output

>ENSG00000026652
TTTGACACTTGCATAGCTGTTAGTAATTCTGCAATGTGCTTGGGCTATTTGTGAGCACAT
GTATTTTTCCTTTATAGATTATAAACATCTAAAGAACAAGGTTAACCCAGGAGTCAAGTA
AATAGTTAAATATTATTTTGACAATGGCTGTAATAGTGGACATTTGAAAGGAATACACCT
CAGTATTTTGAAATTGAAATAATTTTCTAGATCCTGGCATTTCTGGACTTTCAACAGCCC


I am a bit familiar with Perl, so can somebody help me do this?

sequencing gene • 8.1k views
7.3 years ago
Jorjial ▴ 280

If you know that the part you want to delete starts with "|", you can use sed in the terminal:

sed 's/|.*$//' file.fasta > new_file.fasta

This deletes all the characters from "|" to the end of the line. I hope this helps.

Thank you, Jorjial, that worked perfectly.

Hello, I wonder how to use this command to delete all the characters from "|" to the left end of the line? Thanks!

Thanks a lot! It works very well for my multi-FASTA file. I had been looking for this for two days.

7.3 years ago
edrezen ▴ 720

Hello,

Not a Perl answer, but you can do it with the 'gawk' command (or 'awk' if you don't have gawk).

If the separator is only the pipe character, you can do the following:

gawk 'BEGIN{FS="|"}{print$1}' reads.fa
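
For anyone who prefers a script, the first-field logic of the sed and gawk commands above can be sketched in plain Python. This is only an illustration: the function name `trim_headers` and the sample records are made up for the example, and unlike the one-liners it only touches lines that start with ">" (sequence lines normally contain no pipe, so the result should be the same).

```python
def trim_headers(lines, sep="|"):
    """Yield FASTA lines with each header cut at the first `sep`.

    `trim_headers` is a hypothetical helper name, not part of any library.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the text before the first separator.
            line = line.split(sep, 1)[0]
        yield line


# Small in-memory example instead of a real file.
example = [
    ">ENSG00000026652|ENST00000437165;ENST00000320285\n",
    "TTTGACACTTGCATAGCTGTTAGTAATTCTGCAATGTGCTTGGGCTATTTGTGAGCACAT\n",
]
trimmed = list(trim_headers(example))
print("\n".join(trimmed))
```

To run it on a real file, pass an open file handle as `lines` and write each yielded line, followed by a newline, to the output file.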

You should get rid of the cat part.

I just did that for edrezen

7.3 years ago
5heikki 10k
cut -f1 -d "|" input > output

Hello, how can I use this command to remove everything to the left of "|"? Thank you.

You mean keep everything to the right of the pipe?

cut -f2- -d "|" input > output


You should:

man cut


To see:

Numbers or number ranges may be followed by a dash, which selects all fields or columns from the last number to the end of the line.
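
As a rough illustration of what an open-ended field range such as -f2- selects, here is a small Python sketch. The helper name `cut_from_field` is invented for this example. It also mirrors the default cut behaviour of printing lines that contain no delimiter unchanged, which is why sequence lines pass through untouched.

```python
def cut_from_field(line, first, sep="|"):
    """Mimic `cut -f<first>- -d <sep>` for a single line (hypothetical helper)."""
    if sep not in line:
        # Without -s, cut prints lines that lack the delimiter as they are.
        return line
    fields = line.split(sep)
    # Fields are numbered from 1, so field `first` sits at index first - 1.
    return sep.join(fields[first - 1:])


print(cut_from_field("a|b|c", 2))
print(cut_from_field(">PacBio|Sequence0001", 2))
print(cut_from_field("ACGTACGT", 2))
```

Note that on a FASTA header the leading ">" is part of field 1, so selecting from field 2 onward drops it.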

Thank you, 5heikki. It works, but not as I expected. I am sorry I did not make it clear. What I want is to keep the ">" and everything to the right of the pipe. For example, my header is:

>PacBio|Sequence0001

and what I want is:

>Sequence0001

Thanks so much for your response.

Can't do that with cut.

awk 'BEGIN{FS="|"}{if(/^>/){print ">"$2}else{print$0}}' input > output
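
The same idea can be sketched in Python if a script is more convenient. This is not a line-for-line translation of the awk command: the hypothetical helper `drop_header_prefix` keeps everything after the first pipe (awk's $2 would stop at a second pipe), and it leaves headers without a pipe unchanged. Both choices are assumptions made for this example.

```python
def drop_header_prefix(lines, sep="|"):
    """Yield FASTA lines, rewriting '>prefix|rest' headers as '>rest'.

    `drop_header_prefix` is a hypothetical helper name.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") and sep in line:
            # Keep the ">" marker plus everything after the first separator.
            line = ">" + line.split(sep, 1)[1]
        yield line


example = [">PacBio|Sequence0001\n", "ACGTACGT\n"]
cleaned = list(drop_header_prefix(example))
print("\n".join(cleaned))
```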

7.3 years ago

Because I think perl is a terrible language, here's how to do it with biopython:

#!/usr/bin/env python
import re

from Bio import SeqIO

# Copy foo.fa to foobar.fa, reducing each header to its Ensembl gene ID.
with open("foobar.fa", "w") as of:
    for record in SeqIO.parse("foo.fa", "fasta"):
        matches = re.search(r"(ENSG\d+)", record.id)
        if matches is not None:
            record.id = matches.group(1)
            record.description = ""
        # Records without an ENSG ID are written with their header unchanged.
        SeqIO.write(record, of, "fasta")


You could easily translate that to either perl or bioperl. You could also do this with a sed or awk one-liner.

Out of interest, why do you think Perl is a terrible language?

It makes it too easy for people to write opaque unsupportable code.