How to extract multiple FASTA sequences from a file
6.8 years ago
HZZ0036 ▴ 30

Hi, I have a file that contains both FASTA sequences and non-FASTA lines, like this:

 454 -      PolA   2436284                1.88
 454 -    1 CDSl   2436471 -   2436637   17.09   2436471 -   2436635    165
 454 -    2 CDSf   2436688 -   2436928   18.36   2436689 -   2436928    240
 454 -      TSS    2437349               -1.10
enter code here
 455 +      TSS    2439215                5.09
 455 +    1 CDSf   2439438 -   2439570   13.30   2439438 -   2439569    132

Predicted protein(s):
>FGENESH:   1   3 exon (s)     37  -   4224   154 aa, chain +
MCLADYAIICHREGTLHEVVDPIIRDQIAPQCLRKFAEMTEQCVNEVGTTGGASALRAPG
AGPEAEAREKMCLADYAIICHREGTLHEAVDPIIRDQTRRNASGNSLRRQSNLINGHEAY
TTTARTHETRVVEETGDELANSAAFSQLVRPIGR
>FGENESH:   2   5 exon (s)   5130  -   6247   229 aa, chain -
MAPCQDIVDEGWGWERLVPCRFDGCVKWPDFKRYLVHYYHKNADKKVGELVGMRKPYPVE
QPDGATDDSLHAIVNQCIEAEYRFIRTCREKFTIDDFLLSRDITDRAKQLLQSGCESSIA
TVALLCITKEDELLCELFACQDISKALAFANVIRRSASNLMLFKGSESDAAGGGIMLGLA
REAEVALLAMHSGDEYAIANYITAVDARMRVPWCRCPVAMTTVSEVAAM

How can I extract all the FASTA sequences? I want to end up with a file that contains only this:

>FGENESH:   1   3 exon (s)     37  -   4224   154 aa, chain +
MCLADYAIICHREGTLHEVVDPIIRDQIAPQCLRKFAEMTEQCVNEVGTTGGASALRAPG
AGPEAEAREKMCLADYAIICHREGTLHEAVDPIIRDQTRRNASGNSLRRQSNLINGHEAY
TTTARTHETRVVEETGDELANSAAFSQLVRPIGR
>FGENESH:   2   5 exon (s)   5130  -   6247   229 aa, chain -
MAPCQDIVDEGWGWERLVPCRFDGCVKWPDFKRYLVHYYHKNADKKVGELVGMRKPYPVE
QPDGATDDSLHAIVNQCIEAEYRFIRTCREKFTIDDFLLSRDITDRAKQLLQSGCESSIA
TVALLCITKEDELLCELFACQDISKALAFANVIRRSASNLMLFKGSESDAAGGGIMLGLA
REAEVALLAMHSGDEYAIANYITAVDARMRVPWCRCPVAMTTVSEVAAM

Thanks in advance.


Can you reformat your post? The site interprets > as the beginning of a quotation, so enclose your sequence information in a code block using the button with 101010 on it.


Is there always the same number of header lines before the sequences start?

6.8 years ago

Start printing at the first line that begins with '>':

awk '/^>/ {f=1;} {if(f==1) print;}'  file.fa
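
To see what the flag approach above does, here is a minimal sketch on a made-up input. The file names sample.txt and proteins.fa and the shortened sequences are assumptions for illustration only. Note that everything after the first header is printed, so any non-FASTA lines that come later in the file would be kept too.

```shell
# Build a small test file: annotation lines first, then two FASTA records.
# (sample.txt and proteins.fa are hypothetical names; sequences are shortened.)
cat > sample.txt <<'EOF'
 454 -      TSS    2437349               -1.10

Predicted protein(s):
>FGENESH:   1   3 exon (s)     37  -   4224   154 aa, chain +
MCLADYAIICHREGTLHEVV
>FGENESH:   2   5 exon (s)   5130  -   6247   229 aa, chain -
MAPCQDIVDEGWGWERLVPC
EOF

# f is set at the first header line; the bare pattern "f" then prints
# every line for which f is non-zero, i.e. from that header onwards.
awk '/^>/ {f=1} f' sample.txt > proteins.fa

cat proteins.fa
```

The shorter `f` pattern is equivalent to the `{if(f==1) print;}` form in the answer.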
6.8 years ago
venu 7.1k

It seems the lines that start with digits have a leading space. The following should work:

grep -v '^\s' file.fa | sed -e '/enter/d' -e '/^$/d' -e '/Predicted/d'

If those lines do not start with a space, just replace '^\s' with '^[0-9]'.
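
An alternative to listing the lines to delete is to list the lines to keep. The sketch below is not part of the answer above; it assumes that every sequence line consists only of uppercase letters (optionally with '*' as a stop symbol) and that no unwanted line does. The file names mixed.txt and whitelist.fa and the shortened sequences are made up for the example.

```shell
# Hypothetical input mixing annotation lines, stray text and FASTA records.
cat > mixed.txt <<'EOF'
 454 -      PolA   2436284                1.88
enter code here
 455 +      TSS    2439215                5.09

Predicted protein(s):
>FGENESH:   1   3 exon (s)     37  -   4224   154 aa, chain +
MCLADYAIICHREGTLHEVV
TTTARTHETRVVEETGDELA
//
>FGENESH:   2   5 exon (s)   5130  -   6247   229 aa, chain -
MAPCQDIVDEGWGWERLVPC
EOF

# Keep header lines, plus lines made up solely of uppercase letters or '*'.
# Everything else (blank lines, coordinates, "//", free text) is dropped.
awk '/^>/ || /^[A-Z*]+$/' mixed.txt > whitelist.fa

cat whitelist.fa
```

Under that assumption, new kinds of junk lines do not need their own delete rule, but a sequence line containing lowercase letters or gaps would be lost, so check the output against the input.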


It worked, but some lines contain "//", like:

>FGENESH: 199   5 exon (s) 1515013  - 1519188   619 aa, chain -
LRGSLGLRARDWPARSDPCSAWTGVTCRAGRVVALTVAGLRRTRRASLAPRLALDGLRNL
TALERFNASGFPLPGEIPAWFASGSGLPPPLAVLDLTSAGVNGTLPAGLGAASGNLTTLL
//
>FGENESH:   1   1 exon (s)   4483  -   4881   132 aa, chain +
MEEQHGGGRASNKIRDIVRLQQLLKKWKKLATVAPSSSSGKSSSVPRGSFAVYVGDEMRR
FVIPTEYLGHWAFAELLREAEEEFGFRHEGALRIPCDVEVFEGILRVVQGRKKDATDMCR
HSCSSETEILCR
......

How can I delete the '//' lines and rename the FASTA headers with sequential numbers? Like:

    >1
    LRGSLGLRARDWPARSDPCSAWTGVTCRAGRVVALTVAGLRRTRRASLAPRLALDGLRNL
    TALERFNASGFPLPGEIPAWFASGSGLPPPLAVLDLTSAGVNGTLPAGLGAASGNLTTLL
    >2
    MEEQHGGGRASNKIRDIVRLQQLLKKWKKLATVAPSSSSGKSSSVPRGSFAVYVGDEMRR
    FVIPTEYLGHWAFAELLREAEEEFGFRHEGALRIPCDVEVFEGILRVVQGRKKDATDMCR
    HSCSSETEILCR
    .....

Renaming FASTA headers is probably the most-asked question on the forum, so search around a bit before asking a new question, which is really what the second part of your comment is. (The best way to learn is by doing it yourself anyway!)
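
For readers who land here with the same follow-up, one possible approach is sketched below. It assumes the separator lines consist of exactly "//" and that headers should simply be numbered in order of appearance; the file names extracted.fa and renumbered.fa and the shortened sequences are hypothetical.

```shell
# Hypothetical input: already-extracted FASTA with "//" separator lines.
cat > extracted.fa <<'EOF'
>FGENESH: 199   5 exon (s) 1515013  - 1519188   619 aa, chain -
LRGSLGLRARDWPARSDPCS
//
>FGENESH:   1   1 exon (s)   4483  -   4881   132 aa, chain +
MEEQHGGGRASNKIRDIVRL
HSCSSETEILCR
EOF

# 1. Skip lines that are exactly "//".
# 2. Replace each header with ">" followed by a running counter.
# 3. Print all remaining (sequence) lines unchanged.
awk '/^\/\/$/ {next}
     /^>/     {n++; print ">" n; next}
              {print}' extracted.fa > renumbered.fa

cat renumbered.fa
```

The original header text is discarded, so keep the input file if the exon and coordinate information is needed later.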

6.8 years ago
Joe 21k

This works on the test data, though I'm not sure whether other files would catch it out:

sed -n '/^>/,$p' testfile.txt

This prints everything from the first line that starts with > through to the end of the file, so every FASTA record after the first header is included (along with any non-FASTA lines that follow it).
