Remove sequences with stop codons from FASTA file
4.2 years ago
Jen ▴ 10

Hi all,

I have a FASTA file of ORF (amino acid) sequences containing around 500,000 sequences. I need to remove the sequences that contain "*" stop codons. Is it possible to do this on the command line with sed, cat, awk, etc.?

Thanks

sequence aminoacid • 3.4k views

Assuming the file contains linearized amino acid sequences in FASTA format (one sequence line per record), the code below does this. It is shown with short example sequences that have the stop codon at different positions:

$ cat test.fa 
>p1
ACDLA
>p2
ADCGLAGCTYLAKQ*
>P3
GTCTY*ATCG
>P4
*GAP
>P5
AGATE

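code (-x makes the second grep match whole lines only, so a collected header such as >p1 does not also remove >p10):

$ grep -B1 '\*' test.fa | grep -vxFf - test.fa

output:

$ grep -B1 '\*' test.fa | grep -vxFf - test.fa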
>p1
ACDLA
>P5
AGATE
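
One caveat when running this on real data: the patterns passed to the second grep are fixed strings, so without whole-line matching a collected header such as >p1 would also match >p10 as a substring. Below is a minimal sketch of a whole-line variant, assuming a grep that supports -B (GNU and BSD grep do) and one sequence line per record. The function name and the file name are made up for illustration.

```shell
# Hypothetical helper; assumes one sequence line per record.
filter_stops_grep() {
    # First grep: every line containing a literal '*' plus the header before it.
    # Second grep: drop lines that equal one of those lines exactly
    # (-x whole line, -F fixed strings, -f - patterns read from stdin).
    # Any "--" separator printed by -B1 can then only match a line that is exactly "--".
    grep -B1 -F '*' "$1" | grep -vxFf - "$1"
}

# Example input where >p1 is a prefix of >p10.
printf '>p1\nACD*\n>p10\nGGA\n' > prefix_test.fa
filter_stops_grep prefix_test.fa
```

Two records with identical header lines would still be removed together, so unique headers are assumed here.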
4.2 years ago

Linearize and filter with awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk -F '\t'  '!($2 ~ /\*/)' | tr "\t" "\n"
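
The same linearize-then-filter idea can also be written as a single awk program that buffers one record at a time. This is only a sketch, not the pipeline above: the function name filter_stops_awk and the file name are hypothetical, and the output is written with each sequence on one line even if the input was wrapped.

```shell
# Hypothetical helper: print only records whose sequence contains no stop codon.
filter_stops_awk() {
    awk '
        # Print the buffered record if there is one and it has no "*".
        function flush_record() {
            if (header != "" && index(seq, "*") == 0) {
                print header
                print seq
            }
        }
        /^>/ { flush_record(); header = $0; seq = ""; next }  # start of a new record
             { seq = seq $0 }                                   # join wrapped sequence lines
        END  { flush_record() }                                 # last record in the file
    ' "$1"
}

# Example with wrapped sequence lines.
printf '>p1\nACD\nLA\n>p2\nADCG\nLAQ*\n>P5\nAGATE\n' > wrapped.fa
filter_stops_awk wrapped.fa
```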

Thanks, Pierre. I linearized the FASTA file using awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta > output.fasta and then ran the filter, awk -F '\t' '!($2 ~ /\*/)' input.fasta > output.fasta. This completely filters the sequences containing the * character out of my file.

So what is the tr command for?


It converts the lines back to FASTA: it transforms (title)(tab)(seq) into (title)(newline)(seq). See "convert back to fasta".
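
To see what each stage contributes, the one-liner can be split into named steps and run on a tiny file. The function names linearize and to_fasta and the file demo.fa are hypothetical, introduced only for this illustration. The sketch assumes headers contain no tab characters.

```shell
# Stage 1: one record per line, as header<TAB>sequence (same awk as in the answer).
linearize() {
    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$1"
}

# Stage 3: turn each tab back into a newline, restoring header and sequence lines.
to_fasta() {
    tr "\t" "\n"
}

printf '>p1\nACD\nLA\n>p2\nGT*\n' > demo.fa

# Intermediate form: every record is a single tab-separated line.
linearize demo.fa

# Full pipeline: stage 2 keeps lines whose second field has no '*'.
linearize demo.fa | awk -F '\t' '!($2 ~ /\*/)' | to_fasta
```

Without the final step the filtered file stays in the tab-separated intermediate form, which is why the tr call is needed to get FASTA back.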


Thank you so much. Yes, I finally got that right. Can I ask one more question? What if I want the file in the format (title)(space)(seq)? Is that possible?


Please don't be lazy; try to understand how this works.
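
For readers working through it: the last tr step is the only place where the separator between title and sequence is chosen. A sketch of the (title)(space)(seq) variant, assuming headers contain no tabs, translates the tab to a space instead of a newline. The function name and file name are hypothetical, and the result is no longer FASTA.

```shell
# Hypothetical helper: filtered records as "title<space>sequence", one per line.
title_space_seq() {
    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$1" |
        awk -F '\t' '!($2 ~ /\*/)' |
        tr "\t" " "   # the only change from the original pipeline: tab becomes a space
}

printf '>p1\nACD\nLA\n>p2\nGT*\n' > space_demo.fa
title_space_seq space_demo.fa
```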
