Question

Remove sequences with stop codons from FASTA file

7.9 years ago

Jen ▴ 10

Hi all,

I have a FASTA file of ORF (amino acid) sequences, around 500,000 in total. I need to remove the sequences that contain a "*" stop codon. Is this possible with command-line tools such as sed, cat, or awk?

Thanks

sequence aminoacid • 5.6k views

ADD COMMENT • link updated 7.9 years ago by Pierre Lindenbaum 166k • written 7.9 years ago by Jen ▴ 10

Assuming the file contains linearized amino acid sequences (one line per sequence) in FASTA format, the code below should work. The short example has stop codons at different positions:

$ cat test.fa 
>p1
ACDLA
>p2
ADCGLAGCTYLAKQ*
>P3
GTCTY*ATCG
>P4
*GAP
>P5
AGATE

code:

$ grep -B1 \*  test.fa | grep -vxFf - test.fa

output:

$ grep -B1 \*  test.fa | grep -vxFf - test.fa
>p1
ACDLA
>P5
AGATE
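
A caveat worth checking with this pattern-file approach: `grep -F` patterns match as substrings unless `-x` (whole-line match) is also given, so a flagged header such as `>p1` would also remove `>p10`. The sketch below illustrates this on a hypothetical two-record file (`ids.fa` and its contents are made up for the demonstration):

```shell
# Hypothetical input: >p1 contains a stop codon, >p10 does not.
workdir=$(mktemp -d)
printf '>p1\nAC*DLA\n>p10\nAGATE\n' > "$workdir/ids.fa"

# Without -x the pattern ">p1" also matches the line ">p10",
# so the header of the good record is dropped.
loose=$(grep -B1 '\*' "$workdir/ids.fa" | grep -vFf - "$workdir/ids.fa")

# With -x every pattern must match an entire line.
strict=$(grep -B1 '\*' "$workdir/ids.fa" | grep -vxFf - "$workdir/ids.fa")

printf 'without -x:\n%s\n\nwith -x:\n%s\n' "$loose" "$strict"
```

Both variants still assume one line per sequence and unique headers.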

ADD REPLY • link 7.9 years ago by cpad0112 21k

GenoMax · Answer 1 · 2017-08-10

7.9 years ago

Pierre Lindenbaum 166k

Linearize the FASTA file and filter with awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk -F '\t'  '!($2 ~ /\*/)' | tr "\t" "\n"
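
The one-liner can also be run stage by stage, which makes the role of each command easier to see. This is only a sketch: the sample records and the intermediate file names (`linear.tsv`, `kept.tsv`, `filtered.fa`) are made up for illustration.

```shell
# Hypothetical sample input; the first sequence is wrapped over two lines.
workdir=$(mktemp -d)
printf '>p1\nACD\nLA\n>p2\nADCGLAGCTYLAKQ*\n>P3\nGTCTY*ATCG\n>P4\n*GAP\n>P5\nAGATE\n' > "$workdir/input.fa"

# Stage 1: linearize, giving one record per line as header<TAB>sequence.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
    "$workdir/input.fa" > "$workdir/linear.tsv"

# Stage 2: keep only the records whose sequence column has no '*'.
awk -F '\t' '!($2 ~ /\*/)' "$workdir/linear.tsv" > "$workdir/kept.tsv"

# Stage 3: turn each tab back into a newline to restore FASTA layout.
tr "\t" "\n" < "$workdir/kept.tsv" > "$workdir/filtered.fa"

cat "$workdir/filtered.fa"
```

The tab is used as the separator on the assumption that headers never contain a tab; a header with a tab would break both the filter and the final conversion.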

ADD COMMENT • link 7.9 years ago by Pierre Lindenbaum 166k

Thanks Pierre. I linearized the FASTA file using awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta > output.fasta and then ran the second awk command, awk -F '\t' '!($2 ~ /\*/)' input.fasta > output.fasta. This filters every sequence containing the * character out of my file.

So what is the tr command for?

ADD REPLY • link updated 7.9 years ago by GenoMax 152k • written 7.9 years ago by Jen ▴ 10

It converts the lines back to FASTA: it transforms (title)(tab)(seq) into (title)(newline)(seq).

ADD REPLY • link 7.9 years ago by Pierre Lindenbaum 166k

Thank you so much. Yes, I finally got it right. Can I ask one more question? What if I want the output in the format (title)(space)(seq)? Is that possible?
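
One possible way to get that layout, sketched here under the assumption that the same linearize-and-filter pipeline is used: change only the last step so that tr turns the tab into a space instead of a newline. The sample records and the output name `spaced.txt` are hypothetical.

```shell
# Hypothetical sample input with one stop-codon sequence.
workdir=$(mktemp -d)
printf '>p1\nACDLA\n>p2\nADCGLAGCTYLAKQ*\n>P5\nAGATE\n' > "$workdir/input.fa"

# Same linearize and filter steps as in the answer; only the final tr differs:
# the tab becomes a space, so each record stays on one line.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' \
    "$workdir/input.fa" \
  | awk -F '\t' '!($2 ~ /\*/)' \
  | tr "\t" " " > "$workdir/spaced.txt"

cat "$workdir/spaced.txt"
```

The result is no longer standard FASTA, and if a title itself contains spaces, the sequence can only be recovered as the last space-separated field on each line.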