Question: Remove sequences with stop codons from FASTA file
Jen wrote, 18 months ago:

Hi all,

I have a FASTA file of ORF (amino acid) sequences, around 500,000 in total. I need to remove the sequences that contain "*" stop codons. Is this possible with command-line tools such as sed, cat, or awk?

Thanks

aminoacid sequence • 1.5k views

Assuming the file contains linearized amino acid sequences in FASTA format (each sequence on a single line), the code below removes them. The short example file has stop codons at different positions:

$ cat test.fa 
>p1
ACDLA
>p2
ADCGLAGCTYLAKQ*
>P3
GTCTY*ATCG
>P4
*GAP
>P5
AGATE

code:

$ grep -B1 \*  test.fa | grep -vFf - test.fa

output:

$ grep -B1 \*  test.fa | grep -vFf - test.fa
>p1
ACDLA
>P5
AGATE
written 18 months ago by cpad0112
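
A caveat on the grep pipeline above, sketched here under the assumption that GNU grep is available: with `-F` alone each pattern matches as a substring, so removing `>p1` would also drop the header `>p10`, and when the matches are not adjacent `-B1` prints `--` separator lines that then become patterns as well. Adding `-x` (whole-line match) and `--no-group-separator` avoids both. The file names `demo.fa` and `demo.clean.fa` and the sequences are made up for illustration.

```shell
# Hypothetical test file: >p1 and >p3 contain a stop codon, >p10 and >p2 do not.
cat > demo.fa <<'EOF'
>p1
ACD*LA
>p10
ACDLA
>p2
AGATE
>p3
*GAP
EOF

# First grep: every line containing a literal '*' plus the header before it,
# without the "--" separators GNU grep prints between non-adjacent groups.
# Second grep: drop lines that are exactly equal (-x) to one of those lines.
grep -B1 --no-group-separator -F '*' demo.fa | grep -vxFf - demo.fa > demo.clean.fa

cat demo.clean.fa
```

On this input the result keeps `>p10` and `>p2` with their sequences. The sketch still assumes one sequence line per record and unique header lines.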
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 18 months ago:

Linearize and filter with awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk -F '\t'  '!($2 ~ /\*/)' | tr "\t" "\n"
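
The one-liner above can be read as three stages. The following is a sketch of the same pipeline split into named steps; the function names `linearize` and `drop_stops`, the file names, and the sample records are assumptions made for illustration.

```shell
# Hypothetical multi-line FASTA input.
cat > multi.fa <<'EOF'
>p1
ACD
LA
>p2
ADCGLAG
CTYLAKQ*
>p3
AGATE
EOF

# Stage 1: join each record into a single "header<TAB>sequence" line.
linearize() {
  awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$1"
}

# Stage 2: keep only records whose sequence column (field 2) has no '*'.
drop_stops() {
  awk -F '\t' '!($2 ~ /\*/)'
}

# Stage 3: turn the tab back into a newline to restore the FASTA layout.
linearize multi.fa | drop_stops | tr '\t' '\n' > multi.clean.fa

cat multi.clean.fa
```

Because stage 1 concatenates the sequence lines, each surviving record comes out with its sequence on one line; the original line wrapping is not restored.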

Thanks, Pierre. I linearized the FASTA file using awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta > output.fasta and then ran the awk filter, awk -F '\t' '!($2 ~ /\*/)' input.fasta > output.fasta. This removes every sequence containing the * character from my file.

So what is the tr command for?

written 18 months ago by Jen

It converts the lines back to FASTA: it transforms each '(title)(tab)(seq)' line into '(title)(newline)(seq)'. See "convert back to fasta".

written 18 months ago by Pierre Lindenbaum
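
To make the tab-to-newline step concrete, here is a minimal sketch on a single made-up record. The variable name `record` and the output file `record.fa` are illustrative only.

```shell
# One linearized record: header and sequence separated by a tab.
record=$(printf '>p1\tACDLA')

# tr replaces every tab with a newline, giving the two-line FASTA form.
printf '%s\n' "$record" | tr '\t' '\n' > record.fa

cat record.fa
```

Note that `tr` replaces every tab on the line, so a header that itself contains a tab would be split as well.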

Thank you so much. Yes, I finally got that right. Can I ask one more question? What if I want the FASTA file in the format (title)(space)(seq)? Is that possible?

written 18 months ago by Jen

Please don't be lazy; try to understand how this works.

written 18 months ago by Pierre Lindenbaum
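
For the (title)(space)(seq) layout asked about above, one possible approach, offered only as a sketch, is to change the final `tr` so that the tab becomes a space instead of a newline. The file names `spaced_in.fa` and `spaced_out.txt` and the sample records are assumptions.

```shell
# Hypothetical input with one record containing a stop codon.
cat > spaced_in.fa <<'EOF'
>p1
ACDLA
>p2
GTCTY*ATCG
>p3
AGATE
EOF

# Same linearize and filter steps as in the answer; only the last step differs:
# the tab separator is replaced by a space, so each record stays on one line.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' spaced_in.fa |
  awk -F '\t' '!($2 ~ /\*/)' |
  tr '\t' ' ' > spaced_out.txt

cat spaced_out.txt
```

The result is no longer standard FASTA, and if a header already contains spaces the boundary between title and sequence becomes ambiguous.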