Question: Remove sequences with stop codons from FASTA file
Jen10 wrote:

Hi all,

I have a FASTA file of ORF (amino acid) sequences, containing around 500,000 sequences. I need to remove the sequences that contain "*" stop codons. Is this possible with sed, cat, awk, or similar command-line tools?

Thanks

aminoacid sequence • 1.7k views
modified 22 months ago by Pierre Lindenbaum120k • written 22 months ago by Jen10

Assuming the file contains linearized amino acid sequences in FASTA format (each sequence on a single line after its header), the code below removes the records with stop codons. Here is a short example file with stop codons at different positions:

$ cat test.fa 
>p1
ACDLA
>p2
ADCGLAGCTYLAKQ*
>P3
GTCTY*ATCG
>P4
*GAP
>P5
AGATE

code:

$ grep -B1 \*  test.fa | grep -vFf - test.fa

output:

$ grep -B1 \*  test.fa | grep -vFf - test.fa
>p1
ACDLA
>P5
AGATE
modified 22 months ago • written 22 months ago by cpad011211k
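
The two-grep approach above depends on fixed-string substring matching: with `grep -F`, a flagged header such as `>p1` would also remove a header such as `>p10`, and GNU grep prints `--` separator lines between non-adjacent `-B1` groups, which then become patterns too. A minimal single-pass sketch with awk avoids both issues, under the same assumption that the input is linearized (one header line followed by exactly one sequence line, no blank lines). The function name `filter_stops` is hypothetical.

```shell
# Sketch: drop two-line FASTA records whose sequence contains '*'.
# Assumes linearized input: each header line is followed by exactly
# one sequence line. filter_stops is a hypothetical helper name.
filter_stops() {
  awk '
    /^>/  { header = $0; next }
    !/\*/ { print header; print $0 }
  ' "$1"
}
```

Called as `filter_stops test.fa` on the example file above, this prints the `>p1` and `>P5` records only. The first awk rule remembers the most recent header; the second prints the header and sequence together only when the sequence line contains no `*`.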
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

Linearize the file and filter it with awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk -F '\t'  '!($2 ~ /\*/)' | tr "\t" "\n"
written 22 months ago by Pierre Lindenbaum120k

Thanks, Pierre. I linearized the FASTA file using awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta > output.fasta and then ran the filter, awk -F '\t' '!($2 ~ /\*/)' input.fasta > output.fasta. This removes every sequence containing the * character from my file.

So what is the tr command for?

modified 22 months ago by genomax68k • written 22 months ago by Jen10

It converts the lines back to FASTA: it transforms each '(title)(tab)(seq)' line into '(title)(newline)(seq)'. See "convert back to fasta".

modified 22 months ago • written 22 months ago by Pierre Lindenbaum120k
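
To make the role of each stage easier to see, here is the same one-liner split into three steps. This is only a sketch: the function names (`linearize`, `drop_stops`, `to_fasta`, `remove_stop_records`) are hypothetical, and it assumes headers contain no tab characters, since the tab is used as the temporary separator.

```shell
# The awk | awk | tr pipeline from the answer, split into named stages.
# All function names here are hypothetical.

# 1. Linearize: one record per line as "header<TAB>sequence",
#    joining sequences that wrap across several lines.
linearize() {
  awk '/^>/ {printf("%s%s\t", (N>0 ? "\n" : ""), $0); N++; next}
       {printf("%s", $0)}
       END {printf("\n")}' "$1"
}

# 2. Filter: keep records whose second tab-separated field
#    (the sequence) contains no "*".
drop_stops() {
  awk -F '\t' '!($2 ~ /\*/)'
}

# 3. Restore FASTA: turn each tab back into a newline.
to_fasta() {
  tr '\t' '\n'
}

remove_stop_records() {
  linearize "$1" | drop_stops | to_fasta
}
```

Running `remove_stop_records input.fa > filtered.fa` should give the same result as the original pipeline. Without the final `tr` stage the output stays in the intermediate one-record-per-line form, which is why the filtered file does not look like FASTA until that step runs.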

Thank you so much. Yes, I got it right in the end. Can I ask one more question? What if I want the FASTA file in the format (title)(space)(seq)? Is that possible?

written 22 months ago by Jen10

please, don't be lazy and try to understand how this works.

written 22 months ago by Pierre Lindenbaum120k