Question: Remove sequences with stop codons from FASTA file
2.2 years ago, Jen wrote:

Hi all,

I have a FASTA file of around 500,000 ORF (amino acid) sequences. I need to remove the sequences that contain "*" stop codons. Is this possible from the command line using sed, cat, awk, etc.?

Thanks

aminoacid sequence • 1.9k views
modified 2.2 years ago by Pierre Lindenbaum • written 2.2 years ago by Jen

Assuming the file contains linearized amino acid sequences in FASTA format (one sequence line per header), the following code does the job. The short example below has stop codons at different positions:

$ cat test.fa 
>p1
ACDLA
>p2
ADCGLAGCTYLAKQ*
>P3
GTCTY*ATCG
>P4
*GAP
>P5
AGATE

code:

$ grep -B1 \*  test.fa | grep -vFf - test.fa

output:

$ grep -B1 \*  test.fa | grep -vFf - test.fa
>p1
ACDLA
>P5
AGATE
modified 2.2 years ago • written 2.2 years ago by cpad0112
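
One caveat with the grep one-liner above: `-F` matches each collected line as a substring, so a header such as `>p2` would also drop `>p20`, and when the hits are not adjacent `grep -B1` prints `--` separators that then become patterns too. A minimal sketch of a stricter variant, assuming a linearized file; the function name `drop_stop_grep` is only for illustration and is not from the answer:

```shell
# Hypothetical helper: drop every record whose sequence line contains '*'.
# Assumes one sequence line per header (linearized FASTA).
drop_stop_grep() {
    # Step 1: collect each offending sequence line plus the header before it.
    # Step 2: remove exactly those lines. -x forces whole-line matches, so
    #         ">p2" no longer removes ">p20", and a "--" separator from
    #         step 1 only matches a line that is exactly "--".
    grep -B1 '\*' "$1" | grep -vxFf - "$1"
}
```

Usage would be `drop_stop_grep test.fa > filtered.fa`. This still treats identical header lines as the same record, so it relies on headers being unique.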
2.2 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Linearize and filter with awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk -F '\t'  '!($2 ~ /\*/)' | tr "\t" "\n"
written 2.2 years ago by Pierre Lindenbaum

Thanks Pierre. I linearized the FASTA file using awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta > output.fasta and then ran the second awk command, awk -F '\t' '!($2 ~ /\*/)' input.fasta > output.fasta. This removes every sequence containing the * character from my file.

So what is the tr command for?

modified 2.2 years ago by genomax • written 2.2 years ago by Jen

It converts the lines back to FASTA: it transforms '(title)(tab)(seq)' into '(title)(newline)(seq)'. See "convert back to fasta".

modified 2.2 years ago • written 2.2 years ago by Pierre Lindenbaum
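
To see what each stage of the pipeline above does, it can help to run the stages one at a time. The sketch below splits the same awk and tr commands into three shell functions; the function names are hypothetical and chosen only for readability:

```shell
# Stage 1: linearize, one record per line as "header<TAB>sequence".
linearize() {
    awk '/^>/ {printf("%s%s\t", (N>0 ? "\n" : ""), $0); N++; next}
         {printf("%s", $0)}
         END {printf("\n")}' "$1"
}

# Stage 2: keep only records whose sequence (field 2) has no '*'.
drop_stop() {
    awk -F '\t' '$2 !~ /\*/'
}

# Stage 3: turn the tab back into a newline to restore the FASTA layout.
delinearize() {
    tr '\t' '\n'
}

# Usage: linearize input.fa | drop_stop | delinearize > filtered.fa
```

Stopping after stage 2 leaves the tab-separated form, which is why the output does not look like FASTA until `tr` has run.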

Thank you so much. Yes, I finally got that right. Can I ask one more question? What if I want the file in the format (title)(space)(seq)? Is that possible?

written 2.2 years ago by Jen

Please don't be lazy; try to understand how this works.

written 2.2 years ago by Pierre Lindenbaum
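
For the (title)(space)(seq) layout asked about above, the hint is that only the last stage needs to change: the tab becomes a space instead of a newline. A minimal sketch, assuming headers contain no tab characters; the function name `filter_space_sep` is hypothetical:

```shell
# Hypothetical helper: linearize, drop records containing '*', and print
# each surviving record as "header<SPACE>sequence" on a single line.
filter_space_sep() {
    awk '/^>/ {printf("%s%s\t", (N>0 ? "\n" : ""), $0); N++; next}
         {printf("%s", $0)}
         END {printf("\n")}' "$1" |
        awk -F '\t' '$2 !~ /\*/' |
        tr '\t' ' '
}
```

Note that the result is no longer FASTA, and headers that already contain spaces make the header/sequence boundary ambiguous for downstream tools.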