Hi all,
I have a FASTA file with around 500,000 ORF (amino acid) sequences. I need to remove the sequences that contain "*" stop codons. Is it possible to do this from the command line with sed, cat, awk, etc.?
Thanks
Linearize and filter with awk:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk -F '\t' '!($2 ~ /\*/)' | tr "\t" "\n"
Thanks Pierre.
I linearized the fasta file using awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta > output.fasta
and then used the awk command awk -F '\t' '!($2 ~ /\*/)' input.fasta > output.fasta. This filters every sequence containing the * character out of my file.
So what is the tr command for?
It converts the lines back to FASTA: it transforms each '(title)(tab)(seq)' line into '(title)(newline)(seq)'.
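To make the three stages concrete, here is a small sketch that wraps the pipeline from the answer in a shell function and runs it on a made-up four-record file. The function name filter_stop_codons, the file name example.fa, and the sequences themselves are assumptions for illustration only; they are not from the original posts.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline from the answer above.
# Stage 1 (first awk): linearize, one record per line as "header<TAB>sequence".
# Stage 2 (second awk): keep only records whose sequence field has no "*".
# Stage 3 (tr): turn each tab back into a newline to restore the FASTA layout.
filter_stop_codons() {
    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$1" |
        awk -F '\t' '!($2 ~ /\*/)' |
        tr "\t" "\n"
}

# Made-up example input: seq2 has an internal "*", seq3 a terminal one.
cat > example.fa <<'EOF'
>seq1
MKTAYIAK
QRQISFVK
>seq2
MKT*YIAK
>seq3
MKTAYIAK*
>seq4
MSLLTEVE
EOF

filter_stop_codons example.fa
```

On this input only seq1 and seq4 survive, and seq1 comes back with its sequence joined onto a single line (MKTAYIAKQRQISFVK), because the pipeline does not re-wrap sequences. The approach also assumes that header lines contain no tab characters, since tr would turn those into line breaks as well.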
Assuming the file contains linearized amino acid sequences in FASTA format, the following is the code, shown with short example sequences that have stop codons at different positions:
code:
output: