Question

Remove specific sequence group in fasta file

0

4.8 years ago

USER • 0

I have several FASTA files, each containing the sequences of an organism (or species) together with its reference sequences (CDS). I would like to remove the reference sequences from these files, leaving only the sequences of the organism.

sequence genome R gene software error • 1.1k views

ADD COMMENT • link 4.4 years ago by USER • 0

score 1 · Answer 1 · 2020-10-11

1

4.8 years ago

GenoMax 152k

One way would be to linearize the fasta sequences (courtesy of @Pierre's gist, which can easily be found by searching for "linearize fasta"), so that each record sits on a single line as header, tab, sequence. Then grep "^>gb" to keep the sequences you want and reformat back to fasta.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa  | grep "^>gb" | tr "\t" "\n" > final.fa
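To see what each stage of the pipeline does, here is a small sketch on a made-up two-record file. The file names toy.fa and toy_final.fa, the headers, the sequences, and the linearize function name are all hypothetical and only there for illustration.

```shell
# Hypothetical toy input: one ">gb" record and one ">lcl" record,
# each with its sequence wrapped over two lines.
printf '>gb|A1\nACGT\nAC\n>lcl|ref1\nTTTT\nGG\n' > toy.fa

# Step 1: linearize, so each record becomes "header<TAB>sequence" on one line.
linearize() {
  awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$1"
}

# Step 2: keep only the ">gb" records.
# Step 3: turn the tab back into a newline to restore fasta layout.
linearize toy.fa | grep "^>gb" | tr "\t" "\n" > toy_final.fa
cat toy_final.fa
```

On this toy input, toy_final.fa holds the header >gb|A1 followed by ACGTAC on a single line. The pipeline does not re-wrap the sequence to its original line width, and it assumes the headers themselves contain no tab characters.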

ADD COMMENT • link 4.8 years ago by GenoMax 152k

0

I would like to remove all sequences whose headers start with lcl, which are the reference CDS. Does this approach work for that too?

ADD REPLY • link 4.8 years ago by USER • 0

0

Try the above. It keeps only the sequences whose headers start with >gb, so anything whose header starts with >lcl is removed.
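If the files might contain headers with prefixes other than gb and lcl, the filter can be inverted with grep -v so that only the lcl records are dropped. This is a minimal sketch, assuming every reference header begins with >lcl and no header contains a tab. The function name drop_lcl and the file names are hypothetical.

```shell
# Hypothetical helper: remove every record whose header starts with ">lcl"
# and keep all other records, whatever their prefix.
drop_lcl() {
  # Linearize each record to "header<TAB>sequence" on one line.
  awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$1" |
    grep -v "^>lcl" |   # -v inverts the match, so the lcl records are dropped
    tr "\t" "\n"        # restore the header/sequence line break
}

# Usage: drop_lcl input.fa > final.fa
```

As with the original command, each kept sequence comes out on a single line instead of being wrapped.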

ADD REPLY • link 4.8 years ago by GenoMax 152k