Remove a specific group of sequences from a FASTA file
19 months ago
USER • 0

I have several FASTA files, each containing the sequences of an organism or species together with its reference sequences (CDS). I would like to remove the reference sequences from each file, leaving only the sequences of the organism.

1
19 months ago
GenoMax 115k

One way would be to linearize the FASTA sequences (courtesy of @Pierre's gist, which is easy to find by searching for "linearize fasta") so that each record sits on a single line. Then grep "^>gb" to keep the sequences you want and convert back to FASTA.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa  | grep "^>gb" | tr "\t" "\n" > final.fa

0
I would like to remove all sequences whose headers start with lcl (the reference CDS). Does this work for that too?

0
Try the above. It only keeps sequences whose headers start with >gb, so anything starting with >lcl is dropped.
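
If you would rather exclude the lcl records explicitly than keep only the gb ones, a minimal sketch of the same linearize/filter/restore idea is below. It swaps the keep-filter for `grep -v`, which inverts the match. The file contents and header names (`gb|AAA1|sample`, `lcl|REF_cds_1`) are made up for illustration, and it assumes your headers contain no tab characters.

```shell
# Hypothetical example input: two "gb" records and one "lcl" reference CDS.
printf '>gb|AAA1|sample\nACGT\nACGT\n>lcl|REF_cds_1\nTTTT\nGGGG\n>gb|AAA2|sample\nCCCC\n' > input.fa

# 1) Linearize: header and sequence joined by a tab, one record per line.
# 2) Drop records whose header starts with ">lcl" (grep -v inverts the match).
# 3) Turn the tab back into a newline to restore FASTA layout.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa \
  | grep -v '^>lcl' \
  | tr '\t' '\n' > final.fa
```

Note that each sequence in final.fa ends up on a single line rather than wrapped at a fixed width, which most tools accept.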