Remove specific sequence group in fasta file
3.5 years ago
USER • 0

I have several FASTA files, each containing the sequences of an organism (species) together with its reference sequences (CDS). I would like to remove the reference sequences from these files, leaving only the sequences of the organism.

sequence genome R gene software error • 857 views
3.5 years ago
GenoMax 141k

One way would be to linearize the FASTA sequences (courtesy of @Pierre's gist, which can easily be found by searching for "linearize fasta") so that each record sits on a single line. Then grep "^>gb" to keep the sequences you want and reformat back to FASTA.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa  | grep "^>gb" | tr "\t" "\n" > final.fa

I would like to remove all sequences whose headers start with lcl, i.e. the reference CDS. Does this approach work for that too?


Try the above. It keeps only the sequences whose headers start with >gb, so everything else, including the lcl entries, is dropped.
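If the goal is stated the other way round (drop the lcl records and keep whatever else is in the file), the same pipeline can be sketched with an inverted match. This is only an illustration: the file names demo_input.fa and demo_final.fa and the two sample records are made up for the demo, and it assumes the reference headers really begin with ">lcl" and that no header contains a tab character.

```shell
# Hypothetical sample input: one organism record (gb) and one reference CDS (lcl).
cat > demo_input.fa <<'EOF'
>gb|AB000001.1| organism sequence
ACGTACGT
TTGGCCAA
>lcl|REF_cds_1 reference CDS
ATGAAATAG
EOF

# 1. Linearize: each record becomes "header<TAB>sequence" on one line.
# 2. grep -v drops every record whose header starts with >lcl.
# 3. tr turns the tab back into a newline, restoring FASTA layout.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < demo_input.fa \
  | grep -v "^>lcl" \
  | tr "\t" "\n" > demo_final.fa

cat demo_final.fa
```

Note that the sequence of each kept record ends up on a single line rather than wrapped at a fixed width, which most tools accept. Using grep -v "^>lcl" instead of grep "^>gb" matters only when the file holds records with other prefixes that should also be kept.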
