Extract sequences based on keywords in headers (no perl or python code)
9.0 years ago
Crystal ▴ 70

Hi All,

I did find some similar questions, but I don't know how to apply the answers to my problem.

I guess it's very simple for everyone here.

I have a FASTA file containing 2000+ sequences with headers like:

>VFG0001 fhaB filamentous hemagglutinin/adhesin [FHA] [Bordetella pertussis]

The content of the last [ ] is the bacterial species name.

For some other sequences, the header looks like this:

>VFG2304 misL putative autotransporter [MisL] [Salmonella enterica (serovar typhimurium)]

So I just want to extract the sequences whose headers contain the keyword [Escherichia coli].

I also want to extract the sequences whose headers contain the keyword [Salmonella enterica] (I don't care about the serovar within the parentheses) into a separate FASTA file.

How can I do this simply on a Mac server (preferably without Python or Perl, just a simple script)?

Thank you

9.0 years ago

In Terminal:

grep -A 1 '[Escherichia coli' NAME_OF_FILE > E_coli.fasta

This assumes FASTA; change the number after -A to 3 for FASTQ. It will also work in cases (such as the Salmonella example) where strain information follows the species name.


Thank you so much. It is very helpful. Just a little modification:

grep -A 1 'Escherichia coli' NAME_OF_FILE > E_coli.fasta

because with the [ in the pattern, the terminal reported an unmatched [ error (grep reads an unescaped [ as the start of a bracket expression).
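
If you would rather keep the opening bracket in the pattern, so that only the organism field can match, one option is fixed-string matching with -F. The lines below are a minimal sketch, not a tested recipe for the real database: demo.fasta and its three records are invented for illustration, and the sequences are assumed to sit on a single line each. Note that grep prints a "--" line between non-adjacent groups of context lines, so the sketch filters those out.

```shell
# Build a tiny single-line FASTA for illustration (records are made up).
printf '%s\n' \
  '>demo1 geneA hypothetical protein [GeneA] [Escherichia coli O157:H7]' \
  'ATGAAC' \
  '>demo2 geneB hypothetical protein [GeneB] [Bordetella pertussis]' \
  'ATGCCC' \
  '>demo3 geneC hypothetical protein [GeneC] [Escherichia coli]' \
  'ATGGGG' > demo.fasta

# -F: treat the pattern as a literal string, so '[' needs no escaping.
# -A 1: print each matching header plus the one line that follows it.
# The second grep drops the "--" separators printed between match groups.
grep -F -A 1 '[Escherichia coli' demo.fasta | grep -v '^--$' > E_coli.fasta
```

Escaping the bracket as '\[Escherichia coli' without -F should behave the same way.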


Hi, actually there is a problem with the code. After extracting the sequences, only the first sequence line after each header appears in the new file, so the extracted sequences are incomplete.

Well, I figured out that the sequences in the original database span multiple lines instead of a single line, so I converted every sequence to a single line and the problem was solved.
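
The post does not say how the conversion was done. One common way is an awk one-liner that joins the wrapped sequence lines of each record; the following is only a sketch under that assumption, with multi.fasta and single.fasta as made-up file names and invented records.

```shell
# Build a tiny multi-line FASTA for illustration (records are made up).
printf '%s\n' \
  '>demo1 geneA hypothetical protein [GeneA] [Escherichia coli]' \
  'ATGAAC' \
  'CGTTAA' \
  '>demo2 geneB hypothetical protein [GeneB] [Bordetella pertussis]' \
  'ATGCCC' \
  'GGG' > multi.fasta

# Header lines are printed as-is; sequence lines are printed without a
# trailing newline so they join up. The newline that ends a sequence is
# emitted when the next header (or the end of the file) is reached.
awk '
  /^>/ { if (seq) printf "\n"; print; seq = 0; next }
       { printf "%s", $0; seq = 1 }
  END  { if (seq) printf "\n" }
' multi.fasta > single.fasta
```

After this step, every header in single.fasta is followed by exactly one sequence line, which is what grep -A 1 expects.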


Or you could use awk, which does not require converting the multi-line sequences to single lines:

https://infoplatter.wordpress.com/2013/10/15/extracting-specific-fasta-records-from-a-multi-fasta-file/comment-page-1/
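
For readers who cannot open the link, here is a rough sketch of the general idea. It is not necessarily the code from the linked post, and the file names and records are invented. A flag is set at every header depending on whether it contains the keyword, and each line is printed while the flag is on, so wrapped sequence lines stay with their header.

```shell
# Build a tiny multi-line FASTA for illustration (records are made up).
printf '%s\n' \
  '>demo1 geneA hypothetical protein [GeneA] [Escherichia coli]' \
  'ATGAAC' \
  'CGTTAA' \
  '>demo2 geneB hypothetical protein [GeneB] [Salmonella enterica (serovar typhimurium)]' \
  'ATGCCC' \
  'TTTAAA' \
  '>demo3 geneC hypothetical protein [GeneC] [Bordetella pertussis]' \
  'ATGGGG' > multi.fasta

# At each header, decide whether the record is wanted. index() does a
# literal substring search, so '[' and '(' need no escaping.
# The bare pattern "keep" prints every line while the flag is true.
awk -v key='[Salmonella enterica' '
  /^>/ { keep = (index($0, key) > 0) }
  keep
' multi.fasta > Salmonella.fasta
```

Changing key to '[Escherichia coli' and the output file name gives the second extraction.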


Sorry, I've been spending too much time analyzing short-read sequences, and overlooked multi-line FASTA. Glad you worked out a solution.
