Question

Extract sequences based on keywords in headers (no perl or python code)

2

9.1 years ago

Crystal ▴ 70

Hi All,

I did find some similar questions, but I don't know how to apply the answers to my problem.

I guess it's very simple for everyone here.

I have a FASTA file containing 2000+ sequences with headers like:

>VFG0001 fhaB filamentous hemagglutinin/adhesin [FHA] [Bordetella pertussis]

The content of the last [ ] is the name of the bacterium.

For some other sequences, the header looks like:

>VFG2304 misL putative autotransporter [MisL] [Salmonella enterica (serovar typhimurium)]

So I just want to extract the sequences whose headers contain the keyword [Escherichia coli].

I also want to extract the sequences whose headers contain the keyword [Salmonella enterica] (I don't care about the serovar within the parentheses) into a separate FASTA file.

How can I do this simply on a Mac server (preferably without Python or Perl, just a simple script)?

Thank you

sequence • 4.1k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by Crystal ▴ 70

Ram · Answer 1 · 2015-09-23

3

9.1 years ago

harold.smith.tarheel ★ 5.0k

In Terminal:

grep -A 1 '[Escherichia coli' NAME_OF_FILE > E_coli.fasta

Assumes single-line FASTA (each sequence on one line); change the number to 3 for FASTQ. This will also work in cases (such as the Salmonella example) where strain information is present after the species name.

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by harold.smith.tarheel ★ 5.0k

0

Thank you so much. It is very helpful. Just a little modification:

grep -A 1 'Escherichia coli' NAME_OF_FILE > E_coli.fasta

because with the [ included, the terminal reported an unmatched [.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by Crystal ▴ 70
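
The "unmatched [" error arises because grep reads [ as the start of a bracket expression. A minimal sketch of two ways to keep the bracket in the pattern, assuming single-line FASTA; the file demo_single_line.fasta and its records are made up for illustration:

```shell
# Hypothetical demo input; in practice use your own single-line FASTA file.
printf '%s\n' \
  '>VFG0001 fhaB [FHA] [Bordetella pertussis]' \
  'ACGT' \
  '>VFG0002 demoA [DemoA] [Escherichia coli]' \
  'GGCC' > demo_single_line.fasta

# -F treats the pattern as a fixed string, so "[" needs no escaping.
grep -F -A 1 '[Escherichia coli' demo_single_line.fasta > E_coli.fasta

# Equivalent with a regular expression: escape the bracket with a backslash.
grep -A 1 '\[Escherichia coli' demo_single_line.fasta > E_coli_regex.fasta
```

Keeping the [ makes the match a little stricter, since it only hits species names that open a bracketed field. Note that when several non-adjacent records match, grep -A also writes a "--" separator line between groups, which may need to be filtered out of the result.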

0

Hi, actually there is a problem with the code. After extracting the sequences, only the first line of each sequence is written to the new file along with its header, so none of the extracted sequences are complete.

Well, I figured out that the sequences in the original database span multiple lines instead of a single line, so I just converted all the sequences to single lines and the problem was solved.

ADD REPLY • link 9.1 years ago by Crystal ▴ 70
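
The post above does not say how the conversion was done. One possible way, sketched here with awk under the assumption that header lines start with ">" and everything else is sequence; demo_multi.fasta and demo_linear.fasta are made-up file names:

```shell
# Hypothetical multi-line demo input.
cat > demo_multi.fasta <<'EOF'
>VFG0001 fhaB [FHA] [Bordetella pertussis]
ACGT
TTAA
>VFG0002 demoA [DemoA] [Escherichia coli]
GGCC
AATT
EOF

# Join the sequence lines of each record into a single line.
awk '
  /^>/ { if (seq != "") print seq   # flush the previous record, if any
         print                      # print the header unchanged
         seq = ""
         next }
       { seq = seq $0 }             # append this line to the current sequence
  END  { if (seq != "") print seq } # flush the last record
' demo_multi.fasta > demo_linear.fasta
```

After this step every record occupies exactly two lines, which is what grep -A 1 relies on.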

0

Or you could use awk, without needing to convert the multi-line sequences to single lines:

https://infoplatter.wordpress.com/2013/10/15/extracting-specific-fasta-records-from-a-multi-fasta-file/comment-page-1/

ADD REPLY • link 9.1 years ago by Siva ★ 1.9k
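
As a rough illustration of the awk idea (not necessarily the approach in the linked post), the sketch below sets a flag on each header line and prints every line while the flag is on, so multi-line records come out whole. The input file and output names are hypothetical:

```shell
# Hypothetical multi-line demo input.
cat > demo_headers.fasta <<'EOF'
>VFG0001 fhaB [FHA] [Bordetella pertussis]
ACGT
TTAA
>VFG0002 demoA [DemoA] [Escherichia coli]
GGCC
AATT
>VFG2304 misL [MisL] [Salmonella enterica (serovar typhimurium)]
CCGG
EOF

# On each header, decide whether to keep the record; index() does a plain
# substring search, so "[" needs no escaping. The bare "keep" pattern then
# prints every line (header and sequence) while the flag is set.
awk '/^>/ { keep = (index($0, "[Escherichia coli") > 0) } keep' \
  demo_headers.fasta > E_coli_multi.fasta

# Same idea for Salmonella enterica; omitting the closing "]" lets any
# serovar in parentheses match as well.
awk '/^>/ { keep = (index($0, "[Salmonella enterica") > 0) } keep' \
  demo_headers.fasta > Salmonella.fasta
```

Because the output is produced record by record, no separator lines are added and no prior conversion to single-line FASTA is needed.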

0

Sorry, I've been spending too much time analyzing short-read sequences, and overlooked multi-line FASTA. Glad you worked out a solution.

ADD REPLY • link 9.1 years ago by harold.smith.tarheel ★ 5.0k