Question: Extract sequences based on keywords in headers (no perl or python code)
3.8 years ago by
Crystal30
United States
Crystal30 wrote:

Hi All,

I found some similar questions, but I don't know how to apply the answers to my problem.

I guess it's very simple for everyone here.

I have a FASTA file containing 2000+ sequences with headers like:

>VFG0001 fhaB filamentous hemagglutinin/adhesin [FHA] [Bordetella pertussis]

The content of the last [ ] is the bacterial species name.

For some other sequences, the header looks like this:

>VFG2304 misL putative autotransporter [MisL] [Salmonella enterica (serovar typhimurium)]

 

I want to extract the sequences whose headers contain the keyword [Escherichia coli].

I also want to extract the sequences whose headers contain the keyword [Salmonella enterica] (I don't care about the serovar within the parentheses) into a separate FASTA file.

How can I do this simply on a Mac server (preferably without Python or Perl, just a simple script)?

Thank you.

 

modified 3.8 years ago by harold.smith.tarheel4.4k • written 3.8 years ago by Crystal30
3.8 years ago by
United States
harold.smith.tarheel4.4k wrote:

In Terminal:

grep -A 1 '[Escherichia coli' NAME_OF_FILE > E_coli.fasta

Assumes FASTA; change number to 3 for FASTQ. Will also work in cases (such as the Salmonella example) where strain information is present after the species name.

written 3.8 years ago by harold.smith.tarheel4.4k

Thank you so much. It is very helpful. Just a little modification: 

grep -A 1 'Escherichia coli' NAME_OF_FILE > E_coli.fasta

because with "[" the terminal said "unmatched [ ".
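The "unmatched [" error comes from grep reading "[" as the start of a bracket expression. A possible way to keep the bracket in the pattern is the -F option, which makes grep treat the pattern as a fixed string. The sketch below still assumes one sequence line per record. The file sample.fasta and its sequences are made-up placeholders (only the two headers quoted in the question are real), and the extra grep -v step is there because grep -A prints a "--" line between non-adjacent groups of matches, which would otherwise end up in the output file.

```shell
# Hypothetical test file: placeholder sequences, one line per record.
cat > sample.fasta <<'EOF'
>VFG0001 fhaB filamentous hemagglutinin/adhesin [FHA] [Bordetella pertussis]
AAAAAAAAAA
>example1 placeholder gene [X] [Escherichia coli]
CCCCCCCCCC
>VFG2304 misL putative autotransporter [MisL] [Salmonella enterica (serovar typhimurium)]
GGGGGGGGGG
>example2 placeholder gene [Y] [Escherichia coli]
TTTTTTTTTT
EOF

# -F: match the pattern as a literal string, so "[" needs no escaping.
# -A 1: also print the one line after each matching header.
# grep -v '^--$': drop the group separators that -A inserts.
grep -F -A 1 '[Escherichia coli' sample.fasta | grep -v '^--$' > E_coli.fasta

# The open-ended pattern also matches headers with a serovar in parentheses.
grep -F -A 1 '[Salmonella enterica' sample.fasta | grep -v '^--$' > S_enterica.fasta
```

With the leading "[" kept, the species name is only matched where it opens a bracketed field, not where it merely appears somewhere in a gene description.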

written 3.8 years ago by Crystal30

Hi, actually there is a problem with the code. After extracting the sequences, the new file contains only the header and the first sequence line of each record, so the extracted sequences are incomplete.

Well, I figured out that the sequences in the original database span multiple lines instead of a single line, so I converted every sequence to a single line and that solved the problem.
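For anyone wondering how such a conversion might be done without Perl or Python, here is a minimal awk sketch. It is not necessarily the method used above; the file names and the sequences are made-up placeholders.

```shell
# Hypothetical multi-line FASTA with placeholder sequences.
cat > multiline.fasta <<'EOF'
>seq1 placeholder [Escherichia coli]
AAAA
CCCC
>seq2 placeholder [Bordetella pertussis]
GGGG
TTTT
EOF

# Header lines: finish the previous sequence with a newline (except before
# the first record), then print the header on its own line.
# Sequence lines: print without a trailing newline so they join together.
# END: terminate the last sequence line.
awk '/^>/ { if (NR > 1) printf "\n"; print; next }
     { printf "%s", $0 }
     END { printf "\n" }' multiline.fasta > singleline.fasta
```

After this step every record is exactly two lines (header, then sequence), which is the layout grep -A 1 relies on.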

modified 3.8 years ago • written 3.8 years ago by Crystal30

Or you could use Awk, which does not require converting the multi-line sequences to single lines first:

https://infoplatter.wordpress.com/2013/10/15/extracting-specific-fasta-records-from-a-multi-fasta-file/comment-page-1/
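A rough sketch of that kind of Awk approach follows. It is not copied from the linked post; the variable names kw and keep, the file names, and the sequences are placeholders chosen for illustration. The idea is to set a flag on each header line and print every line while the flag is on, so all sequence lines of a matching record are kept.

```shell
# Hypothetical multi-line FASTA with placeholder sequences.
cat > multiline.fasta <<'EOF'
>VFG0001 fhaB filamentous hemagglutinin/adhesin [FHA] [Bordetella pertussis]
AAAA
AAAA
>example1 placeholder gene [X] [Escherichia coli]
CCCC
CCCC
CC
>VFG2304 misL putative autotransporter [MisL] [Salmonella enterica (serovar typhimurium)]
GGGG
GGGG
EOF

# On each header line, keep becomes 1 if the header contains kw, else 0.
# index() does a plain substring search, so "[" is not treated as regex syntax.
# The bare pattern "keep" prints the current line whenever keep is non-zero.
awk -v kw='[Escherichia coli' '/^>/ { keep = (index($0, kw) > 0) } keep' multiline.fasta > E_coli.fasta
awk -v kw='[Salmonella enterica' '/^>/ { keep = (index($0, kw) > 0) } keep' multiline.fasta > S_enterica.fasta
```

Because the flag only changes on header lines, the number of sequence lines per record does not matter.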

written 3.8 years ago by Siva1.6k

Sorry, I've been spending too much time analyzing short-read sequences, and overlooked multi-line FASTA. Glad you worked out a solution.

written 3.8 years ago by harold.smith.tarheel4.4k