Running Glimmer: Training On Closely Related Species Sequences
12.6 years ago
Pasta ★ 1.3k

Hi,

I have several contigs from sequencing a bacterial strain. I would like to predict ORFs using GLIMMER, training it on the genes of a closely related species. The problem is that I cannot figure out how to do that. I read the documentation, but I am still very confused because there is no description of the input file formats, or perhaps I missed something.

Anyway, if someone could explain how to perform the training on the genes of a related species, and then use that training on my new bug, I would appreciate it.

Thank you.

12.6 years ago
Neilfws 49k

GLIMMER3 is designed to be flexible; it contains several tools that can be combined in different ways to build pipelines. This does make the documentation somewhat confusing.

I'll assume that you've downloaded, extracted and built GLIMMER3 successfully:

wget http://www.cbcb.umd.edu/software/glimmer/glimmer302.tar.gz
tar zxvf glimmer302.tar.gz
cd glimmer3.02/src
make

The make step often fails because GLIMMER3 is old (2006) and newer versions of GCC fail to compile it. Apply the patch from near the end of this thread if that happens.

Next, carefully examine the script g3-from-training.csh in the scripts directory and the output in the sample-run directory. This will help you understand how GLIMMER3 works.

Training on another genome is quite simple:

build-icm [options] icm_file < input-file

The input file is just a set of coding sequences in FASTA format from your related species. Look at the file from-training.train in the sample-run directory for an example. Each sequence should contain a start codon but not a stop codon.
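If your related-species coding sequences still end in a stop codon, something like the following could trim them before training. This is only a sketch: strip_stop_codons is a hypothetical helper name, it assumes the standard stop codons TAA, TAG and TGA, and it writes each sequence upper-cased on a single line.

```shell
# Hypothetical helper (not part of GLIMMER3): read nucleotide FASTA on stdin,
# remove one terminal stop codon (TAA, TAG or TGA) from each record if present,
# and write FASTA to stdout with each sequence on a single upper-case line.
strip_stop_codons() {
  awk '
    function flush_record() {
      if (header == "") return
      s = toupper(seq)
      if (s ~ /(TAA|TAG|TGA)$/) s = substr(s, 1, length(s) - 3)
      print header
      print s
    }
    /^>/ { flush_record(); header = $0; seq = ""; next }
    { gsub(/[ \t\r]/, ""); seq = seq $0 }
    END { flush_record() }
  '
}
```

For example, strip_stop_codons < related_cds.fasta > related.train, where both file names are placeholders for your own files.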

You then run glimmer3, supplying your contig sequence (again in FASTA format and here called sequence-file) and the icm_file as arguments along with any options:

glimmer3 [options] sequence-file icm-file tag

Here tag is just the prefix for the output files. One useful option is -b pwm-file, which supplies a PWM (position weight matrix) for identifying ribosome binding sites; this improves accuracy. The g3-from-training.csh script uses the ELPH program to build the PWM and awk to process the output.
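Putting the two steps together, a minimal sketch might look like this. The function name, file names and the DRY_RUN switch are all made up for illustration. The -r option to build-icm is what I recall the bundled g3 scripts using, and -l tells glimmer3 to treat the sequences as linear, which seems appropriate for contigs; check both against --help before relying on them. The optional -b pwm-file step is left out.

```shell
# Sketch only: run_glimmer_from_related and DRY_RUN are hypothetical names.
# Set DRY_RUN=1 to print the commands instead of running them.
run_glimmer_from_related() {
  train=$1    # FASTA of coding sequences from the related species
  contigs=$2  # FASTA of your assembled contigs
  tag=$3      # prefix for the ICM and the glimmer3 output files

  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "build-icm -r $tag.icm < $train"
    echo "glimmer3 -l $contigs $tag.icm $tag"
    return 0
  fi

  # Step 1: build the interpolated context model from the training genes.
  build-icm -r "$tag.icm" < "$train" || return 1
  # Step 2: predict genes in the contigs using that model.
  glimmer3 -l "$contigs" "$tag.icm" "$tag"
}
```

Called as run_glimmer_from_related related.train contigs.fasta mybug, this should leave the predictions in mybug.predict and the per-ORF details in mybug.detail.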

Options for any of the compiled programs can be viewed by appending --help to the program name.

Hope this helps you to get started.


Thank you for this comprehensive answer, Neil!
