How do I use Glimmer 3.02?
4.5 years ago
nattzy94 ▴ 50

Sorry, this is a really basic question. I've downloaded Glimmer 3.02 and installed it according to the instructions in the notes, but I have no idea how to input my gene sequence into the software, or how to use it at all for that matter.

Hope someone can help.

sequencing • 5.3k views

This PDF should also be in your local software download.


Hi genomax, thanks for the reply. I've read the document and got as far as the installation, but I can't understand how to run it. For the first step, building the ICM, how do I get those sequences and how do I input them into the program?

Also, I'm not sure if the issue is that I am using a Mac and it might not be compatible.


See: Running Glimmer: Training On Closely Related Species Sequences

While the directions say the program is macOS compatible, it is more than 10 years old at this point and macOS has undergone many changes since then. If you were able to get the program to compile and run, you are probably fine to proceed.
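
For orientation, the simplest way to get an ICM is to let Glimmer train on the genome itself. The sketch below prints the four commands of that "from scratch" procedure as I understand it from the Glimmer 3.02 release notes and the bundled g3-from-scratch.csh script. `genome.fasta` and the `run1` prefix are placeholders, `glimmer_plan` is a made-up helper name, and the option values are the script defaults rather than anything tuned for your data, so check them against your copy of the notes.

```shell
# Print the Glimmer3 "from scratch" commands for a genome so they can be
# read (or piped to sh) before anything is executed.
# Assumes the Glimmer bin directory is on PATH; file names are placeholders.
glimmer_plan() {
    genome=$1   # FASTA file with the genome sequence
    tag=$2      # prefix used for all output files
    cat <<EOF
long-orfs -n -t 1.15 $genome $tag.longorfs
extract -t $genome $tag.longorfs > $tag.train
build-icm -r $tag.icm < $tag.train
glimmer3 -o50 -g110 -t30 $genome $tag.icm $tag
EOF
}

glimmer_plan genome.fasta run1
```

To run the steps for real, pipe the output to the shell: `glimmer_plan genome.fasta run1 | sh`. The predictions should then be in `run1.predict`, with more detail in `run1.detail`. The "closely related species" route in the notes differs mainly in where the training sequences come from.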


OK, thanks a lot. I will work through it. Just to check, am I supposed to be working in the terminal window that opens when I open build-icm? I can't seem to type anything in that terminal.


This is a command-line program that you have to run through the terminal. Did you get the source code and compile the program on macOS? A Linux binary will not work on macOS as is.


I downloaded Glimmer from https://ccb.jhu.edu/software/glimmer/ onto my mac and followed the installation instructions. Is that correct?


That is correct. After compiling the program, do you have an executable called glimmer3 that you are able to run and get help output from?

./glimmer3 -h

Yes, I get a whole bunch of unix executable files in the 'bin' folder. However, when I open them, all I get is a terminal window in which I can't type anything.

screenshot of the terminal window after opening Glimmer: https://ibb.co/hLsbSx


Looks like you have managed to get the package compiled.

You can't double-click these executables to run them (as you normally would with GUI-based programs). Instead, you have to open the Terminal program and run them from the terminal window itself (Apple/Command key + Space --> search for Terminal, then open it).
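
In practice that looks something like the sketch below. The install path is a guess at where the package was unpacked (a hypothetical `~/glimmer3.02/bin`), so point `GLIMMER_BIN` at your own `bin` folder. The `check_tool` helper is made up for illustration; it only confirms that the programs are there before asking `glimmer3` for its help text.

```shell
# Hypothetical install location; change this to the bin folder that was
# created when you compiled Glimmer.
GLIMMER_BIN="${GLIMMER_BIN:-$HOME/glimmer3.02/bin}"

# Report whether a given Glimmer program is present and executable.
check_tool() {
    if [ -x "$GLIMMER_BIN/$1" ]; then
        echo "found"
    else
        echo "missing"
    fi
}

# List the programs needed for a basic gene-finding run.
for tool in long-orfs extract build-icm glimmer3; do
    echo "$tool: $(check_tool "$tool")"
done

# If glimmer3 is there, ask it for its help text.
if [ "$(check_tool glimmer3)" = "found" ]; then
    "$GLIMMER_BIN/glimmer3" -h
fi
```

Type (or paste) these lines at the terminal prompt and press Enter. If every tool is reported as "missing", the path is wrong rather than the installation.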

Using glimmer3 via the command line is going to require a basic understanding of the Unix command line (I am going to guess that you are not familiar with this). If so, I suggest you spend half a day going over the basics of UNIX for biologists using this excellent tutorial. It would be the best time investment for the future.

BTW: What are you planning to use glimmer for? There may be newer tools you could use instead.


Ok, thanks for the help so far. Will take a look at the tutorial.

I am trying to replicate methods for gene prediction and functional annotation in this paper: http://aem.asm.org/content/82/24/7063.full


Could you suggest some of the newer tools? For bacterial gene finding and annotation, I tried Prokka but it doesn't seem to work well (predicts way too many CDS). So I'm thinking of going back to tried and trusted glimmer.


I've never had an issue with Prokka before. Are you sure you don't have some contaminant in your assemblies or something?


I aligned to my known reference (E. coli) and visually everything seemed okay. But then Prokka got ~10000 CDS whereas one should only expect 4000-5000. Would the fact that my assembly is only based on nanopore reads have something to do with it? I'm guessing the indel errors could cause a lot of frameshifts and produce spurious ORFs (about half of the predicted genes are 'hypothetical proteins' so subtracting those would bring me to a more realistic figure). But then again, would frameshifts cause a literal doubling of predicted genes? Possible, but unlikely.

Edit: the hypothetical proteins are mostly quite short (below 500 bp)
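
One way to put numbers on this is to count the CDS features in the GFF file that Prokka writes and see how many are short. A minimal sketch, assuming a standard tab-separated GFF3 file (the name `annotation.gff` is a placeholder and `count_cds` is a made-up helper) and using the 500 bp cutoff mentioned above:

```shell
# Count CDS features in a GFF3 file and how many are shorter than a cutoff.
# Prints two numbers: total_CDS short_CDS
count_cds() {
    gff=$1
    cutoff=${2:-500}   # length cutoff in bp
    awk -F '\t' -v cutoff="$cutoff" '
        /^##FASTA/ { exit }        # stop at the embedded sequence section
        /^#/       { next }        # skip header and comment lines
        $3 == "CDS" {
            total++
            if ($5 - $4 + 1 < cutoff) small++
        }
        END { printf "%d %d\n", total, small }
    ' "$gff"
}

# Example call; only runs if the placeholder file exists.
if [ -f annotation.gff ]; then
    count_cds annotation.gff 500
fi
```

If a large share of the calls fall under the cutoff, that would be consistent with genes being broken up by frameshifts, though it does not prove it.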


That seems possible, but I'm not overly familiar with nanopore data. That number would seem on the high side to just be from frameshifts etc.

What is the assembly like (N50 etc)? Prokka normally errs on the conservative side when calling genes, so I strongly suspect it isn't prokka that's the issue here.

(This may be worth opening a new forum question for).


It's a single contig! I also thought Prokka would be more conservative. Maybe it's the parameters I'm not using? I'm using the most basic command:

% prokka contigs.fa

Hmm, how much depth of coverage did you end up with from the ONT data?

Yes, I would suggest using a more elaborate command. I typically use some or all of the following options (some are optional, and --proteins is only relevant if you have a database of trusted proteins, which may help if you do):

--addgenes
--locustag xxx     # Optional, but I like to be consistent with my tag forms
--compliant
--genus xxx
--species xxx       # Again these 3 are optional but pad the metadata in the header
--strain xxx
--gram neg  # Or pos if yours is positive
--proteins myproteins.fa      # File of trusted proteins, e.g. from an NCBI reference
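
Put together, a fuller invocation might look like the sketch below. The genus, species, strain, locus tag and output names are placeholders, and `prokka_cmd` is a made-up helper that only prints the command so it can be checked before running.

```shell
# Print a Prokka command that combines the options listed above.
# Every value passed in is a placeholder for illustration.
prokka_cmd() {
    contigs=$1; genus=$2; species=$3; strain=$4; tag=$5
    printf 'prokka --addgenes --compliant --gram neg'
    printf ' --genus %s --species %s --strain %s' "$genus" "$species" "$strain"
    printf ' --locustag %s --outdir %s_prokka --prefix %s' "$tag" "$tag" "$tag"
    printf ' %s\n' "$contigs"
}

prokka_cmd contigs.fa Escherichia coli K12 ECK
```

If you do have a file of trusted proteins, `--proteins myproteins.fa` can be added before the contigs file.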

How long is the genome you've ended up with?


Coverage is more than 1000X, which is quite excessive. The assembly ended up almost exactly as long as the reference, ~5 Mb.

Thanks! Let me try playing around with the parameters.


That should be more than enough coverage for accuracy, but that could have caused its own issues.

I would try downsampling to between 100X and 200X coverage and reassembling. I'm not sure whether this is still an issue for long-read assemblers, but for short-read assemblers, too much coverage can make them choke.
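
The arithmetic for the downsampling is simple enough to sketch: the fraction of reads to keep is the target coverage divided by the current coverage. The helper name `keep_fraction` is made up for illustration, and the result would still need to be passed to whatever read-subsampling tool you use.

```shell
# Fraction of reads to keep to go from the current coverage to a target coverage.
keep_fraction() {
    awk -v cur="$1" -v tgt="$2" 'BEGIN {
        f = tgt / cur
        if (f > 1) f = 1          # cannot keep more reads than exist
        printf "%.3f\n", f
    }'
}

keep_fraction 1000 200   # prints 0.200
```

So going from roughly 1000X to 200X means keeping about one read in five.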


Good point, although I'm not sure that would help. At a macro level, the assembly is correct. Substitution errors are also fairly low, considering these are nanopore reads. The problem is still the indel errors, which are systematic in nanopore reads and cause frameshifts. I'm just surprised that they would affect gene prediction so significantly.


Is there a reason you are depending on gene prediction? There are plenty of E. coli genomes available and aligning to closest one should give you an idea of where genes are (and errors in your assembly).


That's right, although I'm mainly interested in using E. coli as pipeline validation. If I can get decent results with E. coli then I could be more confident going forwards that novel genomes, for which I don't have a reference, would do similarly well.


It would be inappropriate to assume that if the procedure works for a well-known genome, it will work for others, especially unknown ones. Every dataset is going to be unique and will need individual attention (if you are interested in getting accurate assemblies).

Your results with Escherichia already illustrate this.


So I downsampled to 200X, reassembled with Flye and used more specific parameters in Prokka.

--kingdom Bacteria --genus Escherichia --usegenus --addgenes --gram neg

I still get ~10000 CDS in my Prokka output. I guess we can rule out coverage messing up the assembly.


I think it must just be the indel rate in your assembly in that case. I think you'll need to throw some short reads into the mix, but it depends on what the end goal is.


Yep, I think to get accurate gene annotations, I'll have to resort to short reads!


I have not been able to install GLIMMER on Windows 10 and need help with this. I am trying to check for the presence of some known genes, such as ArsD and ArsC, in some bacterial genomes whose sequences I have downloaded in FASTA format. Please advise on how I can achieve this, either with Glimmer or with any other reliable protocol. Thank you.


Please open a new question with as much detail as you can. Answers are reserved for actual answers to the OP question/post.

I'll give you a hint right now though: abandon Windows 10 (install the Linux Subsystem or something equivalent.)
