Question: training GeneMarkS to call proteins from virus genomes
2.1 years ago by
Columbus (Ohio)
Guillermo D. Huerta wrote:

Hi all,

I am trying to self-train GeneMarkS to improve protein calling from virus genomes and transcripts. Could someone tell me whether what I am doing is correct? First, I downloaded eukaryotic virus genomes from NCBI RefSeq to create a "matrix" using

/fs/project/PAS1117/modules/GeneMarkS/3.36/ -euk --name virusgroup1 --gm /virusgroup1_refseq_genomes.fasta

which generated (among many others) the following model files:

virusgroup1_gm_heuristic.mat virusgroup1_gm.mat virusgroup1_hmm_combined.mod virusgroup1_hmm_heuristic.mod virusgroup1_hmm.mod

Then I used the one named "virusgroup1_gm.mat" to run GeneMark against a single virus genome (which theoretically belongs to group 1, so GeneMark should call all of its viral genes correctly):

/fs/project/PAS1117/modules/GeneMarkS/3.36/gm -m group1_gm.mat -l o q -o p -r p -v NC_023420-2.fasta

However, I only get a file named "NC_023420-2.fasta.lst" with a few gene coordinates, but no protein file (even though I set the options for that):

List of Open reading frames predicted as CDSs, shown with alternate starts (regions from start to stop codon w/ coding function >0.50)

Left   Right   DNA      Coding   Avg    Start
end    end     Strand   Frame    Prob   Prob

  42   4046    direct   fr 3     0.60   ....
 195   4046    direct   fr 3     0.60   0.79
 297   4046    direct   fr 3     0.60   0.17
 333   4046    direct   fr 3     0.60   0.10
 537   4046    direct   fr 3     0.61   0.06
 570   4046    direct   fr 3     0.60   0.12

List of Regions of interest (regions from stop to stop codon w/ a signal in between)

LEnd REnd Strand Frame

   21      4046  direct      fr 3
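For anyone who wants to work with the coordinates directly, here is a minimal Python sketch of how the ORF rows in the listing above could be read and the alternate start with the highest start probability picked out. It is not part of GeneMark: the names `Orf`, `ROW` and `parse_orf_rows` are hypothetical, the row layout is assumed from the output pasted above, and the strand label `complement` for the reverse strand is an assumption, since only `direct` appears in this output.

```python
import re
from typing import List, NamedTuple, Optional


class Orf(NamedTuple):
    """One row of the ORF table (hypothetical helper type, not a GeneMark API)."""
    left: int
    right: int
    strand: str
    frame: int
    avg_prob: float
    start_prob: Optional[float]  # None when the listing shows "...."


# Assumed row layout, taken from the pasted listing:
#   left  right  strand  "fr" frame  avg_prob  start_prob
# "complement" as a strand label is an assumption; only "direct" is shown.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(direct|complement)\s+fr\s+(\d)\s+(\d*\.\d+)\s+(\S+)\s*$"
)


def parse_orf_rows(text: str) -> List[Orf]:
    """Collect every line that looks like an ORF row; skip everything else."""
    orfs = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # headers, blank lines, region-of-interest rows
        left, right, strand, frame, avg, start = match.groups()
        try:
            start_prob = float(start)
        except ValueError:
            start_prob = None  # e.g. "...." for the first candidate start
        orfs.append(Orf(int(left), int(right), strand, int(frame), float(avg), start_prob))
    return orfs


# Rows copied from the listing in this post.
listing = """
  42      4046  direct      fr 3   0.60  ....
 195      4046  direct      fr 3   0.60  0.79
 297      4046  direct      fr 3   0.60  0.17
 333      4046  direct      fr 3   0.60  0.10
 537      4046  direct      fr 3   0.61  0.06
 570      4046  direct      fr 3   0.60  0.12

   21      4046  direct      fr 3
"""

orfs = parse_orf_rows(listing)
# Keep only rows that report a start probability, then take the highest.
best = max((o for o in orfs if o.start_prob is not None), key=lambda o: o.start_prob)
print(best.left, best.right, best.start_prob)  # prints: 195 4046 0.79
```

The region-of-interest row (21 to 4046) is skipped on purpose because it has no probability columns. The selected coordinates could then be used to cut the sequence out of the FASTA file and translate it with another tool.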

Can you guess what is wrong?

Thanks in advance, Guillermo

