Question

CEGMA on human assembly

0

Entering edit mode

8.9 years ago

ulike • 0

Dear Biostar users,

I tried to run CEGMA on hg19 but failed to find all CEGs. This is weird as the author was able to find all of them in his paper. Could you suggest what's wrong here? Thanks a lot!

#!/bin/bash
set -e

##add 'scaffold' string to p_ctg.fa sequence IDs
perl -pe 's/^(>\d+)\s/$1scaffold /' human_g1k_v37.fasta > human_g1k_v37.newheader.fasta
source /home/ulike/Downloads/CEGMA_v2-2.5/bin/activate
cegma --genome human_g1k_v37.newheader.fasta -o human_g1k_v37 -threads 12 -v -ext

********************************************************************************
**                    MAPPING PROTEINS TO GENOME (TBLASTN)                    **
********************************************************************************

RUNNING: genome_map  -n genome -p 6 -o 5000 -c 2000 -t 12  -v  /home/ulike/Downloads/CEGMA_v2-2.5/data/kogs.fa human_g1k_v37.newheader.fasta 2>human_g1k_v37.cegma.errors

********************************************************************************
**     MAKING INITIAL GENE PREDICTIONS FOR CORE GENES (GENEWISE + GENEID)     **
********************************************************************************

RUNNING: local_map -n local -f -h /home/ulike/Downloads/CEGMA_v2-2.5/data/hmm_profiles -i KOG -v  genome.chunks.fa 2>human_g1k_v37.cegma.errors
NOTE: created 2140 geneid predictions

********************************************************************************
**           FILTERING INITIAL PROTEINS PRODUCED BY GENEID (HMMER)            **
********************************************************************************

RUNNING: hmm_select -i KOG -o local -t 12 -v  /home/ulike/Downloads/CEGMA_v2-2.5/data/hmm_profiles local.geneid.fa /home/ulike/Downloads/CEGMA_v2-2.5/data/profiles_cutoff.tbl 2>human_g1k_v37.cegma.errors
NOTE: Found 438 geneid predictions with scores above threshold value

********************************************************************************
**       CALCULATING GENEID PARAMETERS FROM SELECTED GENEID PREDICTIONS       **
********************************************************************************

RUNNING: geneid-train -v local.geneid.selected.gff local.geneid.selected.dna geneid_params 2>human_g1k_v37.cegma.errors
RUNNING: make_paramfile /home/ulike/Downloads/CEGMA_v2-2.5/data/self.param.template \
            geneid_params/coding.initial.5.logs geneid_params/coding.transition.5.logs \
            geneid_params/start.logs geneid_params/acc.logs geneid_params/don.logs \
            geneid_params/intron.max >  geneid_params/self.param

********************************************************************************
**                           ACCURATE LOCAL MAPPING                           **
********************************************************************************

RUNNING: local_map -n local_self -g local.genewise.gff -d geneid_params/self.param -h /home/ulike/Downloads/CEGMA_v2-2.5/data/hmm_profiles -i KOG -v  genome.chunks.fa 2>human_g1k_v37.cegma.errors
NOTE: created 2241 geneid predictions

********************************************************************************
**                              FINAL FILTERING                               **
********************************************************************************

RUNNING: hmm_select -i KOG -o local_self -t 12 -v  /home/ulike/Downloads/CEGMA_v2-2.5/data/hmm_profiles local_self.geneid.fa /home/ulike/Downloads/CEGMA_v2-2.5/data/profiles_cutoff.tbl 2>human_g1k_v37.cegma.errors
NOTE: Found 456 geneid predictions with scores above threshold value

********************************************************************************
**         CONVERTING LOCAL COORDINATES INTO GENOME-WIDE COORDINATES          **
********************************************************************************

********************************************************************************
**    EVALUATING RESULTS AND COMPARING TO SET OF 248 HIGHLY CONSERVED CEGS    **
********************************************************************************

RUNNING: completeness -v  local_self.hmm_select.aln /home/ulike/Downloads/CEGMA_v2-2.5/data/completeness_cutoff.tbl > human_g1k_v37.completeness_report
NOTE: Using the more stringent cutoff criteria, 149/248 complete core eukaryotic genes (CEGs) were detected.

********************************************************************************
**                             CEGMA HAS FINISHED                             **
********************************************************************************

grep '>' human_g1k_v37.cegma.dna  | wc -l
456

human_g1k_v37.completeness_report

#      Statistics of the completeness of the genome based on 248 CEGs      #

              #Prots  %Completeness  -  #Total  Average  %Ortho

  Complete      149       60.08      -   381     2.56     71.14

   Group 1       38       57.58      -    83     2.18     63.16
   Group 2       36       64.29      -    74     2.06     63.89
   Group 3       35       57.38      -   104     2.97     80.00
   Group 4       40       61.54      -   120     3.00     77.50

   Partial      233       93.95      -   925     3.97     86.27

   Group 1       58       87.88      -   200     3.45     81.03
   Group 2       53       94.64      -   174     3.28     79.25
   Group 3       60       98.36      -   262     4.37     90.00
   Group 4       62       95.38      -   289     4.66     93.55

#    These results are based on the set of genes selected by Genis Parra   #

#    Key:                                                                  #
#    Prots = number of 248 ultra-conserved CEGs present in genome          #
#    %Completeness = percentage of 248 ultra-conserved CEGs present        #
#    Total = total number of CEGs present including putative orthologs     #
#    Average = average number of orthologs per CEG                         #
#    %Ortho = percentage of detected CEGS that have more than 1 ortholog   #

CEGMA Assembly • 2.5k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 8.9 years ago by ulike • 0

0

Entering edit mode

Did you check how many partial CEGMA genes were there? CEGMA's calculation of gene completeness is questionable so you might be having them as partials. Also, please check how many sequences are in output.cegma.dna file (it should be close to 458). You should probably paste the output.completeness_report file here rather than the run log.

ADD REPLY • link 8.9 years ago by arnstrm ★ 1.8k

0

Entering edit mode

Thanks, @arnstrm. CEGMA actually found 233/248 in the core CEGs or 456/458 for all CEGs, . This looks more reasonable now. However, since human genome is used to built the CEG set, why some genes are missed here?

ADD REPLY • link 8.9 years ago by ulike • 0

0

Entering edit mode

You might want to look at Keith's explanation of the output A: cegma : output report. I'm not sure why 2 genes are missing, but I assume it is due to incomplete assembly (perhaps?).

ADD REPLY • link 8.9 years ago by arnstrm ★ 1.8k

0

Entering edit mode

Thank you, the link helps a lot! I'm using the reference genome (human_g1k_v37.fasta) from 1000 genomes project. I guess this is the best assembly we have for human (except for hg38), right?

ADD REPLY • link updated 14 months ago by Ram 43k • written 8.9 years ago by ulike • 0

0

Entering edit mode

Also, paging @keith!

EDIT: idk how to correctly tag a user, sorry!

EDIT: maybe #keith

ADD REPLY • link updated 14 months ago by Ram 43k • written 8.9 years ago by arnstrm ★ 1.8k