CEGMA report meaning
6.8 years ago
Mehmet ▴ 720

Dear All,

I would like to ask about the CEGMA output report. For example, what is the range of the Average column? I sometimes get 1.66 or 1.22. Which values matter when assessing a genome or a genome assembly?

Thanks.

Assembly genome gene alignment • 3.5k views

There is no good or bad answer to this. The average number of orthologs per predicted CEG might be expected to be much higher in polyploid genomes that have undergone several whole genome duplications. However, it might also be higher due to a genome assembly that has fully resolved heterozygous regions into two contigs. I.e. if gene X is sufficiently different in the two parental genomes (assuming diploid organism) then a genome assembler might assemble this into two separate sequences. This will artificially inflate many of the CEGMA output statistics.

The CEGMA statistics are most useful when you can do one of two things:

1. Compare the output of CEGMA to a different genome assembly from the same underlying data (perhaps one that used a different assembler or different assembly parameters)
2. Use the CEGMA information as part of many different statistics that report on the quality of your genome assembly.
6.8 years ago
arnstrm ★ 1.8k

You should be looking at the output.completeness_report file for interpreting CEGMA results. A sample output is pasted below:

#      Statistics of the completeness of the genome based on 248 CEGs      #
              #Prots  %Completeness  -  #Total  Average  %Ortho

   Complete      217       87.50      -   308     1.42     30.41
   Group 1        55       83.33      -    75     1.36     25.45
   Group 2        51       91.07      -    63     1.24     19.61
   Group 3        51       83.61      -    80     1.57     41.18
   Group 4        60       92.31      -    90     1.50     35.00

   Partial       243       97.98      -   427     1.76     49.38
   Group 1        64       96.97      -    99     1.55     35.94
   Group 2        56      100.00      -    89     1.59     41.07
   Group 3        60       98.36      -   118     1.97     65.00
   Group 4        63       96.92      -   121     1.92     55.56

#    These results are based on the set of genes selected by Genis Parra   #
#    Key:                                                                  #
#    Prots = number of 248 ultra-conserved CEGs present in genome          #
#    %Completeness = percentage of 248 ultra-conserved CEGs present        #
#    Total = total number of CEGs present including putative orthologs     #
#    Average = average number of orthologs per CEG                         #
#    %Ortho = percentage of detected CEGS that have more than 1 ortholog   #

Here, there are 217 complete and 26 partial (i.e., 243 - 217 = 26) core eukaryotic genes (out of a total of 248) present in your assembly. The groups are just the core genes categorized by functional annotation (I guess). Normally, you don't have to worry about %Ortho or Average (i.e., the number of orthologs per gene). It might matter if your genome is polyploid or something like that.
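If you want to check these numbers yourself, here is a minimal Python sketch (not part of CEGMA) that parses the two summary rows of the sample report and recomputes the derived columns. The names `parse_summary`, `ROW` and `N_CEGS` are my own, and the regular expression assumes the whitespace-separated row layout shown in the sample.

```python
import re

# Summary rows copied from the sample completeness report.
REPORT = """\
Complete      217       87.50      -   308     1.42     30.41
Partial      243       97.98      -   427     1.76     49.38
"""

N_CEGS = 248  # size of the core gene set, taken from the report header

# Assumed row layout: label, #Prots, %Completeness, "-", #Total, Average, %Ortho
ROW = re.compile(
    r"^\s*(Complete|Partial)\s+(\d+)\s+([\d.]+)\s+-\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s*$"
)

def parse_summary(text):
    """Return {label: columns} for the Complete and Partial summary rows."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            label, prots, pct, total, avg, ortho = match.groups()
            rows[label] = {
                "prots": int(prots),
                "completeness": float(pct),
                "total": int(total),
                "average": float(avg),
                "ortho": float(ortho),
            }
    return rows

rows = parse_summary(REPORT)

# The Partial row also counts the complete genes, so subtract to get partial-only.
partial_only = rows["Partial"]["prots"] - rows["Complete"]["prots"]

for label, row in rows.items():
    completeness = 100 * row["prots"] / N_CEGS   # %Completeness
    average = row["total"] / row["prots"]        # Average = #Total / #Prots
    print(f"{label}: {completeness:.2f}% of CEGs, {average:.2f} hits per CEG")
print(f"Partial-only CEGs: {partial_only}")
```

This makes it clear that Average is simply #Total divided by #Prots, so it can never be below 1 and rises whenever CEGs are found in more than one copy.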

Another key number you might need is the number of sequences in the output.cegma.dna file (just run grep -c ">" output.cegma.dna). This number tells you how many of the full set of CEGMA genes (the larger set of 458 genes, which includes the 248) are present. In my case it was 453. So, all my report needs is:

243 out of 248, 443 out of 458,  CEGMA genes were predicted in the genome
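If you prefer to do this in a script, a rough Python equivalent of the grep count could look like the sketch below. The function names are hypothetical, and the set sizes (248 and 458) are the ones quoted in this post. Note that it counts only lines that start with ">", which is slightly stricter than grep -c ">".

```python
def count_fasta_records(lines):
    """Count FASTA header lines (those starting with '>') in an iterable of lines."""
    return sum(1 for line in lines if line.startswith(">"))

def count_in_file(path):
    """Count FASTA records in a file such as output.cegma.dna."""
    with open(path) as handle:
        return count_fasta_records(handle)

def summary_line(n_core, n_all, core_size=248, all_size=458):
    """Format the one-line summary; the default set sizes are taken from this post."""
    return (
        f"{n_core} out of {core_size}, {n_all} out of {all_size} "
        "CEGMA genes were predicted in the genome"
    )
```

You would call count_in_file("output.cegma.dna") to get the second number and pass it to summary_line together with the Partial #Prots value from the completeness report.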

I hope this helps.


What is the difference between the 248 CEGs and the 458 CEGMA genes?


Is there an ideal completeness cutoff for a transcriptome? Can we combine the complete and partial detected genes and report them together? I have a CEGMA result for a transcriptome, but I don't know whether the following result is acceptable. I would be very grateful if you could comment on my problem.

COMPLETENESS ASSESSMENT RESULTS:
Total number of core genes queried: 248
Number of core genes detected
  Complete: 187 (75.40%)
  Complete + Partial: 235 (94.76%)
Number of missing core genes: 13 (5.24%)
Average number of orthologs per core gene: 3.13
% of detected core genes that have more than 1 ortholog: 94.12

Regards rahul


Hi

As far as I know there is no specific cutoff value. I recommend you use BUSCO for transcriptome quality assessment, then compare results. You should not combine complete and partial results, as they are different.


Thanks for your suggestion. I have run BUSCO and got the following result. Is there a completeness cutoff for BUSCO?

Completeness Assessment Results:
Total # of core genes queried: 429
# of core genes detected
  Complete: 223 (51.98%)
  Complete + Partial: 327 (76.22%)
# of missing core genes: 102 (23.78%)
Average # of orthologs per core gene: 1.78
% of detected core genes that have more than 1 ortholog: 69.06

Regards, Rahul


Please give more details about your transcriptome. Did you assemble RNA reads, and if so, how? I mean, did you sequence the RNA yourself, or download SRA data and then assemble it? Which k-mer value did you use for the assembly, and which tool (Trinity etc.)? Which BUSCO command and which BUSCO version (v2?) did you use? Which database did you use in the BUSCO command? Eukaryote? Also, did you check for contamination (bacteria, host etc.) using a tool such as Kraken? Then we can discuss this in more detail. Sorry for asking so many questions. In my opinion, you need > 90% completeness to be able to say that a transcriptome is fine for downstream analyses.


Actually, I downloaded the paired-end RNA-seq reads SRA349650 for the assembly. The reads were cleaned with Trimmomatic (adapter cleaning, Q20, max read length 30 bp) and assembled with Trinity (group pairs distance 500 bp, path reinforcement 50 bp, min length 200 bp). The assembled sequences were then passed to CAP3 (overlap 40 bp, 90% identity) and used for CEGMA and BUSCO (BUSCO eukaryotes) (https://gvolante.riken.jp/index.html).


One more thing: did you check for contamination? When I downloaded an SRA dataset, it turned out to contain 20% bacterial reads. I assembled it anyway, but the completeness was very low. Please check for contamination. I recommend the Kraken tool.


Thanks for your comments. I did not check for contamination, but I will check and remove it. I also did not trim the leading bases of the raw reads, even though there was noise in up to the first 12 bp. I tried trimming 12 bp earlier, but the results were much worse.

6.8 years ago
Reema Singh ▴ 160

I hope this helps.

6.8 years ago
Mehmet ▴ 720

Thank you all so much. The core genes are divided into four groups according to their degree of conservation.