Question: C Albicans Genome Count Matrix Total Genes
kpr wrote, 2.3 years ago:

Full Disclosure: I am really new to RNA sequencing.

I am using bowtie, tophat, and htseq to build a counts matrix of reads for my samples. I am using the "chromosomes" file to build my reference genome from CGD. Everything seems to be going well.

My understanding is that there are 6620 total features for haploids. My data set is diploid, which should give me 13,280 total features. However, when I look at my resulting counts matrix, I have approx 12,800 rows. Shouldn't I expect 13,280 rows because each row corresponds to a feature?

Numbers are from: http://www.candidagenome.org/cache/C_albicans_SC5314_genomeSnapshot.html


What do you mean by feature? RNA sequencing typically quantifies against genes or transcripts. Diploid means two copies of each chromosome, and each copy will contain mostly identical features (an allele of a gene is still the same gene). The fact that the table you link to doubles the features confuses me.

written 2.3 years ago by Istvan Albert

h.mon (Brazil) wrote, 18 months ago:

The easiest explanation is that some feature classes are not present in your annotation. If you used a "gene" GTF, some features were probably left out, like "centromere", "repeat_region", etc. As I am not familiar with this genome, I can't say for sure.
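One way to check this is to tally the feature types (column 3) in the annotation file and compare them with the classes listed on the CGD snapshot page. Below is a minimal sketch, assuming a standard nine-column, tab-separated GFF/GTF; the records in `example` are made up for illustration and are not real CGD lines. Note also that htseq-count only counts the feature type selected with its `--type` option (`exon` by default), so other classes never become rows.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Tally column 3 (feature type) of GFF/GTF lines, skipping comments."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or header/comment
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed nine-column record
        counts[fields[2]] += 1
    return counts


# Made-up records for illustration only; not real CGD annotation lines.
example = [
    "##gff-version 3",
    "chrA\tdemo\tgene\t100\t900\t.\t+\t.\tID=geneA1",
    "chrA\tdemo\tgene\t1200\t2000\t.\t-\t.\tID=geneA2",
    "chrA\tdemo\tcentromere\t5000\t8000\t.\t.\t.\tID=cenA",
    "chrA\tdemo\trepeat_region\t9000\t9500\t.\t.\t.\tID=rep1",
]

feature_counts = count_feature_types(example)
for feature_type, n in sorted(feature_counts.items()):
    print(feature_type, n)
```

To run it on a real file, pass an open file handle instead of the list, e.g. `count_feature_types(open(path))` with `path` pointing at your annotation.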

"My understanding is that there are 6620 total features for haploids. My data set is diploid, which should give me 13,280 total features."

The math is not that straightforward: 2 * 6620 = 13240 != 13280. In fact, if you sum up the features per chromosome set (A and B), you will see that chr set A has 6615 features and chr set B has 6613. The 6620 features for the "haploid total" means a few features are found on set A but not on set B, and vice versa. The 13280 comes from chr set A + chr set B + the mitochondrial genome (6615 + 6613 + 52 = 13280).
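The arithmetic can be checked in a few lines. This is only a sketch: it assumes the "haploid total" counts each feature once if it occurs on set A, set B, or both, and that it excludes the mitochondrial features. The variable names are illustrative.

```python
# Feature counts quoted from the CGD genome snapshot table.
set_a = 6615          # features on chromosome set A
set_b = 6613          # features on chromosome set B
mito = 52             # mitochondrial features
haploid_total = 6620  # assumed: features on A or B, each counted once

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
shared = set_a + set_b - haploid_total
only_a = set_a - shared
only_b = set_b - shared

grand_total = set_a + set_b + mito
naive_total = 2 * haploid_total

print("shared between A and B:", shared)
print("only on A:", only_a)
print("only on B:", only_b)
print("A + B + mito:", grand_total)
print("2 * haploid total:", naive_total)
```

Under that assumption, the 40-feature gap between 13280 and 13240 is the 52 mitochondrial features minus the 12 features that occur on only one chromosome set.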

All in all, a rather confusing genome and annotation, for people used to "normal" haploid reference genomes. You should read carefully the papers describing the assembly and annotation of this genome, and any papers describing updates.

Bear in mind that this will have a (minor) implication for your later question, "Generate a counts matrix with paired-end non-stranded samples".


The Candida community are a bit weird in their gene annotations, as they have independent gene IDs for the genes on each homologous chromosome of the diploid. This is presumably because at least lab strains cannot reproduce sexually, and so the two copies can diverge independently and each copy is on a fixed haplotype.

Chr set A contains 7 genes not found on Chr set B, and Chr set B contains 5 genes not found on A. Presumably these have been lost in a deletion event at some point. In addition there are the 52 mitochondrial genes (which of course actually exist at a ploidy way in excess of diploid, but are only annotated once) and the rDNA array "RDN1", which is only listed once.

I'd be very careful about your mapping settings when mapping to this because I guess almost all reads will map to more than one location, and this may confuse your mapper depending on the settings.
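A quick way to see how much of a problem this is in practice is to measure the fraction of aligned reads that the mapper reports at more than one location. The sketch below reads SAM text lines and uses the optional `NH:i` tag (number of reported alignments), assuming your aligner writes it; records without the tag are treated as unique here, which is an assumption of this example. The SAM records are made up for illustration.

```python
def multimapped_fraction(sam_lines):
    """Fraction of primary, mapped SAM records whose NH tag is greater than 1."""
    total = 0
    multi = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4 or flag & 0x100:
            continue  # skip unmapped reads and secondary alignments
        total += 1
        for tag in fields[11:]:
            if tag.startswith("NH:i:") and int(tag[5:]) > 1:
                multi += 1
                break
    return multi / total if total else 0.0


# Made-up records: r1 maps once, r2 maps to both chromosome sets, r3 is unmapped.
example_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1A\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1",
    "r2\t0\tchr1A\t200\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:2",
    "r2\t256\tchr1B\t200\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

fraction = multimapped_fraction(example_sam)
print(fraction)
```

With paired-end data each mate is its own record, so this is a per-record rather than a per-fragment fraction. If the value is close to 1 when mapping against both chromosome sets together, most reads are ambiguous between A and B, and a counting tool that discards non-unique alignments will report very low counts.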

written 18 months ago by i.sudbery

Thanks i.sudbery! I ended up separating the A and B alleles and mapping to them separately. This appears to have solved the issue. I don't have a lot of experience with phased genomes. Any recommendations on where to find an example pipeline for them?

written 18 months ago by kpr

"Chr set A contains 7 genes not found on Chr set B, and Chr set B contains 5 genes not found on A."

As i.sudbery explained, you will be missing 5 or 7 genes, depending on which haplotype you selected. This is about 0.1% of the total genes; I could live with that. But if you want to be a perfectionist, you could add the missing gene sequences to your haploid reference set. You will have to find them, though.
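Finding the haplotype-specific genes can be sketched as a set difference over gene IDs. The example assumes that the two alleles of a gene share an ID that differs only in a trailing `_A` or `_B` suffix; that naming scheme and the IDs below are hypothetical, so check the convention your annotation actually uses.

```python
def base_id(gene_id):
    """Strip a trailing _A or _B haplotype suffix, if present (assumed naming)."""
    if gene_id.endswith(("_A", "_B")):
        return gene_id[:-2]
    return gene_id


def haplotype_specific(ids_a, ids_b):
    """Return (genes only on set A, genes only on set B) as sorted base IDs."""
    base_a = {base_id(g) for g in ids_a}
    base_b = {base_id(g) for g in ids_b}
    return sorted(base_a - base_b), sorted(base_b - base_a)


# Hypothetical gene IDs for illustration only.
ids_a = ["gene1_A", "gene2_A", "gene3_A"]
ids_b = ["gene1_B", "gene3_B", "gene4_B"]

only_in_a, only_in_b = haplotype_specific(ids_a, ids_b)
print("only on A:", only_in_a)
print("only on B:", only_in_b)
```

The genes reported for the haplotype you did not map to are the ones whose sequences would need to be added to the reference.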

written 18 months ago by h.mon

h.mon you have been such a great help. I really appreciate it. :)

written 18 months ago by kpr