Question

C Albicans Genome Count Matrix Total Genes

1

8.0 years ago

kpr ▴ 80

Full Disclosure: I am really new to RNA sequencing.

I am using Bowtie, TopHat, and HTSeq to build a counts matrix of reads for my samples. I built my reference genome from the "chromosomes" file from CGD. Everything seems to be going well.

My understanding is that there are 6620 total features for haploids. My data set is diploid, which should give me 13,280 total features. However, my resulting counts matrix has approximately 12,800 rows. Since each row corresponds to a feature, shouldn't I expect 13,280 rows?

Numbers are from: http://www.candidagenome.org/cache/C_albicans_SC5314_genomeSnapshot.html

RNA-Seq sequence gene • 2.1k views

Updated 7.3 years ago by h.mon 35k • written 8.0 years ago by kpr ▴ 80

0

What do you mean by feature? RNA-seq reads are typically quantified against genes or transcripts. Diploid means two copies of each chromosome, and each copy contains mostly identical features (an allele of a gene is still the same gene). The fact that the table you link to doubles the features confuses me.

8.0 years ago by Istvan Albert 102k

score 1 · Answer 1 · 2018-03-28

1

7.3 years ago

h.mon 35k

The easiest explanation is that some feature classes are not present in your annotation. If you used a "gene" GTF, some features were probably left out, such as "centromere", "repeat_region", etc. As I am not familiar with this genome, I can't say for sure.
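One way to check which feature classes an annotation actually contains is to tally its third column. The following is a minimal sketch, assuming a standard tab-separated, nine-column GFF/GTF file; the function name `count_feature_types` and the example records are hypothetical and not taken from the CGD annotation.

```python
from collections import Counter

def count_feature_types(lines):
    """Tally the feature type (column 3) of each GFF/GTF record."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip anything that is not a full nine-column record.
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts

# Hypothetical example records, for illustration only.
example = [
    "##gff-version 3",
    "chrA\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chrA\tsrc\tgene\t200\t300\t.\t-\t.\tID=g2",
    "chrA\tsrc\tcentromere\t400\t500\t.\t.\t.\tID=c1",
]
feature_counts = count_feature_types(example)
print(feature_counts)
```

With a real annotation, an open file handle can be passed in place of the `example` list. Comparing the resulting tally with the genome snapshot table should show which classes are absent.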

"My understanding is that there are 6620 total features for haploids. My data set is diploid, which should give me 13,280 total features."

The math is not that straightforward: 2 * 6620 = 13240 != 13280. In fact, if you sum the features per chromosome set (A and B), you will see that chr set A has 6615 features and chr set B has 6613. The "haploid total" of 6620 means that a few features are found on set A but not on set B, and vice versa. The 13280 comes from chr set A + chr set B + the mitochondrial genome (6615 + 6613 + 52 = 13280).
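The arithmetic can be checked in a few lines. This sketch only uses the counts quoted in this answer; the variable names are illustrative.

```python
# Feature counts quoted in this answer (from the CGD genome snapshot).
set_a = 6615          # features on chromosome set A
set_b = 6613          # features on chromosome set B
mito = 52             # mitochondrial features
haploid_total = 6620  # "haploid total" from the snapshot

naive_total = 2 * haploid_total          # simply doubling the haploid count
snapshot_total = set_a + set_b + mito    # how the 13280 is actually built

print(naive_total)     # 13240
print(snapshot_total)  # 13280
```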

All in all, this is a rather confusing genome and annotation for people used to "normal" haploid reference genomes. You should carefully read the papers describing the assembly and annotation of this genome, and any papers describing updates.

Bear in mind that this will have a (minor) implication for your later question, "Generate a counts matrix with paired-end non-stranded samples".

2

The Candida community are a bit weird in their gene annotations, as they have independent gene IDs for the copies of a gene on each homologous chromosome of the diploid. This is presumably because lab strains, at least, cannot reproduce sexually, so the two copies can diverge independently and each copy stays on a fixed haplotype.

Chr set A contains 7 genes not found on Chr set B, and Chr set B contains 5 genes not found on A. Presumably these have been lost in a deletion event at some point. In addition, there are the 52 mitochondrial genes (which of course actually exist at a copy number well in excess of diploid, but are only annotated once) and the rDNA array "RDN1", which is only listed once.
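These numbers are consistent with the per-set totals given in the answer above (6615 for set A, 6613 for set B, 6620 for the haploid total). A small sketch of the check, taking the counts in this thread at face value; the variable names are illustrative.

```python
# Counts quoted in this thread.
set_a, set_b = 6615, 6613      # features per chromosome set
only_a, only_b = 7, 5          # genes found on one set but not the other
haploid_total = 6620

# Genes shared by both sets, computed from each side independently.
shared_from_a = set_a - only_a
shared_from_b = set_b - only_b

# Distinct genes across both sets: shared ones plus those unique to each set.
distinct = shared_from_a + only_a + only_b

print(shared_from_a, shared_from_b)  # 6608 6608
print(distinct)                      # 6620
```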

I'd be very careful about your mapping settings when mapping to this reference. Because the A and B copies are mostly identical, I guess almost all reads will map to more than one location, and this may confuse your mapper depending on the settings.

7.3 years ago by i.sudbery 21k

0

Thanks i.sudbery! I ended up separating the A and B alleles and mapping to them separately. This appears to have solved the issue. I don't have a lot of experience with phased genomes. Any recommendations on finding an example pipeline for them?
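Separating the two haplotypes can be sketched as a filter on the sequence name in column 1 of the annotation. This is a minimal sketch, not a full pipeline. It assumes that the sequence names carry the haplotype letter right after the chromosome name (as in `Ca22chr1A_C_albicans_SC5314`); that naming pattern, the regular expression, and the function names are assumptions to verify against your own files.

```python
import re

# Assumption: the haplotype letter follows the chromosome name, e.g. "chr1A_" or "chrRB_".
HAPLOTYPE_RE = re.compile(r"chr[0-9R]+([AB])_")

def haplotype_of(seqid):
    """Return "A" or "B" for a haplotype-tagged sequence name, else "other"."""
    match = HAPLOTYPE_RE.search(seqid)
    return match.group(1) if match else "other"

def split_by_haplotype(lines):
    """Group tab-separated annotation records by the haplotype of column 1."""
    groups = {"A": [], "B": [], "other": []}
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        seqid = line.split("\t", 1)[0]
        groups[haplotype_of(seqid)].append(line)
    return groups

# Hypothetical example records, for illustration only.
records = [
    "Ca22chr1A_C_albicans_SC5314\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1_A",
    "Ca22chr1B_C_albicans_SC5314\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1_B",
    "Ca22chrRA_C_albicans_SC5314\tsrc\tgene\t5\t90\t.\t-\t.\tID=g2_A",
    "Ca22chrM_C_albicans_SC5314\tsrc\tgene\t1\t50\t.\t+\t.\tID=m1",
]
groups = split_by_haplotype(records)
print({name: len(recs) for name, recs in groups.items()})
```

Records that match neither haplotype (here the mitochondrial one) land in "other" and would need to be added back to whichever single-haplotype reference is used.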

7.3 years ago by kpr ▴ 80

1

"Chr set A contains 7 genes not found on Chr set B, and Chr set B contains 5 genes not found on A."

As i.sudbery explained, you will be missing 5 or 7 genes, depending on which haplotype you selected. This is about 0.1% of the total genes, and I could live with that. But if you want to be a perfectionist, you could add the missing gene sequences to your haploid reference set. You will have to find them, though.
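Finding the missing genes can be sketched as a set difference on gene IDs. This is a minimal sketch, assuming the IDs of the two copies differ only by a trailing `_A` or `_B` suffix; that convention, the helper name `base_ids`, and the example IDs are hypothetical and should be checked against the actual annotation.

```python
def base_ids(gene_ids, suffix):
    """Return the IDs ending in `suffix`, with that suffix stripped."""
    return {g[: -len(suffix)] for g in gene_ids if g.endswith(suffix)}

# Hypothetical gene IDs, for illustration only.
gene_ids = ["C1_00010W_A", "C1_00010W_B", "C1_00020C_A", "C1_00030W_B"]

on_a = base_ids(gene_ids, "_A")
on_b = base_ids(gene_ids, "_B")

only_on_a = sorted(on_a - on_b)  # annotated on set A but not on set B
only_on_b = sorted(on_b - on_a)  # annotated on set B but not on set A

print(only_on_a)  # ['C1_00020C']
print(only_on_b)  # ['C1_00030W']
```

The genes listed for the haplotype you did not select are the ones whose sequences would need to be added to the reference.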