Hello,
I ran cufflinks on 18 different replicates like this:
time cufflinks -p 8 accepted_hits.bam -g ../mm10genes.gtf
and then transcript counting with bedtools.
The columns are "transcript gene" followed by counts for the 18 samples, in this order: 3 female control, 3 female condition 1, 3 female condition 2, 3 male control, 3 male condition 1, 3 male condition 2,
and I am at a loss as to why I get:
con@ubuntu:~/RSAB-BPA/basic/Fastq/FGC1037_s_5_GTGAAA$ grep Xist ../gene.count
NR_001463 Xist 21993 11604 6711 14790 6150 0 5450 2974 43627 9 52 1 66 5 52 18 247 32
NR_001570 Xist 13192 6975 4059 8543 3464 0 3289 1773 26716 8 30 1 35 2 35 10 152 16
con@ubuntu:~/RSAB-BPA/basic/Fastq/FGC1037_s_5_GTGAAA$ grep Gapdh ../gene.count
NM_001289726 Gapdh 50279 4550 3184 22663 2130 0 7702 4332 37018 6114 4938 4493 5059 9287 3420 6147 9016 8596
NM_008084 Gapdh 50290 4551 3184 22662 2129 0 7703 4333 37030 6115 4937 4495 5059
con@ubuntu:~/RSAB-BPA/basic/Fastq/FGC1037_s_5_GTGAAA$ grep Actb ../gene.count
NM_007393 Actb 202625 20199 12112 76439 8610 0 49824 26691 133110 62113 17415 16526 23621 36352 17794 27867 36203 43491
but the gene expression looks fine in a genome browser. I don't understand what I could be doing wrong or what I should be looking for.
I have two questions:
- Why do so many genes show 0 expression?
- Why do some replicates show consistently higher expression of certain genes?
-DEC
Given that the same sample has 0 counts for all of the genes you showed, have you checked whether it has problems? My guess is that this sample clusters far away from everything else and should probably just be excluded.
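A quick way to confirm which column is empty is to total each sample across rows. This is only a minimal sketch in Python: the three rows are copied from the grep output in the question, the function names are made up for illustration, and in practice you would read the whole gene.count file rather than three lines.

```python
# Rows copied from the grep output in the question: transcript, gene, 18 counts.
ROWS = """\
NR_001463 Xist 21993 11604 6711 14790 6150 0 5450 2974 43627 9 52 1 66 5 52 18 247 32
NM_001289726 Gapdh 50279 4550 3184 22663 2130 0 7702 4332 37018 6114 4938 4493 5059 9287 3420 6147 9016 8596
NM_007393 Actb 202625 20199 12112 76439 8610 0 49824 26691 133110 62113 17415 16526 23621 36352 17794 27867 36203 43491
"""


def parse_counts(text, n_samples=18):
    """Parse whitespace-separated rows; skip rows without exactly n_samples counts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != n_samples + 2:
            continue  # truncated or malformed row
        rows.append((fields[0], fields[1], [int(x) for x in fields[2:]]))
    return rows


def sample_summary(rows):
    """Return per-sample totals and the fraction of rows with a zero count."""
    n = len(rows[0][2])
    totals = [sum(r[2][i] for r in rows) for i in range(n)]
    zero_frac = [sum(1 for r in rows if r[2][i] == 0) / len(rows) for i in range(n)]
    return totals, zero_frac


rows = parse_counts(ROWS)
totals, zero_frac = sample_summary(rows)

# 1-based sample numbers whose total across all parsed rows is zero.
suspect = [i + 1 for i, t in enumerate(totals) if t == 0]
print("samples with no counts at all:", suspect)
```

With the rows above, the only column flagged is the sixth, which would be the third female condition 1 replicate if the column order given in the question holds. A sample that is zero across the whole file points to a problem with that sample's BAM or the counting step rather than to biology.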
As Devon suggested, for studies with tens or hundreds of samples it is always better to cluster the samples and identify outliers before any further analysis. The discrepancy in read counts could be due to one of several factors: 1) differences in sequencing depth, or 2) differences in the complexity of the RNA-seq libraries. For your housekeeping genes, only one sample gives zero counts. There is considerable variation across the other samples for the housekeeping genes, but it could be purely due to differences in sequencing depth. I would try normalizing these samples with the TMM method and see whether it helps reduce the variation in expression for the housekeeping genes.
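To see how much of the spread is just depth, you can scale each count by its sample's library size. The sketch below does plain counts-per-million scaling in Python, which is a simpler stand-in and not TMM itself (TMM is available in edgeR via calcNormFactors). The function name is hypothetical, and the library sizes here are summed over only the two rows shown, purely to keep the example self-contained; real library sizes should be the column totals over the whole gene.count file.

```python
import math

# Housekeeping rows copied from the question (one transcript per gene).
COUNTS = {
    "Gapdh": [50279, 4550, 3184, 22663, 2130, 0, 7702, 4332, 37018,
              6114, 4938, 4493, 5059, 9287, 3420, 6147, 9016, 8596],
    "Actb": [202625, 20199, 12112, 76439, 8610, 0, 49824, 26691, 133110,
             62113, 17415, 16526, 23621, 36352, 17794, 27867, 36203, 43491],
}


def cpm(counts_by_gene, lib_sizes):
    """Scale raw counts to counts per million; empty libraries give NaN."""
    scaled = {}
    for gene, counts in counts_by_gene.items():
        scaled[gene] = [
            c * 1e6 / size if size > 0 else float("nan")
            for c, size in zip(counts, lib_sizes)
        ]
    return scaled


# ASSUMPTION: stand-in library sizes from the two rows above only.
# Replace with per-sample totals over every row of gene.count.
lib_sizes = [sum(col) for col in zip(*COUNTS.values())]

scaled = cpm(COUNTS, lib_sizes)
for gene, values in scaled.items():
    shown = ["NA" if math.isnan(v) else f"{v:.0f}" for v in values]
    print(gene, " ".join(shown))
```

If the housekeeping genes still differ a lot between samples after scaling by the real library sizes, depth alone does not explain the variation and library complexity or composition becomes the more likely cause, which is where TMM is meant to help.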