I have RNA-seq data and I aligned it with tophat to human genome reference from UCSC. I also download the human gene transcript file from UCSC. After I do the alignment with tophat, I tried to follow this method: http://www.bioconductor.org/help/workflows/rnaseqGene/
I already get the count reads matrix from my data after I perform summarizeAlignment. The problem is, I need to check for several specific gene (I checked the gene id from UCSC gene browser) and I get all 0 value from the count reads matrix for my specific genes. I believe if the count reads is 0 there is nothing to analyze but I think it is impossible because I know that for that gene, it will be expressed quite a lot from that tissue. I think I made a mistake in my procedure even though I followed exactly as the example. Can you all give some suggestion at what point probably I made a mistake? My code is exactly the same as the example. Thank you.
Start by looking at the alignments in IGV or a similar tool. That's the simplest way to be sure that you should be getting counts. If the counts you expect from IGV don't match those you're getting in R then you'll need to show what commands you're typing in R.
Ok. I'm checking right now with IGV. But I still don't understand how to read this chart. But I think it's not 0 count.
Edit: I take a screenshot in this link https://www.dropbox.com/s/smbdvj7uxclhvhz/igv_snapshot.png?dl=0
My gene target is UGT1A1. It seems it has some read count.
Indeed, you should get something then. As mark.ziemann said, the most common cause of this problem is that you have BAM and GTF files that use different chromosome names. For UCSC, all chromosome names should start with "chr". For the BAM file, just do a quick
samtools view -H something.bamand ensure that the
@SQlines all have appropriate chromosome names. Presuming you're using a
GRangesobject of some sort with
summarizeOverlaps()in R, then just use the
seqlevels()accessor to ensure that the chromosome names are the same as in the BAM file.
Thank you for your suggestion. I already check it and the chromosome name start with chr. I think I use the correct UCSC GTF file. I recheck with seqinfo command inside R to make sure and this is the result:
I also open the GTF file I used, this is the example
And for seqlevels result, there is no problem too
I assume the step to prepare Bam files and read the GTF file in R is correct. So, this is the code to call
Is there any problem with that? I know the sequence is single end but I'm notsure for strand parameter. I tried both TRUE and FALSE it still result in 0 count.
I saw this picture of how counting mode is done
Do you think my problem came from this ambiguous result? After I check my gene, the exon is actually used by several gene from the same gene group (UGT1A). In IGV, I noticed that the exon used to transcript UGT1A1 also used by several UGT1A variant. Actually, I remember when I tried the cufflinks workflow, the FPKM I get was the combination for all UGT1A variant, not the single UGT1A1 gene only. Probably you have some suggestion how to handle this?
The gene you're interested in doesn't appear to overlap anything else. Give htseq-count or featureCounts a try. They can both also produce debugging information in the form of a SAM/BAM file with each alignment annotated as to how it was counted (or not) and why.
Thank you. I will try htseq-count and featureCounts for comparison. But I have a question according to the overlapped gene. I browse my gene locus in UCSC genome browser and it seems 4 exon of my gene is also used by another UGT1A gene. Do you think this will cause some problem? Actually, I already tried another method via cufflinks but the cufflink workflow result is a bit confusing. They mapped the FPKM of the locus for all of the UGT1A. So, when I tried to make a comparison chart using cummeRbund, the FPKM result is the total of the UGT1A* gene. Because I want to check only UGT1A1, I don't know what to do because I think separating the FPKM result would be really hard and I tried to move to count reads.
htseq-count/featureCounts/etc. generally produce counts for genes, not transcripts. If you want a reliable transcript expression metric then your best bet is rsem, eXpress, etc.
What is the difference between genes count and transcripts count? I want to compare the change fold of gene expression level and use that level to study the correlation between SNP mutation and effect to expression level. Probably you have some suggestion what is the best way to do this?
Gene and transcript counts are exactly what they sound like, counts of alignments to them. If you just want gene-level metrics then use gene-level counts.
Thank you. The result from HTSeq is out. Most of the result is ambigous. I tried the intersection strict and union and the result is around the same number. So, The read in my BAM file seems overlapped between several genes like in the picture with ambiguous result for union and intersection strict. From this genome browser link: http://goo.gl/2SejDU the UGT1A gene family share many exon for different gene. It seems the count application find it ambiguous because of that. The result from cuffdiff (FPKM values) also in the form of accumulation of all UGT1A gene family. And now I'm confused how to interpret this.
That's a weird case. You could use something like eXpress to get reliable estimated counts using an expectation maximization procedure, but I'm surprised those genes are joined into a single gene with multiple isoforms, given the shared exons. I assume there's a biological reason for this, but I'm not familiar with that family of genes.
Are you 100% sure the genome fasta file and the GTF file correspond to the same genome build? A slight difference in the coordinates would explain the lack of assigned reads.
Thank you for your reply. I'm sure when I aligned the BAM file, I use both from UCSC. My guess is I think I made a mistake when I call the summarizeOverlaps function. I wrote the code in above reply. Probably you notice something wrong?
I see the GTF file is based on hg38 build of the human genome. Is this consistent with the fasta file?