Count read with summarizeOverlaps result 0 for all sample
0
0
Entering edit mode
7.9 years ago
bharata1803 ▴ 550

Hello,

I have RNA-seq data and I aligned it with tophat to human genome reference from UCSC. I also download the human gene transcript file from UCSC. After I do the alignment with tophat, I tried to follow this method: http://www.bioconductor.org/help/workflows/rnaseqGene/

I already get the count reads matrix from my data after I perform summarizeAlignment. The problem is, I need to check for several specific gene (I checked the gene id from UCSC gene browser) and I get all 0 value from the count reads matrix for my specific genes. I believe if the count reads is 0 there is nothing to analyze but I think it is impossible because I know that for that gene, it will be expressed quite a lot from that tissue. I think I made a mistake in my procedure even though I followed exactly as the example. Can you all give some suggestion at what point probably I made a mistake? My code is exactly the same as the example. Thank you.

RNA-Seq • 3.6k views
0
Entering edit mode

Start by looking at the alignments in IGV or a similar tool. That's the simplest way to be sure that you should be getting counts. If the counts you expect from IGV don't match those you're getting in R then you'll need to show what commands you're typing in R.

0
Entering edit mode

Ok. I'm checking right now with IGV. But I still don't understand how to read this chart. But I think it's not 0 count.

Edit: I take a screenshot in this link https://www.dropbox.com/s/smbdvj7uxclhvhz/igv_snapshot.png?dl=0

My gene target is UGT1A1. It seems it has some read count.

0
Entering edit mode

Indeed, you should get something then. As mark.ziemann said, the most common cause of this problem is that you have BAM and GTF files that use different chromosome names. For UCSC, all chromosome names should start with "chr". For the BAM file, just do a quick samtools view -H something.bam and ensure that the @SQ lines all have appropriate chromosome names. Presuming you're using a GRanges object of some sort with summarizeOverlaps() in R, then just use the seqlevels() accessor to ensure that the chromosome names are the same as in the BAM file.

0
Entering edit mode

Thank you for your suggestion. I already check it and the chromosome name start with chr. I think I use the correct UCSC GTF file. I recheck with seqinfo command inside R to make sure and this is the result:

seqnames             seqlengths isCircular genome
chr1                  248956422       <NA>   <NA>
chr10                 133797422       <NA>   <NA>
chr10_GL383545v1_alt     179254       <NA>   <NA>
chr10_GL383546v1_alt     309802       <NA>   <NA>
chr10_KI270824v1_alt     181496       <NA>   <NA>


I also open the GTF file I used, this is the example

chr1    hg38_knownGene    exon    11874    12227    0.000000    +    .    gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1    hg38_knownGene    exon    12613    12721    0.000000    +    .    gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1    hg38_knownGene    exon    13221    14409    0.000000    +    .    gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1    hg38_knownGene    exon    11874    12227    0.000000    +    .    gene_id "uc001aab.3"; transcript_id "uc001aab.3";
chr1    hg38_knownGene    exon    12646    12697    0.000000    +    .    gene_id "uc001aab.3"; transcript_id "uc001aab.3";
chr1    hg38_knownGene    exon    13221    14409    0.000000    +    .    gene_id "uc001aab.3"; transcript_id "uc001aab.3";
chr1    hg38_knownGene    start_codon    12190    12192    0.000000    +    .    gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";


And for seqlevels result, there is no problem too

[1] "chr1"                 "chr10"                "chr10_GL383545v1_alt" "chr10_GL383546v1_alt" "chr10_KI270824v1_alt"
[6] "chr10_KI270825v1_alt" "chr11"                "chr11_GL383547v1_alt" "chr11_JH159136v1_alt" "chr11_JH159137v1_alt"
[11] "chr11_KI270826v1_alt" "chr11_KI270827v1_alt" "chr11_KI270829v1_alt" "chr11_KI270830v1_alt" "chr11_KI270831v1_alt"
[16] "chr11_KI270832v1_alt" "chr11_KI270902v1_alt" "chr11_KI270903v1_alt" "chr11_KI270927v1_alt" "chr12"
[21] "chr12_GL383549v1_alt" "chr12_GL383550v2_alt" "chr12_GL383551v1_alt" "chr12_GL383552v1_alt" "chr12_GL383553v2_alt"


I assume the step to prepare Bam files and read the GTF file in R is correct. So, this is the code to call summarizeOverlaps:

library(GenomicFeatures)

txdb <- makeTxDbFromGFF("gtf/ucscGTF.gtf",format="gtf")
genes <- exonsBy(txdb, by="gene")

library(GenomicAlignments)

se <- summarizeOverlaps(features=genes,       reads=bamObject,mode="Union",singleEnd=TRUE,ignore.strand=FALSE,fragments=FALSE)


Is there any problem with that? I know the sequence is single end but I'm notsure for strand parameter. I tried both TRUE and FALSE it still result in 0 count.

0
Entering edit mode

I saw this picture of how counting mode is done

Do you think my problem came from this ambiguous result? After I check my gene, the exon is actually used by several gene from the same gene group (UGT1A). In IGV, I noticed that the exon used to transcript UGT1A1 also used by several UGT1A variant. Actually, I remember when I tried the cufflinks workflow, the FPKM I get was the combination for all UGT1A variant, not the single UGT1A1 gene only. Probably you have some suggestion how to handle this?

0
Entering edit mode

The gene you're interested in doesn't appear to overlap anything else. Give htseq-count or featureCounts a try. They can both also produce debugging information in the form of a SAM/BAM file with each alignment annotated as to how it was counted (or not) and why.

0
Entering edit mode

Thank you. I will try htseq-count and featureCounts for comparison. But I have a question according to the overlapped gene. I browse my gene locus in UCSC genome browser and it seems 4 exon of my gene is also used by another UGT1A gene. Do you think this will cause some problem? Actually, I already tried another method via cufflinks but the cufflink workflow result is a bit confusing. They mapped the FPKM of the locus for all of the UGT1A. So, when I tried to make a comparison chart using cummeRbund, the FPKM result is the total of the UGT1A* gene. Because I want to check only UGT1A1, I don't know what to do because I think separating the FPKM result would be really hard and I tried to move to count reads.

0
Entering edit mode

htseq-count/featureCounts/etc. generally produce counts for genes, not transcripts. If you want a reliable transcript expression metric then your best bet is rsem, eXpress, etc.

0
Entering edit mode

What is the difference between genes count and transcripts count? I want to compare the change fold of gene expression level and use that level to study the correlation between SNP mutation and effect to expression level. Probably you have some suggestion what is the best way to do this?

0
Entering edit mode

Gene and transcript counts are exactly what they sound like, counts of alignments to them. If you just want gene-level metrics then use gene-level counts.

0
Entering edit mode

Thank you. The result from HTSeq is out. Most of the result is ambigous. I tried the intersection strict and union and the result is around the same number. So, The read in my BAM file seems overlapped between several genes like in the picture with ambiguous result for union and intersection strict. From this genome browser link: http://goo.gl/2SejDU the UGT1A gene family share many exon for different gene. It seems the count application find it ambiguous because of that. The result from cuffdiff (FPKM values) also in the form of accumulation of all UGT1A gene family. And now I'm confused how to interpret this.

0
Entering edit mode

That's a weird case. You could use something like eXpress to get reliable estimated counts using an expectation maximization procedure, but I'm surprised those genes are joined into a single gene with multiple isoforms, given the shared exons. I assume there's a biological reason for this, but I'm not familiar with that family of genes.

0
Entering edit mode

Are you 100% sure the genome fasta file and the GTF file correspond to the same genome build? A slight difference in the coordinates would explain the lack of assigned reads.

0
Entering edit mode

Thank you for your reply. I'm sure when I aligned the BAM file, I use both from UCSC. My guess is I think I made a mistake when I call the summarizeOverlaps function. I wrote the code in above reply. Probably you notice something wrong?

0
Entering edit mode

I see the GTF file is based on hg38 build of the human genome. Is this consistent with the fasta file?

0
Entering edit mode
I mapped to IGV with Hg38, it is correct.