highly variable FPKMs on small nucleolar RNAs and conf_lo=0 in cufflinks output??
9.8 years ago

I'm using Galaxy to analyze some mouse RNA-seq samples. I'm encountering a strange issue that I think originates in Cufflinks. This is a screenshot of my Cufflinks settings. I'm getting wildly different FPKM values across samples for a bunch of Snords and Mirs. For example, the following genes are very high in one sample and apparently absent in the other three:

Snord34
Snord2
Snord73a
Mir707
Mirlet7i
Snord35b
Mir3096b
Snora68
Mir692-1

If I inspect the BAM file of accepted hits generated by TopHat2, I can see that read coverage in the samples that gave an FPKM of 0 is actually quite high and similar to coverage in the sample that gave an FPKM of 2700. So it looks like TopHat is doing a good job of placing the reads, but Cufflinks is assigning FPKM values improperly.

Also, I've noticed that the conf_lo and conf_hi values are a bit goofy. For example, the FPKM, FPKM_conf_lo, and FPKM_conf_hi values for Snord34 in two of the four samples are:

           FPKM    FPKM_conf_lo    FPKM_conf_hi
sample 1   0       0               0
sample 2   2704.34 0               2.56167

So the issue is that when I look for differentially regulated genes, these sorts of genes pop up and I have to inspect them all manually in IGV to see whether the counts really differ. I should say that each sample seems to have its own signature of weird Snords and other genes. Put another way, there are about 10 genes, usually small nucleolar RNAs, that give FPKMs of >2000 for one sample and 0 for the other three, and a different set of about 10 genes that does the same for another sample.
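One way to cut down on the manual IGV inspection might be to flag rows where the reported FPKM falls outside its own confidence interval, as in the Snord34 row for sample 2 above. The following is only a minimal sketch: it assumes a tab-separated table with the columns `FPKM`, `FPKM_conf_lo`, and `FPKM_conf_hi` as shown above plus `gene_short_name`, and the `sample` column and the function name are hypothetical additions for illustration.

```python
import csv
import io


def flag_inconsistent_intervals(tsv_text):
    """Return (sample, gene) pairs whose FPKM lies outside [conf_lo, conf_hi].

    Assumes a header row with the columns: sample, gene_short_name,
    FPKM, FPKM_conf_lo, FPKM_conf_hi (the 'sample' column is hypothetical).
    """
    flagged = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        fpkm = float(row["FPKM"])
        lo = float(row["FPKM_conf_lo"])
        hi = float(row["FPKM_conf_hi"])
        # A point estimate outside its own interval is worth a manual look.
        if not (lo <= fpkm <= hi):
            flagged.append((row["sample"], row["gene_short_name"]))
    return flagged


# The two Snord34 rows quoted in the question.
example = (
    "sample\tgene_short_name\tFPKM\tFPKM_conf_lo\tFPKM_conf_hi\n"
    "sample 1\tSnord34\t0\t0\t0\n"
    "sample 2\tSnord34\t2704.34\t0\t2.56167\n"
)

print(flag_inconsistent_intervals(example))
```

This only identifies suspicious rows; it does not explain why the estimate and its interval disagree.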

cufflinks galaxy RNA-Seq • 2.6k views

What happens if you just directly count reads with either htseq-count or featureCounts? Those are both easier to debug than an expectation-maximization algorithm like the one in Cufflinks. Perhaps many of those reads are ambiguously mapped, so the difference you see between the samples is simply due to slight differences in the uniquely aligned reads pulling the EM algorithm in one direction or another, though this is just wild speculation. Looking at the raw counts produced by one of the aforementioned programs should help elucidate things.
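Once raw counts are available, a naive FPKM can be computed by hand for comparison. The sketch below uses the standard definition (fragments per kilobase of feature per million mapped fragments). It is not the Cufflinks model, which additionally handles multi-mapped reads and effective length, and the example numbers (a 70 bp feature, 20 million mapped fragments) are hypothetical.

```python
def naive_fpkm(fragment_count, feature_length_bp, total_mapped_fragments):
    """Fragments per kilobase of feature per million mapped fragments.

    FPKM = count * 1e9 / (length_in_bp * total_mapped_fragments)
    """
    if feature_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (feature_length_bp * total_mapped_fragments)


# Hypothetical numbers: for a very short feature, a handful of fragments
# already changes the FPKM noticeably.
for count in (0, 4, 40):
    print(count, naive_fpkm(count, 70, 20_000_000))
```

Because the feature length is in the denominator, very short genes such as snoRNAs and miRNAs are especially sensitive to small changes in the number of assigned fragments, so comparing these hand-computed values across samples can show whether the raw data actually differ.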


I'm using Galaxy, so I'm not sure how to just directly count reads. If I understand correctly, looking at the BAM files in IGV should give me an idea of the raw counts. In the IGV grab I can see that there is a similar number of reads across the different samples for Snord34, but again, for one sample I see an FPKM value of 2700 and for the other samples it is 0.


htseq-count is in the Galaxy Tool Shed, so it should be possible to use that. Having said that, you'll find that Galaxy is very limiting.


I'm relatively new to this, so I'm only on the main Galaxy server and don't think I can add tools as a non-admin user. So there's really no way to diagnose this without raw counts? I see that I can move over to another server (GVL-QLD looks good) that has an expanded number of tools, including HTSeq. This raises another question: can you, or anyone else, recommend a server that features an expanded set of tools? I'd love to have some more GO term analysis available, as well as the kinds of basic tools essential for troubleshooting. I realize Galaxy is, as you say, limiting, but I'm not prepared to dive in any deeper at this stage; I'm just looking to manage some hopefully straightforward RNA-seq and ChIP-seq datasets.
