I have a set of RNA-seq data. I have aligned the reads and have the accepted.bam files, and to view the coverage I'm using IGV. It seems there is something wrong with the sample or the sequencing, because I see some stark differences between the replicates at certain places, even though both of them are controls. Is there a way to do absolute quantification of both aligned files? Since both of them are BAM files, is it possible to compare the two BAM files and get the differences?
Any suggestion or help would be highly appreciated.
Probably, but you'll need to exactly define what you want to compare.
I just want to make sure that there are no major differences between the replicates overall. I read about deepTools, which has this bamCompare function; can I use that to see if there are differences?
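For reference, a typical bamCompare call for two replicates might look like the sketch below. The file names are placeholders, and note the version difference: deepTools 2.x uses --ratio log2 where 3.x uses --operation log2.

```bash
# Both BAMs need to be coordinate-sorted and indexed first (placeholder names).
samtools index control_rep1.bam
samtools index control_rep2.bam

# log2 ratio of rep1 over rep2 in 50 bp bins (deepTools 3.x syntax;
# in 2.x use --ratio log2 instead of --operation log2).
bamCompare -b1 control_rep1.bam -b2 control_rep2.bam \
    --operation log2 --binSize 50 \
    -o rep1_vs_rep2_log2.bw
```

Loading the resulting bigWig in IGV, bins far from zero mark the regions where the two replicates disagree.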
If you really want to do this with the BAM files rather than the counts, then use
multiBamSummary
from deepTools followed byplotPCA
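Something along these lines should work; the file and label names are placeholders:

```bash
# Summarise coverage of both replicates in genome-wide bins.
multiBamSummary bins \
    --bamfiles control_rep1.bam control_rep2.bam \
    --labels rep1 rep2 \
    -o readCounts.npz

# PCA and a pairwise correlation heatmap from the same summary file.
plotPCA -in readCounts.npz -o pca.png
plotCorrelation -in readCounts.npz \
    --corMethod spearman --whatToPlot heatmap \
    -o correlation_heatmap.png
```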
Okay, I did use bamCompare, but now I will give multiBamSummary and plotPCA a try.
Do you want to count reads per gene?
Yes, I would like to count reads per gene for both sets.
Then you can use software like htseq-count and featureCounts.
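A minimal featureCounts sketch for gene-level counts; the annotation path, strandedness, and thread count are placeholders to adjust for your data:

```bash
# Count reads per gene for both replicates in one run.
# Set -s 0/1/2 to match your library prep and add -p for paired-end data.
featureCounts -T 4 -s 0 \
    -a annotation.gtf \
    -o gene_counts.txt \
    control_rep1.bam control_rep2.bam
```

The output table has the gene annotation columns first, followed by one count column per BAM file.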
Yes, I have obtained the read counts using featureCounts. So what would be the next step?
Differential expression analysis using e.g. DESeq2.
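As a rough sketch, run from the shell via Rscript and assuming the featureCounts table from above; note that a differential test needs replicates per condition (two control BAMs alone are not enough), so the sample and condition labels here are purely illustrative:

```bash
Rscript - <<'EOF'
library(DESeq2)

# featureCounts output: skip the "#" comment line, use gene IDs as row names,
# and drop the annotation columns (Chr, Start, End, Strand, Length).
fc     <- read.delim("gene_counts.txt", comment.char = "#", row.names = 1)
counts <- as.matrix(fc[, -(1:5)])

# One row per count column; adjust the condition labels to your actual samples.
coldata <- data.frame(condition = factor(c("control", "control", "treated", "treated")),
                      row.names = colnames(counts))

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds)
write.csv(as.data.frame(res), "deseq2_results.csv")
EOF
```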
This is an old post, but my question is slightly different - what are the relative pros and cons of using BAM inputs versus count inputs for PCA or clustering to check across RNA-Seq samples (all of them, or biological replicates, or whatever)?
I can imagine one of the factors being disk space (count inputs are far smaller than BAM inputs), but I have not yet empirically checked whether the PCA or clustering calculations are faster and/or need less RAM with count inputs versus BAM inputs.
The other question is whether the results from count inputs are in any way more useful to check than those from BAM inputs.
Rather than assume the answers, can someone please expound? Tagging Devon Ryan & WouterDeCoster - who answered the OP. But please, others chime in. Thank you.
If you need to do PCA or clustering, take the rlog values, which is what I did after running DESeq2.
It would be much quicker to use the counts. You're also looking at two subtly different things. If you use counts, you're asking, "How do my samples separate according to their quantified gene expression?" If you instead use the BAM files as input, you're asking "How do my samples separate according to genome wide coverage?" The latter will be affected by things like RNA-degradation, while the former won't be so severely affected by that. We would typically only use genome-wide coverage as input for PCA when looking at ChIP-seq/ATAC-seq/etc., since then you're running a PCA on the input that matters for you. In RNA-seq, since you care about genes (or transcripts), then it makes the most sense to use that as input (do use the rlog of it as suggested by krushnack80).
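A minimal sketch of that, assuming the DESeqDataSet from the counting step was saved to a file (dds.rds is a placeholder); for large numbers of samples vst() is a faster alternative to rlog():

```bash
Rscript - <<'EOF'
library(DESeq2)

dds <- readRDS("dds.rds")        # placeholder: the DESeqDataSet built earlier
rld <- rlog(dds, blind = TRUE)   # vst(dds, blind = TRUE) is faster for many samples

# PCA coloured by condition
png("pca_counts.png")
print(plotPCA(rld, intgroup = "condition"))
dev.off()

# Hierarchical clustering on sample-to-sample distances
hc <- hclust(dist(t(assay(rld))))
png("sample_clustering.png")
plot(hc)
dev.off()
EOF
```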
Thanks Devon (and krushnach80) - your explanation of why counts would be better for my RNA-Seq data makes total sense. 2 quick follow-up questions:
When using counts for PCA and hierarchical clustering, what are the typical tools used? And is it normal to use, say, only the '100 most highly expressed genes' to generate these relationships (I have close to 150 RNA-Seq libraries), as suggested by the details at this link? Since the list of such genes would vary across time and between genotypes (this is not a pairwise comparison), I hesitate to follow this subsetting idea. So should I instead use the entire counts data for each library, or for such a large dataset are there practical constraints of time, memory, etc.? (I have a maximum of 64 GB RAM and up to 32 CPUs available.)
Also, a separate question - if I want to look at relationships across ~150 BAM files (because some of my regions of interest are non-genic, e.g. TEs), would some of the deepTools utilities such as multiBamSummary, plotCorrelation, plotPCA, and bigwigCompare be of relevance, or are these tools pertinent only to ChIP-Seq rather than RNA-Seq? If they are valid for RNA-Seq data as well, would I convert my BAM files to bigWig first, before following your recommendations at this archived BioStars Galaxy post?
All this is new to me, so my questions may be very naive, I apologize for that. But thanks in advance.
Use the most variable rather than the most highly expressed genes. You could use multiBigwigSummary (after making normalized bigWigs), but I'd think you'd want to quantify TE expression anyway, so you would have counts of some sort for those instead.
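On the "most variable genes" point: DESeq2's plotPCA already does this selection via its ntop argument, and the sketch below also shows the equivalent manual ranking for clustering. rld.rds is a placeholder for whatever rlog/vst object you already have.

```bash
Rscript - <<'EOF'
library(DESeq2)
library(matrixStats)

rld <- readRDS("rld.rds")   # placeholder: your existing rlog()/vst() object

# PCA on the 500 most variable genes (what plotPCA's ntop argument selects)
png("pca_top_variable.png")
print(plotPCA(rld, intgroup = "condition", ntop = 500))
dev.off()

# Equivalent manual selection, e.g. for hclust() or a heatmap
mat <- assay(rld)
top <- head(order(rowVars(mat), decreasing = TRUE), 500)
mat_top <- mat[top, ]
EOF
```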
I have deepTools/2.5.3 installed. When library sizes are quite variable, what is the preferred normalization method for my RNA-Seq BAM files during the bamCoverage step, to convert them to *.bw format for the multiBigwigSummary step - scaleFactor, RPGC or RPKM? Online, however, two other options are listed, CPM and BPM, which I do not see in my help menu. Am I due for a deepTools version upgrade to the latest 3.4.3, or can I make do with 2.5.3 (upgrading is a small hassle on a shared HPCC)? BTW, for context - my goal is to examine BAM relationships by plotting PCA and correlations. One of your GitHub issues discussions was illuminating, but it's from 2016, so I'd prefer any updates on the matter, please. Thank you in advance, Devon!
RPGC. The online documentation you're looking at is for the current version. You might switch to using conda on your HPC so you're not reliant on the system administrators to do the upgrades.
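For what it's worth, a sketch of that workflow in deepTools 3.x syntax; in 2.5.3 the equivalent of --normalizeUsing RPGC is --normalizeTo1x <effective genome size>. The file names are placeholders and the genome size shown is the GRCh38 value from the deepTools documentation - substitute the value for your organism.

```bash
# 1x-coverage (RPGC) normalised bigWigs for each BAM.
for bam in sample1.bam sample2.bam; do
    bamCoverage -b "$bam" \
        --normalizeUsing RPGC \
        --effectiveGenomeSize 2913022398 \
        --binSize 50 \
        -o "${bam%.bam}.bw"
done

# Summarise the bigWigs, then plot PCA and correlations.
multiBigwigSummary bins -b sample1.bw sample2.bw --labels s1 s2 -o bw_summary.npz
plotPCA -in bw_summary.npz -o bw_pca.png
plotCorrelation -in bw_summary.npz --corMethod spearman --whatToPlot heatmap -o bw_corr.png
```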
conda install was very fast and easy, thanks Devon.
But I'm confused by your suggestion of RPGC for transcriptome BAMs - because at this GitHub discussion, I thought you instructed that for RNA-Seq BAM files, the normalization type is ideally either RPKM or DESeq2's size factor. No?