Question: How to find the differences in aligned bam files
1
gravatar for krushnach80
3.4 years ago by
krushnach80810
krushnach80810 wrote:

I have a set of RNA -seq data , I have aligned them ,have the accepted.bam file, to view the coverage i'm using the IGV tools but it seems there is something wrong with the sample or sequencing if I see there are differences kind stark differences between the replicates at certain places both of them are control.Is there a way to do absolute quantification of both the aligned files .Since both of them are bam files , is it possible to compare both the bam files and get the difference.

Any suggestion or help would be highly appreciated

rna-seq • 1.5k views
ADD COMMENTlink written 3.4 years ago by krushnach80810

Probably, but you'll need to exactly define what you want to compare.

ADD REPLYlink written 3.4 years ago by Devon Ryan96k

I just want to make sure like there are no major differences in the replicates overall. I read about deeptools which has this bamCompare function , can i use that to see if there is a differences ?

ADD REPLYlink written 3.4 years ago by krushnach80810

If you really want to do this with the BAM files rather than the counts, then use multiBamSummary from deepTools followed by plotPCA or similar.

ADD REPLYlink written 3.4 years ago by Devon Ryan96k

okay i dud use the bam compare but now i will give multiBamSummary a try and PCA

ADD REPLYlink written 3.4 years ago by krushnach80810

Do you want to count reads per gene?

ADD REPLYlink written 3.4 years ago by WouterDeCoster44k

yes I would like to do that counts read per gene from both the set

ADD REPLYlink written 3.4 years ago by krushnach80810

Then you can use software like htseq-count and featureCounts.

ADD REPLYlink written 3.4 years ago by WouterDeCoster44k

yes I have taken out read counts using featurecounts .So what would be the next step?

ADD REPLYlink written 3.4 years ago by krushnach80810

Differential expression analysis using e.g. DESeq2.

ADD REPLYlink written 3.4 years ago by WouterDeCoster44k

This is an old post, but my question is slightly different - what are the relative pros and cons of using BAM inputs versus count inputs for PCA or multi-cluster to check across RNA-Seq samples (all of them or biol. reps. or whatever)?

I can imagine one of the factors being diskspace << counts input compared to BAM inputs, but I've not empirically yet checked if the PCA or clustering calculations are faster and/or require less RAM requirements for count inputs vs. BAM inputs.

The other question is whether results from counts vs. BAM inputs are in any way more useful to check.

Rather than assume the answers, can someone please expound? Tagging Devon Ryan & WouterDeCoster - who answered the OP. But please, others chime in. Thank you.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Anand Rao310
2

if you need to do PCA or clustering take rlog values which i did after running it in deseq2

ADD REPLYlink written 4 weeks ago by krushnach80810
2

It would be much quicker to use the counts. You're also looking at two subtly different things. If you use counts, you're asking, "How do my samples separate according to their quantified gene expression?" If you instead use the BAM files as input, you're asking "How do my samples separate according to genome wide coverage?" The latter will be affected by things like RNA-degradation, while the former won't be so severely affected by that. We would typically only use genome-wide coverage as input for PCA when looking at ChIP-seq/ATAC-seq/etc., since then you're running a PCA on the input that matters for you. In RNA-seq, since you care about genes (or transcripts), then it makes the most sense to use that as input (do use the rlog of it as suggested by krushnack80).

ADD REPLYlink written 4 weeks ago by Devon Ryan96k

Thanks Devon (and krushnach80) - your explanation of why counts would be better for my RNA-Seq data makes total sense. 2 quick follow-up questions:

When using counts to obtain PCA and hierarchical clustering, what are typical tools used? And is it normal to use, say only the '100 most highly expressed genes' to generate these relationships (I have close to 150 RNA-Seq libraries), as suggested by details at this link. Since the lis of such genes would vary across time and between genotypes (this is not a pairwise comparison) I hesitate to follow this subsetting idea. So should I instead use the entire counts data for each library - or for such a large dataset, are there practical constraints of time, memory etc.? (I've got a max of 64GB RAM and upto 32 cpus available)

Also, a separate question - if at all I want to look at relationships across ~ 150 BAM files (because some of my regions of interest are non-genic eg. TEs), then would some of the DeepTools utilities such as multiBamSummary, plotCorrelation, plotPCA, and bigwigCompare be of relevance, or are these tools all pertinent not for RNA-Seq but only for ChIP-Seq? But If valid for use with RNA-Seq data as well, would I be converting my BAM files to BigWig first, before following your recommendations at this archived BioStars Galaxy post?

All this is new to me, so my questions may be very naive, I apologize for that. But thanks in advance.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Anand Rao310
1

Use the most variable rather than the most highly expressed genes. You could use multiBigwigSummary (after making normalized bigWigs), but I'd think you'd want to quantify TE expression anyway, so you would have counts of some sort of them instead.

ADD REPLYlink written 4 weeks ago by Devon Ryan96k

I have deepTools/2.5.3 installed. When library sizes are quite variable, what is the preferred normalization method of my RNA-Seq BAM files during the bamCoverage step, to convert them to *.bw format for the multiBigwigSummary step - scaleFactor, RPGC or RPKM? Online, however, the 2 other options are CPM, BPM - which I do not see in my help menu. Am I due for a deepTools version upgrade to the latest 3.4.3, or can I make do with 2.5.3 (it's a small hassle on a shared HPCC)? BTW, for context - my goal is to examine BAM relationships by plotting as PCA and correlations. One of your GitHub issues discussion was illuminating, but it's from 2016. I'd prefer any updates on the matter, please. Thank you, in advance, Devon!

ADD REPLYlink modified 20 days ago • written 20 days ago by Anand Rao310
1

RPGC, the online documentation you're looking at is for the current version. You might switch to using conda on your HPC so you're not reliant on the system administrators to constantly do that.

ADD REPLYlink written 20 days ago by Devon Ryan96k

conda install was very fast and easy, thanks Devon.

But I'm confused by your suggestion of RPGC for transcriptome BAMs - because at this GitHub discussion, I thought you instructed that for RNA-Seq BAM files, the normalization type is ideally either RPKM or DESeq2's size factor. No?

ADD REPLYlink written 20 days ago by Anand Rao310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1278 users visited in the last hour