deepTools NPZ to PCA plot with modifications

Is it possible to modify the plotPCA output of deepTools as follows?

  1. Suppress the scree plot completely
  2. Label the points with user-supplied names that are already found in the default legend
  3. Ensure labels do not overlap (with something like ggrepel)
  4. Control the font size of the output (currently the plot area is huge but the lettering is tiny, making it hard to see everything without zooming back and forth)

If the answer to any of questions 1-4 above is NO, then see question 5 below.

  5. Alternatively, which output file from deepTools' plotPCA can I use to make my own PCA plot with ggplot2? Would it be the *.tab file?

If the answer to question 5 is NO, then see question 6 below.

  6. Is there a syntax / pipeline to convert the NPZ files generated by multiBigwigSummary into a PCA plot? We could not proceed beyond reading in a test NPZ file using the R package reticulate.

Appreciate any help, thank you!


Is there a syntax / pipeline to convert the NPZ files generated by multiBigwigSummary into a PCA plot? We could not proceed beyond reading in a test NPZ file using the R package reticulate.

Did you run into an error? If yes, which one?

Alternatively, which output file from deepTools' plotPCA can I use to make my own PCA plot with ggplot2? Would it be the *.tab file?

Yes, most likely that's the file you're looking for.


The npz files are large matrices of values, so you'd need to run a PCA on them in R. It's easier to run plotPCA with the --outFileNameData option. That produces a text file with the values you want to plot, so you can replot them however you'd like in R.
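For illustration, here is a minimal sketch of that replotting step in R, covering points 1-4 from the question (no scree plot, repelled labels, a larger font). The file names are hypothetical, and the exact column layout of the --outFileNameData table can vary between deepTools versions, so inspect the file first:

# Minimal sketch, assuming plotPCA was run as:
#   plotPCA -in readCounts.npz --outFileNameData pca.tab
# and that pca.tab has one row per component, one column per sample,
# plus an Eigenvalue column (check your version's output first).
library(ggplot2)
library(ggrepel)   # repels overlapping point labels

pca <- read.table("pca.tab", header = TRUE, sep = "\t", comment.char = "#")

# Reshape so each sample becomes one row with its PC1/PC2 coordinates
samples <- setdiff(colnames(pca), c("Component", "Eigenvalue"))
df <- data.frame(
  sample = samples,
  PC1    = as.numeric(unlist(pca[1, samples])),
  PC2    = as.numeric(unlist(pca[2, samples]))
)

# No scree plot, non-overlapping labels, larger base font
ggplot(df, aes(PC1, PC2, label = sample)) +
  geom_point(size = 3) +
  geom_text_repel(size = 5) +
  theme_bw(base_size = 16)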


Thank you for your clarification, Prof. Ryan.

Since our original post, and before your response here, this is what we tried:

  1. Unzip the NPZ output of deepTools' multiBigwigSummary,
  2. Read the resulting separate matrix and labels NPY files into R using the reticulate package, and bind them together,
  3. For the PCA plot: convert the data frame to a matrix, transpose it, run prcomp in base R, and plot with fviz_pca_ind from the factoextra package,
  4. For the correlogram heatmap: calculate all-by-all correlations in base R using cor, and plot this matrix with any of heatmap (base R), gplots' heatmap.2, pheatmap, or ComplexHeatmap.

With these we could generate very nice PCA plots and correlogram heatmaps (a sketch of the steps is below).

But if any of these steps, their order, or their combination needs to be tweaked or altered, please let us know. Thanks again.
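In case it helps others, here is a rough sketch of our steps 1-4 in R. The file name results.npz is made up, and reticulate needs a Python installation with numpy available:

# Rough sketch of steps 1-4 above. "results.npz" is a made-up name for a
# multiBigwigSummary output; reticulate needs a Python with numpy installed.
library(reticulate)
library(factoextra)

np  <- import("numpy")
npz <- np$load("results.npz")

mat <- npz$f$matrix                       # rows = bins, columns = libraries
colnames(mat) <- as.character(npz$f$labels)

# PCA: transpose so that libraries are the observations (rows)
pca <- prcomp(t(mat))
fviz_pca_ind(pca, repel = TRUE)           # repel = TRUE keeps labels apart

# Correlogram heatmap: all-by-all correlations between libraries
cc <- cor(mat)
pheatmap::pheatmap(cc)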


That all looks good. Loading NPZ/NPY files into R is a bit of a pain :)


Dear Prof. Ryan,

We agree! Initially, input of NPZ files into R was such a pain, but the R 'reticulate' package ain't so bad, even for us R noobs :)

However, we've now managed to confuse ourselves :) Could you please help clear our confusion with perhaps some (made-up) numbers to help us understand the math behind bin size and bin counts?

It took over an hour to compose this post, and it's still longish, sorry about that! But we hope it gives you all the info needed to respond with one-liners :)

Thank you, in advance!

Pipeline used:

  - We used 1/(sizeFactor) from DESeq2 with your bamCoverage to convert BAM to bigWig, across all 16 libraries (2 cell types * 2 treatments * 4 replicates each). So there are 4 conditions = 2 cell types * 2 treatments.
  - These sizeFactors reported by DESeq2 varied quite a bit in our opinion, roughly 0.85-1.75. Not sure if that is a problem?
  - We summarized EACH set of 4 replicate bigWig files, one condition at a time, using your multiBigwigSummary.
  - Then we followed steps 1-4 as described in our previous post; no confusion until here :)

Question: Our understanding is that in the NPZ file, column = library and row = bin; is that right? And is this correct: deepTools calculates and summarizes across an identical number of bins (rows) for all libraries (columns) in a multiBigwigSummary NPZ file, so that the same genomic intervals / bins are compared across the different libraries / BAMs / bigWigs? Otherwise, if the number of bins differed, it would become an apples vs. oranges comparison, without validity, right?

However, we now have 3 major points of confusion after this step!

Confusion #1: Across our 4 multiBigwigSummary NPZ files (1 for each 'condition'), the number of rows is exactly the same. Is this because we calculated sizeFactors from DESeq2 for all 16 BAM inputs at the same time, for use in the downstream bamCoverage step? So, had we calculated sizeFactors from DESeq2 separately for the 2 cell types, for example, would the number of rows (i.e. the number of bins) then differ between the datasets split by cell type? At a rather old GitHub discussion, you'd said: bin count = scale factor * number of reads. In this formula, is scale factor = 1 / sizeFactor, where sizeFactor is what we've calculated (in our case) using DESeq2?

Confusion #2: The second point of confusion is how your formula bin count = scale factor * number of reads ties in with the bin size, which defaults to --binSize/-bs = 50 bp for bamCoverage and --binSize/-bs = 10000 bp for multiBigwigSummary.

Confusion #3: The third and final point of confusion is whether bins can be reported for ONLY genomic intervals of interest defined by the user, for example a subset of genes, just transcript-encoding regions, or just 1 of many chromosomes. Is such subsetting possible at the deepTools stage, or would it have to be performed at an upstream or downstream step? I suppose RNA-seq, by definition, would be such a subset, since transcripts arise (ideally and) predominantly from annotated / genic rather than non-genic genomic compartments. So how would one restrict bins to JUST mRNA-encoding regions when using RNA-seq data with deepTools' bamCoverage and multiBigwigSummary?


These sizeFactors reported by DESeq2 varied quite a bit in our opinion, roughly 0.85-1.75

That's a very normal amount.

Our understanding is that in the NPZ file, column = library and row = bin; is that right?

Correct, and yes, the bins come from the exact same regions in all samples (as you note, it would otherwise be an apples-and-oranges comparison).

Confusion #1

The number of bins is the genome size divided by the bin size (-bs), with some rounding because chromosome lengths are not integer multiples of the bin size. The scale factor just scales the signal in the bigWigs; it does not change the number of bins, so the identical row counts have nothing to do with how you calculated the sizeFactors. I think the current code divides by the scale factor, given that one needs to pass 1/x of the values from DESeq2.
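As a quick sanity check with made-up numbers:

# Made-up numbers: a ~3.1 Gb genome at the multiBigwigSummary default bin size
genome_size <- 3.1e9
bin_size    <- 1e4                 # --binSize 10000
ceiling(genome_size / bin_size)    # ~310,000 bins, plus per-chromosome rounding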

Confusion #2

The bin count is the (scaled) read coverage within a bin, not the number of bins in the output file, so it is unrelated to the bin size. With made-up numbers: a bin overlapped by 120 reads, with a scale factor of 1/1.5, gets a bin count of 120 * (1/1.5) = 80.

Confusion #3

If you just want specific intervals, you can use the BED-file mode rather than bins. For a single chromosome, or a portion of one, use the -r option.

For RNA-seq one would typically use a GTF file:

multiBigwigSummary BED-file -b file1.bw file2.bw -o results.npz --BED annotation.gtf


Super awesome, we understood every one of your simplified and clear replies, thanks a lot! :)

One final request: could you confirm / correct our syntax below for bamCoverage and multiBigwigSummary?

bamCoverage -b $IN.bam -o $OUT.bw --normalizeUsing None --effectiveGenomeSize $ACGTonly_NonNs.length --scaleFactor $Inverse_DESeq2_Sizefactor

Is it OK to change --binSize for multiBigwigSummary from the default of 10 kb to 50 bp, so that it matches the bamCoverage default of 50 bp? We'd like higher-resolution coverage tracks, and we don't mind larger NPZ output files...

multiBigwigSummary BED-file -b $BW.list -o $Summary.npz --smartLabels --binSize 50 --numberOfProcessors 4 --BED $annotation.gtf

Please let us know. I think this will be the last of our questions :) Thanks again!


The npz files are just for QC purposes (PCA and so on), so I expect you're just increasing the computational cost with a bin size of 50 without any big benefit.

For your bamCoverage command, you can drop the --effectiveGenomeSize option. Since you're specifying your own scaling factor and using --normalizeUsing None, it gets ignored.
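That is, your command reduces to:

bamCoverage -b $IN.bam -o $OUT.bw --normalizeUsing None --scaleFactor $Inverse_DESeq2_Sizefactor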


Thank you, Prof. Ryan! We are so glad to be on a free-of-cost BioStars forum, with such helpful forum members as yourself. We can learn so much by doing, rather than just sitting in classroom lectures. This is way more fun! :)
