Question

Table 'N-Reads=F(Duplicate,Sample)': How Can I Visualize This?


10.9 years ago

Pierre Lindenbaum 161k

I'd like to visualize the impact of duplicates in my NGS/Haloplex data (with Haloplex, you get a large number of duplicates; see "Haloplex & Allele Calling").

I've extracted the number of read-pairs for each duplicate/INTERVAL (chrom:start-end) and for each sample (1.bam, 2.bam, ...):

#INTERVAL MAX MEAN 1.bam 2.bam 3.bam 4.bam 5.bam 6.bam 7.bam 8.bam 9.bam 10.bam ....
I1 4059 120 0 120 4059 168 151 75 173 165 106 211 8 74 95 356 144 125 98 427 81 
I2 2490 78 0 90 2490 41 28 28 129 73 45 110 65 39 45 160 56 72 40 152 43 74 96 6
I3 61 1 0 0 19 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 42 0 0 0 4 0 0 0 0 0 0 0 0 0 1 
I4 2798 140 0 90 2798 94 86 60 149 97 102 152 158 65 63 225 73 93 46 261 58 76 4
I5 4405 142 0 65 2946 113 58 28 190 104 107 143 73 63 81 266 108 79 60 236 44 65
I6 10 0 0 1 10 0 0 0 2 0 0 3 0 0 0 1 1 1 0 2 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1
I7 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0
I8 1204 32 0 49 1204 18 15 9 60 21 18 79 11 18 20 70 38 27 27 75 21 9 14 2 70 9 
I9 112 1 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 112 0 0 0 37 0 0 0 0 0 0 0 0 0 
(...)

I've uploaded the data (6 MB) at: https://dl.dropboxusercontent.com/u/18871518/dup.tsv.gz

I'd like to see whether the number of read-pairs is homogeneous across samples. Could you suggest a method to visualize that information?

I'm not a #R programmer. I tested this (a heatmap):

T <- read.csv("in.tsv", sep="\t",header=TRUE)
T <- T[order(T$MEAN),]
T <- T[,4:ncol(T)]
M <- data.matrix(T)
png("out.png",width=1000,height=2000)
H <- heatmap(M, Rowv=NA, Colv=NA, col = cm.colors(5000), scale="column", margins=c(5,10),verbose=TRUE)
dev.off()

but I only see one color.

visualization duplicates ngs • 2.5k views


score 1 · Answer 1 · 2013-05-28

Hi Pierre,

I had a quick look at your data and the problem seems to be its distribution: most values are 0 or close to 0, and a few are very high. A log transformation will help, even though in your case there is still a large peak at low values.

plot(density(log2(M + 1)))

Note: I use M + 1 so that 0 values are not converted to -Inf, which can be problematic for some analyses and visualizations. One alternative is to replace the -Inf values with NA. If it makes sense for your analysis, another approach would be to remove the 0 values.
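To make the three options in the note concrete, here is a minimal sketch on a toy matrix. M_toy and its values are made up for illustration and only stand in for the real M.

```r
# Toy count matrix standing in for M (hypothetical values, not the real data).
M_toy <- matrix(c(0, 120, 4059,
                  0,  90, 2490,
                  0,   0,   19),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("I1", "I2", "I3"),
                                c("1.bam", "2.bam", "3.bam")))

# Option 1: add a pseudocount, so that 0 maps to log2(1) = 0.
log_plus1 <- log2(M_toy + 1)

# Option 2: plain log2, then replace the -Inf produced by zeros with NA.
log_na <- log2(M_toy)
log_na[is.infinite(log_na)] <- NA

# Option 3: drop the zeros altogether (this returns a plain vector, so it
# suits a density plot rather than a heatmap).
log_nonzero <- log2(M_toy[M_toy > 0])
```

Options 1 and 2 keep the matrix shape, so either can be passed to a heatmap function; option 3 loses the row/column layout.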

Then, with a heatmap, you get a better view of your data (still not ideal given its distribution, but it seems informative to me). Note that I use the heatmap.2 function from the gplots package, which is more customisable.

library(gplots)
heatmap.2(log2(M + 1), Rowv=NA, Colv=NA, col = colorpanel(n=80, low="cyan", mid="black", high="yellow"))
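As a rough illustration of why the untransformed heatmap can look like a single colour, the sketch below counts how many values fall into the lowest of 80 equal-width colour bins, before and after the log transformation. The vector x is synthetic (an assumption, not the real data), and lowest_share is a hypothetical helper written for this example.

```r
set.seed(1)

# Synthetic skewed counts (assumption): mostly zeros and small values,
# plus a few very large ones.
x <- c(rep(0, 700), rpois(250, 5), round(runif(50, 500, 5000)))

# Hypothetical helper: share of values landing in the lowest of n
# equal-width bins, mimicking a colour scale with n colours.
lowest_share <- function(v, n = 80) {
  mean(as.integer(cut(v, breaks = n)) == 1)
}

raw_share <- lowest_share(x)            # nearly everything in one bin
log_share <- lowest_share(log2(x + 1))  # small non-zero values move to other bins
```

With the raw counts, the few large values stretch the scale so that all the small ones share the first colour; after log2(x + 1) only the zeros remain in that first bin.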

And a small R tip: you can use read.delim instead of read.csv. It is a wrapper around read.table with defaults suited to tab-delimited files (sep="\t", header=TRUE).

T <- read.delim("in.tsv")
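A quick way to check that the two calls are equivalent, using a two-row excerpt of the table passed inline through the text argument (the excerpt is shortened to two samples, and the leading # of the header is left out for simplicity):

```r
# Two rows and two samples taken from the table in the question.
tsv <- "INTERVAL\tMAX\tMEAN\t1.bam\t2.bam\nI1\t4059\t120\t0\t120\nI2\t2490\t78\t0\t90\n"

a <- read.csv(text = tsv, sep = "\t", header = TRUE)
b <- read.delim(text = tsv)

same <- identical(a, b)

# Column names that are not valid R names are adjusted on import
# (check.names = TRUE by default), e.g. "1.bam" becomes "X1.bam".
col_names <- names(b)
```

The renamed sample columns do not affect T$MEAN or the 4:ncol(T) selection used in the question.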

I hope this helped.

Philippe.