Table 'N-Reads=F(Duplicate,Sample)': How Can I Visualize This?
10.9 years ago

I'd like to visualize the impact of duplicates in my NGS/Haloplex data (with Haloplex you get a large number of duplicates; see "Haloplex & Allele Calling").

I've extracted the number of read-pairs for each duplicate/INTERVAL (chrom:start-end) and for each sample (1.bam, 2.bam, ...):

#INTERVAL MAX MEAN 1.bam 2.bam 3.bam 4.bam 5.bam 6.bam 7.bam 8.bam 9.bam 10.bam ....
I1 4059 120 0 120 4059 168 151 75 173 165 106 211 8 74 95 356 144 125 98 427 81 
I2 2490 78 0 90 2490 41 28 28 129 73 45 110 65 39 45 160 56 72 40 152 43 74 96 6
I3 61 1 0 0 19 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 42 0 0 0 4 0 0 0 0 0 0 0 0 0 1 
I4 2798 140 0 90 2798 94 86 60 149 97 102 152 158 65 63 225 73 93 46 261 58 76 4
I5 4405 142 0 65 2946 113 58 28 190 104 107 143 73 63 81 266 108 79 60 236 44 65
I6 10 0 0 1 10 0 0 0 2 0 0 3 0 0 0 1 1 1 0 2 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1
I7 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0
I8 1204 32 0 49 1204 18 15 9 60 21 18 79 11 18 20 70 38 27 27 75 21 9 14 2 70 9 
I9 112 1 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 112 0 0 0 37 0 0 0 0 0 0 0 0 0 
(...)

I've uploaded the data (6 MB) at: https://dl.dropboxusercontent.com/u/18871518/dup.tsv.gz

I'd like to see whether the number of read-pairs is homogeneous across samples. Could you suggest a method to visualize that information?

I'm not a #R programmer. I tested this (a heatmap):

T <- read.csv("in.tsv", sep="\t",header=TRUE)
T <- T[order(T$MEAN),]
T <- T[,4:ncol(T)]
M <- data.matrix(T)
png("out.png",width=1000,height=2000)
H <- heatmap(M, Rowv=NA, Colv=NA, col = cm.colors(5000), scale="column", margins=c(5,10),verbose=TRUE)
dev.off()

but I only see one color.

10.9 years ago
Philippe ★ 1.9k

Hi Pierre,

I had a quick look at your data and the problem seems to be its distribution: most values are 0 or close to 0, and a few are very high. A log transformation will help, even though in your case there is still a big peak at low values.

plot(density(log2(M + 1)))
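To illustrate why the raw scale collapses to a single color, here is a minimal sketch using the first ten sample counts of interval I1 from the question. It splits the value range into five equal-width bins (a stand-in for a linear color scale, and an assumption of mine: it ignores the scale="column" step of the original heatmap call) and compares raw and log2 values.

```r
# First ten sample counts of interval I1, copied from the table in the question
counts <- c(0, 120, 4059, 168, 151, 75, 173, 165, 106, 211)

# Five equal-width bins stand in for a (very coarse) linear color scale
raw_bins <- cut(counts, breaks = 5)
log_bins <- cut(log2(counts + 1), breaks = 5)

# On the raw scale one outlier stretches the range, so nearly all values
# share the lowest bin; after log2 the values spread over more bins
print(table(raw_bins))
print(table(log_bins))
```

With only one extreme value per row, almost every cell ends up in the same bin on the raw scale, which is consistent with the one-color picture.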

Note: I use M + 1 so that 0 values are not converted to -Inf, which can be problematic for some analyses and for data visualization. One alternative is to replace the -Inf values with NA. If it makes sense for your analysis, another approach is to remove the 0 values.
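The two ways of handling zeros can be sketched on a toy matrix. M_toy is a hypothetical stand-in for M, filled with a few counts taken from the table in the question; it is not the real data set.

```r
# Toy count matrix standing in for M (hypothetical, two intervals x three samples)
M_toy <- matrix(c(0, 120, 4059,
                  0,  90, 2490),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("I1", "I2"), c("1.bam", "2.bam", "3.bam")))

# Option 1: add a pseudocount of 1, so zeros map to log2(1) = 0
with_pseudocount <- log2(M_toy + 1)

# Option 2: plain log2, then turn the resulting -Inf values into NA
with_na <- log2(M_toy)
with_na[is.infinite(with_na)] <- NA

print(with_pseudocount)
print(with_na)
```

If you go for the NA version, remember that density() stops on missing values unless you pass na.rm=TRUE.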

A heatmap of the log-transformed values then gives a better view of your data (still not ideal given the distribution, but it seems informative to me). Note that I use the heatmap.2 function from the gplots package, which is more customisable.

library(gplots)
heatmap.2(log2(M + 1), Rowv=NA, Colv=NA, col = colorpanel(n=80, low="cyan", mid="black", high="yellow"))

And a small R tip: instead of read.csv with sep="\t", you can use read.delim, which already has the defaults for tab-delimited files (tab separator, header=TRUE).

T <- read.delim("in.tsv")
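A quick way to convince yourself that the two calls are interchangeable is to read the same file both ways. This is a small sketch on a throwaway file whose contents are made up to mimic the layout of in.tsv; the object names a and b are only for illustration.

```r
# Write a tiny tab-delimited file that mimics the layout of in.tsv
tmp <- tempfile(fileext = ".tsv")
writeLines(c("INTERVAL\tMAX\tMEAN\t1.bam\t2.bam",
             "I1\t4059\t120\t0\t120",
             "I2\t2490\t78\t0\t90"), tmp)

# The call from the question and the shorter read.delim call
a <- read.csv(tmp, sep = "\t", header = TRUE)
b <- read.delim(tmp)

# Both data frames should be the same object, column names included
print(identical(a, b))
print(names(b))

unlink(tmp)
```

Note that in both cases R rewrites column names such as 1.bam into syntactically valid names (check.names=TRUE is the default), so the sample columns come back with an X prefix.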

I hope this helps.

Philippe.
