How to produce a hom/het snp frequency table
1
0
Entering edit mode
9.3 years ago
rob234king ▴ 610

I have snps across different isolates that are either heterozygous or homozygous. What I want to do is convert the data to a hom frequency table that (total hom/no. het + hom) say every 10 million bases and produce a heatmap with each isolate a band on the y axis and on the X axis homozygous snp frequency. Any pointers how could do this in R or even python. Seems like would have been done before but my R skills are low. I can work on a tutorial on heatmap2 in R but not sure how to produce the frequency table?

Example data:

1 = yes, 0 = no.

lib        het  hom  snp_position
LIB10000   1    0    917206
LIB10000   1    0    917912
LIB10000   1    0    2703436
LIB10000   1    0    2736063
LIB10000   0    1    3843431
LIB10000   1    0    5195338
LIB10000   1    0    8054844
LIB10000   1    0    8108156
LIB10000   0    1    8685923
LIB10000   0    1    8983713
LIB10000   0    1    8984241
LIB10000   1    0    9391014
LIB10000   1    0    9658660
LIB10000   1    0    12116052
LIB10000   1    0    15798269
LIB10000   0    1    19809883
LIB10000   0    1    25505855
LIB10000   0    1    25541608
LIB10000   1    0    26855440
LIB10000   1    0    27136672
LIB10000   0    1    28417750
LIB10000   1    0    29906291
LIB10000   0    1    41573928
LIB10000   0    1    55496549
LIB10000   1    0    55651887
LIB10000   1    0    59141554


Output example:

lib       bp        hom (%)
LIB10000  10000000  28.5714285714286
LIB10000  20000000  33.3333333333333
LIB10000  30000000  50.0
LIB10000  40000000  0.0

SNP • 3.4k views
0
Entering edit mode

what does the bp mean in your output example? is it the snp_position? Can you give example input and example output based on that input?

0
Entering edit mode

The example output is what would want. The bp is 10,000,000 bp along the chromosome so how many snps are within first 10,000,000 bp = 13, and 4 are hom so 4/13*100 = 30.7 so the first one is a little off byt the other two are correct 33% and 50%. The snp position says where it is on the chromosome so can count up how many in the window. hope this helps.

1
Entering edit mode
9.3 years ago
Ying W ★ 4.3k

There's two ways I can think of to do this in R. The simple way is to write a function in R that will loop through have a counter that keeps track of where the current 10Mb chunk ends and also a running total of het and hom. When pos in row is greater than that current 10Mb chunk calculate an average based on running total of het and hom.

The more advanced way is to bin the positions using something like round_any(pos_vector, 10000000, f = = ceiling) followed by using tapply(hom_vector, pos_vector, mean)

Example of 2nd way to do it:

example = matrix(c(27136672, 28417750, 29906291, 41573928), nrow=4, ncol=1)
example = cbind(example, c(0,1,0,1))
colnames(example) = c("pos", "hom")
example
pos hom
[1,] 27136672   0
[2,] 28417750   1
[3,] 29906291   0
[4,] 41573928   1

example[,1] = round_any(example[,1], 10000000, f = ceiling) #bin by 10M
tapply(example[,2], example[,1], mean)
3e+07     5e+07
0.3333333 1.0000000


Keep in mind this second method will skip bins where there is no data (such as 40M in example above)

0
Entering edit mode

thanks very much, it's something to work with. I have a number of isolates with few snps so may have to remove those below a total snp threshold or add another dimension to account for snp density. Any ideas how to plot this table giving thresholds of percentage hom different colours please let me know. I think this package may do it "ChromHeatMap" but being an R novice, have to see how easy it is to work with using the tutorial.

0
Entering edit mode

the heatmap function in R could be used or you could just use the image function in R to kind of just kludge what you want to see, I have some sample code that would do it. However, I found the easiest way is often to just put all the values in excel and use conditional formatting (use color scales)