Question

How to calculate the ratio of nucleotide diversity (pi) between two population in each window

1

Entering edit mode

4.8 years ago

Hann ▴ 110

Hello,

I have two files has the calculation of nucleotide diversity (pi) for each population (I have two population) I am trying to get the ratio of PI between the two population ( PI pop1 / PI pop2), in each window and chromosome (they should have the same start in BIN_START

in another word, I want to have the ratio of Pi from the two files (divide Pi in file 1 over Pi in file 2) if the CHROM and BIN_start columns in file 1 = to CHROM and BIN_start in file 2

Example: Here is population one: the nt diversity of population 1 in each window: BIN_START ... BIN_END

CHROM   BIN_START       BIN_END N_VARIANTS      PI
Dexi_CM05836_chr01A     120001  170000  14      3.4207e-05
Dexi_CM05836_chr01A     130001  180000  54      0.000130133
Dexi_CM05836_chr01A     140001  190000  99      0.000356213
Dexi_CM05836_chr01A     150001  200000  162     0.000652928
Dexi_CM05836_chr01A     160001  210000  174     0.000690181
Dexi_CM05836_chr01A     170001  220000  179     0.000721402
Dexi_CM05836_chr01A     180001  230000  244     0.00111935
Dexi_CM05836_chr01A     190001  240000  369     0.0016301
Dexi_CM05836_chr01A     200001  250000  402     0.0017491
Dexi_CM05836_chr01A     210001  260000  454     0.00205598
Dexi_CM05836_chr01A     220001  270000  498     0.00227197

CHROM   BIN_START       BIN_END N_VARIANTS      PI
Dexi_CM05836_chr01A     20001   70000   1       1.20063e-07
Dexi_CM05836_chr01A     30001   80000   1       1.20063e-07
Dexi_CM05836_chr01A     40001   90000   1       1.20063e-07
Dexi_CM05836_chr01A     50001   100000  1       1.20063e-07
Dexi_CM05836_chr01A     60001   110000  1       1.20063e-07
Dexi_CM05836_chr01A     120001  170000  17      2.30029e-05
Dexi_CM05836_chr01A     130001  180000  51      9.37882e-05
Dexi_CM05836_chr01A     140001  190000  61      0.00010983

I am not sure how I would do that in bash ( might be for loop and awk) or R. How I should create an identifier? I need help

Thanks,

bash R population genetics • 2.8k views

ADD COMMENT • link updated 4.8 years ago by benformatics 3.9k • written 4.8 years ago by Hann ▴ 110

score 2 · Accepted Answer · 2019-07-17

I did this for fun with GRanges. You need to decide which file you want the final information to be added to (i.e. either one or two). In my example I add all possible PI ratios to the first file. Non-existent ratios are NA's.

Also this script assumes NO overlapping bins within one file and that the overlapping bins between the two files are identical. If that is not the case then the script NEEDS to be modified.

library(GenomicRanges)

## this needs to be edited to load your two files properly
file1 <- read.delim2('file1.txt',sep='\t',stringsAsFactors = FALSE)
file2 <- read.delim2('file2.txt',sep='\t',stringsAsFactors = FALSE)

## function to convert your table into GRanges
convert.to.gr <- function(table){
  gr <- GRanges(paste0(table$CHROM,":",table$BIN_START,'-',table$BIN_END))
  gr$nvar <- as.numeric(table$N_VARIANTS)
  gr$pi <- as.numeric(table$PI)
  return(gr)
}

## convert your two tables to GRanges
first.set <- convert.to.gr(file1)
second.set <- convert.to.gr(file2)

## find overlapping regions
all.overlaps <- findOverlaps(first.set,second.set,type='equal')
## extract those indices
ol.gr <- first.set[queryHits(all.overlaps)]
## calculate the ratio
ol.gr$pi_ratio <- ol.gr$pi/second.set$pi[subjectHits(all.overlaps)]

## add the pi ratio info back to original
first.set$pi_ratio <- NA
first.set$pi_ratio[queryHits(all.overlaps)] <- ol.gr$pi_ratio

## export as CSV file... note that there are two new columns strand and pi_ratio
write.csv(first.set,'pi_ratio.csv',quote=FALSE)