Question

how to find the overlapping genomic sites for two big mpileup files and make a window file around these overlapping sites

0

Entering edit mode

6.2 years ago

jing.mengrabbit ▴ 60

I have two mpileup files generated by samtools, their sizes are larger than 10GB. I would like to choose the genomic sites that are present in both mpileup files, and generate a matrix file. The matrix file has 201 columns, and the columns contain mpileup information for the genomic sites of overlapping genomic site +/-100 bp. The problem is that these two files are too big and it is impossible to read them all into memory. If every time reads N lines, there are two possible questions: 1, some genomic sites in the first mpileup file are present in the >N lines in the second mpileup. 2, It is impossible to make a matrix file for the genomic sites in the (N-100, N) lines. Also, the total matrix file will be very big. Could you please give me some hints to resolve the above problems? Thanks for your time!

sequence alignment genome • 1.1k views

ADD COMMENT • link updated 5.3 years ago by Biostar 20 • written 6.2 years ago by jing.mengrabbit ▴ 60

score 1 · Answer 1 · 2019-01-11

Also this is a quite old question, maybe you or other are interested in a possible way.

First let's create a bed file, containing intervals of covered site in each mpileup file. As this question is 10 month old, I assume you have the output of samtools mpileup where the first column contains the chromosome name and second the position.

$ awk -v OFS="\t" '{print $1,$2-1,$2}' mpileup1 | bedtools merge > mpileup_file1.bed
$ awk -v OFS="\t" '{print $1,$2-1,$2}' mpileup2 | bedtools merge > mpileup_file2.bed

Now you can intersect these to files, to find common regions:

$ bedtools intersect -a mpileup_file1.bed -b mpileup_file2.bed > mpileup_intersect.bed

To extend these common intervals you can use bedtools slop, which would require a file genome file. Or you use a quick-and-dirty awk solution:

$ awk -v OFS="\t" '{print $1, $2-100, $3+100}' mpileup_intersect.bed > mpileup_padded.bed

The next step is to compress your mpileup files using bgzip and index them with tabix.

$ bgzip -c mpileup1 > mpileup1.gz
$ bgzip -c mpileup2 > mpileup2.gz

$ tabix -s 1 -b 2 -e 2 mpileup1.gz
$ tabix -s 1 -b 2 -e 2 mpileup2.gz

Now we are able, to quickly receive the information at the positions given in our mpileup_padded.bed.

$ tabix -R mpileup_padded.bed mpileup1.gz
$ tabix -R mpileup_padded.bed mpileup2.gz

Use the output how ever you like.

fin swimmer