how to find the overlapping genomic sites for two big mpileup files and make a window file around these overlapping sites
1
0
Entering edit mode
6.2 years ago

I have two mpileup files generated by samtools, their sizes are larger than 10GB. I would like to choose the genomic sites that are present in both mpileup files, and generate a matrix file. The matrix file has 201 columns, and the columns contain mpileup information for the genomic sites of overlapping genomic site +/-100 bp. The problem is that these two files are too big and it is impossible to read them all into memory. If every time reads N lines, there are two possible questions: 1, some genomic sites in the first mpileup file are present in the >N lines in the second mpileup. 2, It is impossible to make a matrix file for the genomic sites in the (N-100, N) lines. Also, the total matrix file will be very big. Could you please give me some hints to resolve the above problems? Thanks for your time!

sequence alignment genome • 1.1k views
ADD COMMENT
1
Entering edit mode
5.3 years ago

Also this is a quite old question, maybe you or other are interested in a possible way.

First let's create a bed file, containing intervals of covered site in each mpileup file. As this question is 10 month old, I assume you have the output of samtools mpileup where the first column contains the chromosome name and second the position.

$ awk -v OFS="\t" '{print $1,$2-1,$2}' mpileup1 | bedtools merge > mpileup_file1.bed
$ awk -v OFS="\t" '{print $1,$2-1,$2}' mpileup2 | bedtools merge > mpileup_file2.bed

Now you can intersect these to files, to find common regions:

$ bedtools intersect -a mpileup_file1.bed -b mpileup_file2.bed > mpileup_intersect.bed

To extend these common intervals you can use bedtools slop, which would require a file genome file. Or you use a quick-and-dirty awk solution:

$ awk -v OFS="\t" '{print $1, $2-100, $3+100}' mpileup_intersect.bed > mpileup_padded.bed

The next step is to compress your mpileup files using bgzip and index them with tabix.

$ bgzip -c mpileup1 > mpileup1.gz
$ bgzip -c mpileup2 > mpileup2.gz

$ tabix -s 1 -b 2 -e 2 mpileup1.gz
$ tabix -s 1 -b 2 -e 2 mpileup2.gz

Now we are able, to quickly receive the information at the positions given in our mpileup_padded.bed.

$ tabix -R mpileup_padded.bed mpileup1.gz
$ tabix -R mpileup_padded.bed mpileup2.gz

Use the output how ever you like.

fin swimmer

ADD COMMENT

Login before adding your answer.

Traffic: 2908 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6