Question: how to find the overlapping genomic sites for two big mpileup files and make a window file around these overlapping sites
gravatar for jing.mengrabbit
18 months ago by
jing.mengrabbit30 wrote:

I have two mpileup files generated by samtools, their sizes are larger than 10GB. I would like to choose the genomic sites that are present in both mpileup files, and generate a matrix file. The matrix file has 201 columns, and the columns contain mpileup information for the genomic sites of overlapping genomic site +/-100 bp. The problem is that these two files are too big and it is impossible to read them all into memory. If every time reads N lines, there are two possible questions: 1, some genomic sites in the first mpileup file are present in the >N lines in the second mpileup. 2, It is impossible to make a matrix file for the genomic sites in the (N-100, N) lines. Also, the total matrix file will be very big. Could you please give me some hints to resolve the above problems? Thanks for your time!

alignment sequence genome • 430 views
ADD COMMENTlink modified 6 months ago by Biostar ♦♦ 20 • written 18 months ago by jing.mengrabbit30
gravatar for finswimmer
7 months ago by
finswimmer12k wrote:

Also this is a quite old question, maybe you or other are interested in a possible way.

First let's create a bed file, containing intervals of covered site in each mpileup file. As this question is 10 month old, I assume you have the output of samtools mpileup where the first column contains the chromosome name and second the position.

$ awk -v OFS="\t" '{print $1,$2-1,$2}' mpileup1 | bedtools merge > mpileup_file1.bed
$ awk -v OFS="\t" '{print $1,$2-1,$2}' mpileup2 | bedtools merge > mpileup_file2.bed

Now you can intersect these to files, to find common regions:

$ bedtools intersect -a mpileup_file1.bed -b mpileup_file2.bed > mpileup_intersect.bed

To extend these common intervals you can use bedtools slop, which would require a file genome file. Or you use a quick-and-dirty awk solution:

$ awk -v OFS="\t" '{print $1, $2-100, $3+100}' mpileup_intersect.bed > mpileup_padded.bed

The next step is to compress your mpileup files using bgzip and index them with tabix.

$ bgzip -c mpileup1 > mpileup1.gz
$ bgzip -c mpileup2 > mpileup2.gz

$ tabix -s 1 -b 2 -e 2 mpileup1.gz
$ tabix -s 1 -b 2 -e 2 mpileup2.gz

Now we are able, to quickly receive the information at the positions given in our mpileup_padded.bed.

$ tabix -R mpileup_padded.bed mpileup1.gz
$ tabix -R mpileup_padded.bed mpileup2.gz

Use the output how ever you like.

fin swimmer

ADD COMMENTlink written 7 months ago by finswimmer12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 566 users visited in the last hour