Question: how to find the overlapping genomic sites for two big mpileup files and make a window file around these overlapping sites
0
gravatar for jing.mengrabbit
12 months ago by
jing.mengrabbit30 wrote:

I have two mpileup files generated by samtools, their sizes are larger than 10GB. I would like to choose the genomic sites that are present in both mpileup files, and generate a matrix file. The matrix file has 201 columns, and the columns contain mpileup information for the genomic sites of overlapping genomic site +/-100 bp. The problem is that these two files are too big and it is impossible to read them all into memory. If every time reads N lines, there are two possible questions: 1, some genomic sites in the first mpileup file are present in the >N lines in the second mpileup. 2, It is impossible to make a matrix file for the genomic sites in the (N-100, N) lines. Also, the total matrix file will be very big. Could you please give me some hints to resolve the above problems? Thanks for your time!

alignment sequence genome • 327 views
ADD COMMENTlink modified 23 days ago by Biostar ♦♦ 20 • written 12 months ago by jing.mengrabbit30
1
gravatar for finswimmer
5 weeks ago by
finswimmer10k
Germany
finswimmer10k wrote:

Also this is a quite old question, maybe you or other are interested in a possible way.

First let's create a bed file, containing intervals of covered site in each mpileup file. As this question is 10 month old, I assume you have the output of samtools mpileup where the first column contains the chromosome name and second the position.

$ awk -v OFS="\t" '{print $1,$2-1,$2}' mpileup1 | bedtools merge > mpileup_file1.bed
$ awk -v OFS="\t" '{print $1,$2-1,$2}' mpileup2 | bedtools merge > mpileup_file2.bed

Now you can intersect these to files, to find common regions:

$ bedtools intersect -a mpileup_file1.bed -b mpileup_file2.bed > mpileup_intersect.bed

To extend these common intervals you can use bedtools slop, which would require a file genome file. Or you use a quick-and-dirty awk solution:

$ awk -v OFS="\t" '{print $1, $2-100, $3+100}' mpileup_intersect.bed > mpileup_padded.bed

The next step is to compress your mpileup files using bgzip and index them with tabix.

$ bgzip -c mpileup1 > mpileup1.gz
$ bgzip -c mpileup2 > mpileup2.gz

$ tabix -s 1 -b 2 -e 2 mpileup1.gz
$ tabix -s 1 -b 2 -e 2 mpileup2.gz

Now we are able, to quickly receive the information at the positions given in our mpileup_padded.bed.

$ tabix -R mpileup_padded.bed mpileup1.gz
$ tabix -R mpileup_padded.bed mpileup2.gz

Use the output how ever you like.

fin swimmer

ADD COMMENTlink written 5 weeks ago by finswimmer10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1294 users visited in the last hour