Filter only overlapping portion of read pairs?

0

Entering edit mode

16 months ago

Tom Ellis • 0

I am working with paired-end bisulphite-treated short-read data. I would like to extract only the portion of the reads where the two reads overlap. I have raw fastq files, and sorted bams after alignment using Bismark. Just clipping N bp from the 5 or 3 prime end isn't a robust solution because the length of the non-overlapping regions vary depending on insert size.

Can anyone suggest away to just extract the portion of read pairs that overlap with one another?

Anticipating that someone will ask: the reason I want to do this is that I'm looking at data from a new bisulphite protocol, and it looks like the methylation calls are only reliable in the area where the reads overlap. I have not worked out why.

bisulfite paired-end • 988 views

ADD COMMENT • link updated 16 months ago by Jack Tierney ▴ 350 • written 16 months ago by Tom Ellis • 0

0

Entering edit mode

Name sort the BAM file, and then using biopython (Bio.AlignIO) you can find the overlap region by taking the maximum start value, and minimum end value for the pairs. Obviously check each pair for overlap first before doing this.

ADD REPLY • link 16 months ago by rpolicastro 13k

0

Entering edit mode

Thanks. When you say 'name sort' do you mean something like samtools sort?

ADD REPLY • link 16 months ago by Tom Ellis • 0

0

Entering edit mode

Yup, I'd say so. Name sorting is done with samtools sort -n.

ADD REPLY • link 16 months ago by Jack Tierney ▴ 350

0

Entering edit mode

Also it seems that Bio.AlignIO can't handle bam files directly, so i'd have to convert to another format first. What would be a sensible format to convert to that doesn't destroy the alignment information? I am new to this and the array of formats is still confusing

ADD REPLY • link 16 months ago by Tom Ellis • 0

0

Entering edit mode

Rather than Bio.AlignIO, I would recommend parsing the file with pysam and then using get_overlap (see here) find out if consecutive reads overlap

ADD REPLY • link 16 months ago by Jack Tierney ▴ 350

Login before adding your answer.