Question

Efficiently join paired-end read coordinates in the same line?

1

Entering edit mode

9.4 years ago

daniel.soronellas ▴ 330

Dear community,

I have a huge paired-end HiC dataset (BAM format) which I want to format like this way:

HWI-D00283:117:C5KKJANXX:2:1101:1139:77789    chr6    153338506    153338556    37    +    chr6    153338031    153338081    37    -
HWI-D00283:117:C5KKJANXX:2:1101:1139:77856    chr6    149915169    149915219    37    -    chr6    149914908    149914958    37    +
HWI-D00283:117:C5KKJANXX:2:1101:1139:79414    chr4    184474969    184475019    37    -    chr4    184474811    184474861    37    +
HWI-D00283:117:C5KKJANXX:2:1101:1139:81280    chr6    153641723    153641773    37    -    chr6    153641551    153641601    37    +
HWI-D00283:117:C5KKJANXX:2:1101:1139:81917    chr8    87070282    87070332    37    -    chr8    87069851    87069901    37    +
HWI-D00283:117:C5KKJANXX:2:1101:1139:82575    chr17    56970884    56970934    37    -    chr6    151400450    151400500    37    -
HWI-D00283:117:C5KKJANXX:2:1101:1139:86642    chr6    150043041    150043091    37    -    chr6    150042915    150042965    37    +

This is an example which I obtained by first converting the BAM format to BED and separating each mate into different files and then with a AWK command joined the mates.

This is the awk command I used:

awk 'NR==FNR {h[$4] = $1"\t"$2"\t"$3"\t"$5"\t"$6; next} {OFS="\t"; print $4,$1,$2,$3,$5,$6,h[$4]}' mate1 mate2

This command worked fine with a small dataset (1M, 10M reads), but when I tried with 200M reads file, it crashes because memory reasons I suppose. Is there a way to efficiently join paired-end reads as I showed in my example?

Thanks!

HiC paired-end 4C 3C • 2.7k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by daniel.soronellas ▴ 330

1

Entering edit mode

9.4 years ago

Devon Ryan 104k

If you're OK with query-sorting the BAM files first, then something like the following script using pysam should work (this is untested)

#!/usr/bin/env python
import pysam
fp = pysam.Samfile("foo.bam", "rb")
of = open("somefile.txt", "w")
for b in fp.fetch() :
    if b.mate_is_unmapped :
        continue
    if b.is_read1 :
        of.write("%s\t%s\t%i\t%i\t%i" % (b.qname, fp.getrname(b.tid), b.pos, b.aend, b.mapq))
        if b.is_reverse :
            of.write("\t-")
        else :
            of.write("\t+")
    else :
        of.write("\t%s\t%i\t%i\t%i" % (fp.getrname(b.tid), b.pos, b.aend, b.mapq))
        if b.is_reverse :
            of.write("\t-\n")
        else :
            of.write("\t+\n")
of.close()
fp.close()

ADD COMMENT • link 9.4 years ago by Devon Ryan 104k

0

Entering edit mode

you should check that the reads are properly paired (<- problem with discordant pairs)

ADD REPLY • link 9.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Discordant pairs are largely the species of interest with HiC (or at least I'd think so...I've never done it).

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

Ram · Accepted Answer · 2014-11-13

3

Entering edit mode

9.4 years ago

Pierre Lindenbaum 161k

extract the mapped reads F, sort on name:

samtools view -f 64 -F 3976 your.bam | cut -f LC_ALL=C sort -k1,1 > F.txt

extract the mapped read R, sort on name:

samtools view -f 128 -F 3976 your.bam | LC_ALL=C sort -k1,1 > R.txt

join both files and format the output with awk:

LC_ALL=C join -t '    ' -1 1 -2 1 F.txt R.txt | awk -f your.script > result.txt

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

-f 64 -F 3912 won't work very well :) I assume you mean -f 64 -F 3976

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Devon Ryan 104k

0

Entering edit mode

yes, thanks !

ADD REPLY • link 9.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you! Worked fine!

ADD REPLY • link 9.4 years ago by daniel.soronellas ▴ 330