Question

Working With Large *.Bam Files

13.7 years ago

User 4133 ▴ 150

Hello everyone,

This is the first time I have used BioStar to ask a question, and I would like to congratulate everyone who contributes.

Now, my question: I'm working with a very large *.bam file (about 65 GB) of paired-end reads from an entire transcriptome. My final goal is to find translocations, so as a preliminary step I'm looking for read pairs whose mates map to different chromosomes.

Because of the file size, I can't open the .bam file as a text file. How can I overcome this problem? Are you familiar with the 'pysam' module for Python? Any other ideas?

Thanks

bam • 5.6k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 13.7 years ago by User 4133 ▴ 150

Ram · Answer 1 · 2010-08-16

Each record in the SAM/BAM file contains the reference sequence name and the mate reference sequence name. You can stream through the file looking for records where these two names are different. This will identify pairs that have their ends mapped to different chromosomes. No need for any data transformations at all; the process is the same whether your SAM/BAM file is 1 GB or 100 GB.

e.g.

samtools view myfile.bam | perl -ne '@f=split; print if $f[6] ne "=" && $f[4] >= 20'

This says: print any SAM record where field 5 (MAPQ, the mapping quality; $f[4] in Perl's zero-based indexing) is >= 20 and field 7 (RNEXT, the mate reference name; $f[6]) is not "=", i.e. the mate is not on the same reference as the read itself. This is just a simplified example; you might want to look at the alignment flags field too.

Be aware that the mate reference name may appear as * if the mate fields have not been set in your BAM file; such records also pass the filter above, because * is not equal to =. Whether this happens depends on how the BAM file was made.

Also, it's probably a bad idea to convert a BAM file that size to SAM because of disk I/O overheads; operate on a stream instead.
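
For anyone who would rather do the same streaming filter in Python, here is a minimal sketch. It reads SAM text line by line, so memory use stays constant whatever the file size. The function name `interchromosomal` and the `min_mapq` parameter are made up for illustration; the flag bits and column positions are those of the SAM format. It also skips records whose mate reference is * and reads flagged as unmapped, secondary, or supplementary, which is one reasonable choice rather than the only one.

```python
# SAM flag bits, as defined in the SAM format specification
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_MATE_UNMAPPED = 0x8
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def interchromosomal(lines, min_mapq=20):
    """Yield SAM records whose mate maps to a different reference sequence.

    `lines` is any iterable of SAM text lines (for example sys.stdin).
    The name and the default threshold are illustrative assumptions.
    """
    skip = FLAG_UNMAPPED | FLAG_MATE_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY
    for line in lines:
        if line.startswith("@"):       # header line, not an alignment
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:           # malformed or truncated record
            continue
        flag = int(fields[1])          # column 2: FLAG
        rname = fields[2]              # column 3: RNAME
        mapq = int(fields[4])          # column 5: MAPQ
        rnext = fields[6]              # column 7: RNEXT
        if not flag & FLAG_PAIRED or flag & skip:
            continue
        if rnext in ("=", "*") or rnext == rname:
            continue                   # same reference, or mate fields not set
        if mapq >= min_mapq:
            yield line
```

Ending the script with `import sys; sys.stdout.writelines(interchromosomal(sys.stdin))` lets it sit at the end of a pipe, e.g. `samtools view myfile.bam | python find_split_pairs.py` (the script name is hypothetical).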

score 2 · Answer 2 · 2010-08-16

BAM is a binary file, so you can't use it as a .txt file.

If you use SAM, could you just build a file listing the chromosome for each read pair, and then run sort/uniq on that file?

If memory is a problem, I would build a database of [pair-id, chrom] records with a SQL engine or, even better, a key/value engine (Berkeley DB, etc.).
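
A rough sketch of the database idea, using SQLite from the Python standard library as the SQL engine (my choice for illustration, not necessarily what you would pick). The table name `hits` and the functions `load_pairs` and `split_pairs` are hypothetical. It assumes both mates of a pair share the same read name in column 1, which is the usual SAM convention but worth checking for your aligner.

```python
import sqlite3


def load_pairs(conn, lines):
    """Store one (read name, reference name) row per mapped SAM record."""
    conn.execute("CREATE TABLE IF NOT EXISTS hits (qname TEXT, rname TEXT)")
    rows = (
        (f[0], f[2])                   # column 1: QNAME, column 3: RNAME
        for f in (line.split("\t") for line in lines if not line.startswith("@"))
        if len(f) >= 11 and f[2] != "*"   # skip malformed and unmapped records
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    conn.commit()


def split_pairs(conn):
    """Return names of read pairs whose records hit more than one reference."""
    query = (
        "SELECT qname FROM hits GROUP BY qname "
        "HAVING COUNT(DISTINCT rname) > 1 ORDER BY qname"
    )
    return [row[0] for row in conn.execute(query)]
```

With `conn = sqlite3.connect("pairs.db")` (the file name is just an example) the table lives on disk rather than in memory, and `load_pairs(conn, sys.stdin)` can be fed from `samtools view` in the same way as any other stream.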

score 0 · Answer 3 · 2010-08-16

13.7 years ago

User 4133 ▴ 150

Thank you. In my question I omitted that the conversion from BAM to SAM has already been done... my problem is just the memory. I will try. Thanks.

ADD COMMENT • link 13.7 years ago by User 4133 ▴ 150

score 0 · Answer 4 · 2010-08-16

13.7 years ago

User 4133 ▴ 150

Thank you, Keith James. I have another question: how can you compute the mapping score from the binary code? Can you sum over all alignment positions? And, in that case, what is the best score in your opinion?

ADD COMMENT • link 13.7 years ago by User 4133 ▴ 150

Hi, ilwollo. Could you open a new question for this? If you want to 'Add Another Answer' to your own question, it should be to offer a solution. If you just want to reply to an answer, please use the 'add comment' link instead.

ADD REPLY • link 13.7 years ago by biobot 0.0.77.a.1099 6.2k