Question: Working With Large *.Bam Files
5
8.9 years ago, User 4133 wrote:

Hello everyone, this is the first time I have used BioStar to ask a question, and I would like to congratulate everyone who contributes. My question: I'm working with a very large *.bam file (about 65 GB) of paired-end reads from an entire transcriptome. My final goal is to find translocations, so as a preliminary step I'm looking for read pairs whose mates map to different chromosomes. Because of the file size, I can't handle the .bam file as if it were a .txt file. How can I overcome this problem? Do you know the 'pysam' module for Python? Any other ideas? Thanks

paired bam file • 3.4k views
modified 8.5 years ago • written 8.9 years ago by User 4133
8
8.9 years ago, iw9oel_ad wrote:

Each record in the SAM/BAM file contains the reference sequence name and the mate reference sequence name. You can stream through the file looking for records where these two names are different. This will identify pairs that have their ends mapped to different chromosomes. No need for any data transformations at all; the process is the same whether your SAM/BAM file is 1 GB or 100 GB.

e.g.

samtools view myfile.bam | perl -ne '@f = split; print if $f[6] ne "=" && $f[4] >= 20'

This prints any SAM record whose mapping quality (MAPQ, the fifth SAM column, which is $f[4] in Perl's zero-based indexing) is >= 20 and whose mate reference name (RNEXT, the seventh column, $f[6]) is not "=", i.e. the mate is not on the same reference as the read itself. This is just a simplified example; you might want to look at the alignment flags field too.

Be aware that the mate reference name may appear as '*' if the mate-pair fields have not been set in your BAM file. This will depend on how the BAM file was made.

Also, it's probably a bad idea to convert a BAM file that size to SAM because of disk I/O overheads; operate on a stream instead.
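Since the question mentions Python, here is a minimal sketch of the same streaming filter in plain Python. It is an illustration under stated assumptions, not the answerer's own method: the function names (`is_interchromosomal`, `filter_stream`, `main`) are made up for this example, the input is assumed to be tab-separated SAM text such as `samtools view` produces, and the extra flag checks (paired, both ends mapped, primary alignment only) are one possible reading of the advice to look at the flags field.

```python
import sys

# SAM FLAG bits, as defined in the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_interchromosomal(line, min_mapq=20):
    """Return True if a SAM record's mate maps to a different reference."""
    if line.startswith("@"):            # header line, not an alignment
        return False
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:                # malformed or truncated record
        return False
    flag = int(fields[1])
    rname, mapq, rnext = fields[2], int(fields[4]), fields[6]
    if not flag & PAIRED:
        return False
    # Assumption: skip unmapped ends and non-primary alignments.
    if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    # "=" means same reference; "*" means the mate fields were never set.
    if rnext in ("=", "*") or rnext == rname:
        return False
    return mapq >= min_mapq


def filter_stream(lines, min_mapq=20):
    """Lazily yield matching records, holding one line in memory at a time."""
    for line in lines:
        if is_interchromosomal(line, min_mapq):
            yield line


def main():
    """Read SAM text from stdin and write matching records to stdout."""
    for record in filter_stream(sys.stdin):
        sys.stdout.write(record)
```

Saved as a script (say, a hypothetical `interchrom.py`) with a final `main()` call, this could be used as `samtools view myfile.bam | python interchrom.py`. Because `filter_stream` is a generator, memory use should not grow with the size of the BAM file.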

2
8.9 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

BAM is a binary file, so you can't use it as a .txt file.

If you use SAM, can you just build a file listing every read pair with its chromosome and then run sort/uniq on that file?

If memory is a problem, I would build a database of [pair-id, chrom] records with an SQL engine or, even better, a key/value engine (Berkeley DB, etc.).
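The [pair-id, chrom] database idea could be sketched as follows, using the SQL-engine variant with SQLite from Python's standard library. This is only an illustration: the table name `hits`, the function names, and the choice of SQLite are assumptions made for the example, and the input is assumed to be SAM text lines. The default `":memory:"` database is for demonstration only; for a file of this size you would pass an on-disk path so the table does not have to fit in RAM.

```python
import sqlite3


def sam_to_records(lines):
    """Yield (read name, reference name) from SAM text, skipping headers."""
    for line in lines:
        if line.startswith("@"):
            continue
        fields = line.split("\t")
        # Skip records with no reference name (unmapped reads).
        if len(fields) > 2 and fields[2] != "*":
            yield fields[0], fields[2]


def build_pair_table(records, db_path=":memory:"):
    """Store (pair_id, chrom) rows; records is any iterable of 2-tuples."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS hits (pair_id TEXT, chrom TEXT)")
    conn.executemany("INSERT INTO hits VALUES (?, ?)", records)
    # Index after loading so the GROUP BY below does not need a full sort.
    conn.execute("CREATE INDEX IF NOT EXISTS hits_pair ON hits (pair_id)")
    conn.commit()
    return conn


def split_pairs(conn):
    """Return {pair_id: sorted chromosomes} for pairs seen on >1 chromosome."""
    query = (
        "SELECT pair_id, GROUP_CONCAT(DISTINCT chrom) FROM hits "
        "GROUP BY pair_id HAVING COUNT(DISTINCT chrom) > 1"
    )
    # Assumes chromosome names contain no commas.
    return {pid: sorted(chroms.split(",")) for pid, chroms in conn.execute(query)}
```

For example, `split_pairs(build_pair_table(sam_to_records(open("myfile.sam")), "pairs.db"))` would list the pair IDs whose ends landed on different chromosomes; both file names here are placeholders.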

0
8.9 years ago, User 4133 wrote:

Thank you. In my question I omitted that the conversion from BAM to SAM has already been done; my problem is just the memory. I will try this. Thanks.

0
8.9 years ago, User 4133 wrote:

Thank you, Keith James. I have another question: how can you compute the mapping score from the binary code? Can you sum over all alignment positions? And, in that case, what is the best score in your opinion?

1

Hi, ilwollo. Could you open a new question for this? If you want to 'Add Another Answer' to your own question, it should be to offer a solution. If you just want to reply to an answer, please use the 'add comment' link instead.

written 8.9 years ago by iw9oel_ad