Hello everyone, this is my first time using Biostar to ask a question, and I would like to congratulate everyone who contributes. Now, my question: I'm working with a very large *.bam file (about 65 GB) of paired-end reads from an entire transcriptome. My final goal is to find translocations, so as a preliminary step I'm looking for read pairs whose two mates map to different chromosomes. Because of the file size, I can't handle the .bam file as if it were a .txt file. How do I overcome this problem? Do you know the 'pysam' module for Python? Any other ideas? Thanks
Each record in the SAM/BAM file contains the reference sequence name and the mate reference sequence name. You can stream through the file looking for records where these two names are different. This will identify pairs that have their ends mapped to different chromosomes. No need for any data transformations at all; the process is the same whether your SAM/BAM file is 1 GB or 100 GB.
samtools view myfile.bam | perl -ne '@f=split; print if $f[6] ne "=" && $f[4] >= 20'
This says print any SAM record where field 5 (MAPQ, the mapping score) is >= 20 and field 7 (RNEXT, the mate reference name) is not "=", i.e. the mate is not on the same reference as the read itself. Perl arrays are zero-based, so these fields are $f[4] and $f[6]. This is just a simplified example; you might want to look at the alignment flags field too.
Be aware that the mate reference name may appear as "*" if the mate-pair fields have not been set in your BAM file. This will depend on how the BAM file was made.
Also, it's probably a bad idea to convert a BAM file that size to SAM because of disk I/O overheads; operate on a stream instead.
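Since the question mentions Python, here is a rough Python equivalent of the streaming filter described above. It is only a sketch under some assumptions: the function name `interchromosomal` and the `min_mapq` parameter are made up for illustration, the input is SAM text (for example the output of `samtools view`), and it additionally uses the flags field to require paired reads with both ends mapped. Records whose mate reference is "*" are skipped rather than reported.

```python
# SAM flag bits used below (values from the SAM specification).
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_MATE_UNMAPPED = 0x8


def interchromosomal(lines, min_mapq=20):
    """Yield SAM lines whose mate maps to a different reference sequence.

    `lines` is any iterable of SAM text lines, so the whole file never
    has to be held in memory. Name and defaults are illustrative only.
    """
    for line in lines:
        if line.startswith("@"):
            continue  # header line (only present with `samtools view -h`)
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        flag = int(fields[1])
        rname = fields[2]      # field 3: reference name of this read
        mapq = int(fields[4])  # field 5: mapping quality
        rnext = fields[6]      # field 7: reference name of the mate
        if not flag & FLAG_PAIRED:
            continue
        if flag & (FLAG_UNMAPPED | FLAG_MATE_UNMAPPED):
            continue
        if rnext in ("=", "*") or rnext == rname:
            continue  # same chromosome, or mate information not set
        if mapq >= min_mapq:
            yield line
```

In a script fed by `samtools view myfile.bam`, something like `sys.stdout.writelines(interchromosomal(sys.stdin))` would print the matching records as they stream past.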
BAM is a binary file, so you can't use it as a .txt file.
If you use SAM, could you just build a file listing each read pair and its chromosome, and then run sort/uniq on that file?
If memory is a problem, I would build a database of [pair-id, chrom] with a SQL engine or, even better, a key/value engine (Berkeley DB, etc.).
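The database idea could look roughly like the following. This is a minimal sketch, assuming SQLite (through Python's built-in `sqlite3` module) as the SQL engine rather than a key/value store; the function name, the table name `ends`, and the (pair_id, chrom) input format are all made up for illustration.

```python
import sqlite3


def pairs_on_different_chromosomes(records, db_path=":memory:"):
    """Return the pair ids that were seen on more than one chromosome.

    `records` is an iterable of (pair_id, chrom) tuples, one per mapped
    read end. Pass a file name as `db_path` to keep the table on disk
    instead of in memory. Hypothetical helper, not part of any library.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS ends (pair_id TEXT, chrom TEXT)")
    con.executemany("INSERT INTO ends VALUES (?, ?)", records)
    con.commit()
    # A pair is of interest when its ends hit more than one chromosome.
    rows = con.execute(
        "SELECT pair_id FROM ends "
        "GROUP BY pair_id HAVING COUNT(DISTINCT chrom) > 1 "
        "ORDER BY pair_id"
    ).fetchall()
    con.close()
    return [row[0] for row in rows]
```

For example, feeding it ("r1", "chr1") and ("r1", "chr2") would report "r1", while a pair with both ends on chr1 would not be reported.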