I am struggling with annotating a very big .bam file that was mapped using TopHat. The run was a large number of reads : ~200M. The problem is that when I now try to Annotate each read using a GFF file (with BEDTools Intersect Bed), the BED file that is made is huge : It is over 1.7TB ! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT dept increased $TMPDIR local disk space to 1.5TB so I could run everything on $TMPDIR, but it is still not enough.
What I think I should do is split this .BAM file into several files, maybe 15, so that each set of reads gets Annotated separately on a different node. That way, I would not run out of disk space. And when all the files are annotated, I can do execute groupBy on each, and them simply sum the number of reads that each feature on the GFF got throughout all the files.
However, there is a slight complication to this: After the annotation using IntersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and assigns divides each read by the number of times it mapped. I.e, if a read mapped to 2 regions, each instance of the read is worth 1/2, such that it would only contribute 1/2 a read to each of the features it mapped to.
Because of this, I need to have all the alignments from the .BAM file that belong to each read, contained in one single file. That is to say, I can't simply split the BAM file into 15 files, because without luck, I could end up with a 2 BAM files that have the alignments of a single read split between them, leading to the division not being correct.
How can I instruct UNIX to count a certain number of unique reads on the BAM file, output all the alignments to a new file, and continue with the rest of the BAM file, such that all reads have their n alignments contained in one single file (but shared with other reads)?