Splitting chromosomes into contigs within BAM and GTF files
1
1
Entering edit mode
7 weeks ago
MAB ▴ 10

Hi folks,

I am working with BAMs containing aligned reads for individual samples (RNAseq data). Some unforeseen issues with the data mean that I need to build BAI indices for compatibility with the software DegNorm. However, I am dealing with an organism that has large chromosomes exceeding the 512Mb limit for the BAI index.

Accordingly, I am looking for a way to split the chromosomes into intervals (e.g., chr1:1-700,000,000 is split into chr1a:1-350,000,000; chr1b:1-350,000,000...) within each BAM file. Similarly, I need to split the chromosomes within the corresponding reference genome's GTF file. Unfortunately, most of the questions online deal with splitting BAMs into separate files by chromosome name, or else renaming chromosomes within BAMs or GTFs. I would appreciate any help.

I suspect the easiest answer is to split the reference genome and then realign the samples, but I am hopeful there is another solution that does not involve re-doing earlier data processing.

Cheers, M

samtools gtf chromosome annotation bam • 651 views
ADD COMMENT
1
Entering edit mode

an alternative approach might be to try to make DegNorm work with "CSI" BAM indexes instead of "BAI". CSI indexes support data exceeding 512Mb. it might be as easy as adjusting this line of code, which just checks for bai, you could remove the check completely for example so that it does not try to create a bai index https://github.com/jipingw/DegNorm/blob/2e3e9c950849869289b8651f8c735aba4774cff5/R/coverage_cal_functions.R#L32-L36 then you can index you BAM files with "samtools index -c" outside of the R script

ADD REPLY
0
Entering edit mode

This is a good idea and one I'll try tomorrow. It's not clear to me where the index is actually used in this function, so it may not even be necessary to make CSI indices at all after commenting out the code you've highlighted.

Edit: An index is definitely required. Will update if a CSI index works with DegNorm.

Edit2: DegNorm eventually calls on the function ScanBamParam from Rsamtools. This function does an additional check of whether end coordinates abide a maximum value of 536870912. This results in an error given my large chromosomes, and I cannot seem to open the code to remove this check. Unless anyone has some more insight into this, I am a bit uneasy about further modifying DegNorm's code and the code of other packages it depends on. But, cheers cmdcolin for the idea!

ADD REPLY
0
Entering edit mode
6 weeks ago
mmhryc • 0

This shouldn't be too hard with AWK or python with pysam, no? All you need to do is modify chromosome/RNAME and positions for each line (and modify BAM header).

Though if aligning your reads and other processing takes less than 5h I would just rerun it with a split reference. It appears simple but getting those things right usually takes more time than it seems (and there are potential unforeseen difficulties).

ADD COMMENT

Login before adding your answer.

Traffic: 2416 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6