Count features of particulair reads in a BAM file
0
2
Entering edit mode
4.9 years ago
harmbrugge ▴ 20

I'm trying to set up a scRNA-seq pipeline and run into difficulties as my sample sizes are growing (>50k cells).

Right now I create a BAM file for every cell and proces them individually, but I would like to batch multiple cells into single BAM files to prevent large amount of files. Are there any tools available that allow counting reads against a GFF/GTF file for every read group for example. Or some other tag so that I get counts for every cell without having to create a BAM file for every single cell individually.

I'm using featureCount right know, but looked at a few different tools like htseq-count and bedtools.

RNA-Seq bam counts • 1.6k views
0
Entering edit mode

I don't know the formatting of your GTF file, (in fact i've never used GTF files before), but as it's only like 25 lines of code as a SeQC module i've made something to get you started:

To use it:

1. Download SeQC, and unpack it to a get a folder of .stat files and some other stuff.
2. Download the gist above. Save it in the same folder as all the other .stat files and name it GTF.stat. Also make sure to change the hard-coded filepath of the GTF file to where your GTF file is stored.
3. run pip install pysam to get pysam and pip install intervaltree to get intervaltree. I only just found out about interval tree for this demo, and i'm using it because it's simple to install and use. I'm fairly confident it's not as fast as other interval-parsing code like Devon's or even some of my own stuff in numpy, but if you know some python and it really is too slow i'm sure you can swap it out for something else :) I have nothing to compare it to, so who knows, maybe it's ok.
4. Run python SeQC.js.py --input your.bam --analysis RGID GTF

You can do a lot more with SeQC than what you asked. Run SeQC with -h to see all the other modules. You can combine modules to do more powerful stuff in SQL later on like "Counts per GTF feature that are properly mapped in a pair, no duplicates, on chromosome 1, 4 and 8", or whatever.

0
Entering edit mode

It's not really intended for that, but a bit of python with the deeptoolsintervals package (just install deepTools and it'll be available) could do that. The deeptoolsintervals package allows for quick queries of overlaps and can return things like the name of the overlapped gene/feature. One could just make a hash of those per cell and efficiently be able to handle thousands of them. I could make that package a little more convenient for stuff like this (the underlying C code is already oriented toward it), but it's already usable.

BTW, posting this as a comment, since it'd be nicer if there's something premade for this.