Count features of particulair reads in a BAM file
Entering edit mode
4.9 years ago
harmbrugge ▴ 20

I'm trying to set up a scRNA-seq pipeline and run into difficulties as my sample sizes are growing (>50k cells).

Right now I create a BAM file for every cell and proces them individually, but I would like to batch multiple cells into single BAM files to prevent large amount of files. Are there any tools available that allow counting reads against a GFF/GTF file for every read group for example. Or some other tag so that I get counts for every cell without having to create a BAM file for every single cell individually.

I'm using featureCount right know, but looked at a few different tools like htseq-count and bedtools.

Thanks in advance.

RNA-Seq bam counts • 1.6k views
Entering edit mode

I don't know the formatting of your GTF file, (in fact i've never used GTF files before), but as it's only like 25 lines of code as a SeQC module i've made something to get you started:

To use it:

  1. Download SeQC, and unpack it to a get a folder of .stat files and some other stuff.
  2. Download the gist above. Save it in the same folder as all the other .stat files and name it GTF.stat. Also make sure to change the hard-coded filepath of the GTF file to where your GTF file is stored.
  3. run pip install pysam to get pysam and pip install intervaltree to get intervaltree. I only just found out about interval tree for this demo, and i'm using it because it's simple to install and use. I'm fairly confident it's not as fast as other interval-parsing code like Devon's or even some of my own stuff in numpy, but if you know some python and it really is too slow i'm sure you can swap it out for something else :) I have nothing to compare it to, so who knows, maybe it's ok.
  4. Run python --input your.bam --analysis RGID GTF

As always with interval work i've made a lot of assumptions about what you would consider to be a "hit". For this, at least 1bp of the read has to overlap the GTF interval. This code only cares about reads, not fragments (beginning of read to end of mate), and it's very crude, with the end position of the read just being calculated as POS + len(SEQ). Really you should parse the CIGAR string and add that to the POS to get the end of the read, especially if your dealing with RNA-Seq, but how exactly to parse the CIGAR string is something i can't decide for you - so i didn't :) Maybe you want to include skips, maybe not. Maybe you want to include deletions, maybe not. Maybe when two reads in a pair are on different chromosomes you want to use just the reads, maybe you want to bin the whole fragment, etc etc. For now this is going to get you 99.999% of the way there I think.

You can do a lot more with SeQC than what you asked. Run SeQC with -h to see all the other modules. You can combine modules to do more powerful stuff in SQL later on like "Counts per GTF feature that are properly mapped in a pair, no duplicates, on chromosome 1, 4 and 8", or whatever.

Entering edit mode

It's not really intended for that, but a bit of python with the deeptoolsintervals package (just install deepTools and it'll be available) could do that. The deeptoolsintervals package allows for quick queries of overlaps and can return things like the name of the overlapped gene/feature. One could just make a hash of those per cell and efficiently be able to handle thousands of them. I could make that package a little more convenient for stuff like this (the underlying C code is already oriented toward it), but it's already usable.

BTW, posting this as a comment, since it'd be nicer if there's something premade for this.


Login before adding your answer.

Traffic: 2290 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6