Is there a way to output the single counts for every region with samtools?
1
1
Entering edit mode
5.8 years ago
schelarina ▴ 30

hello

Is there a way to output the single counts for every region with samtools?

For example  I use

samtools -c file.bam chrom1:10-1000 chrom1:50-800

and it gives the sum of the counts of the two regions

how to get the counts for the two regions separately?

RNA-Seq • 1.1k views
1
Entering edit mode
5.8 years ago

There's no way to do that directly with samtools, but a few lines of python with pysam could do it.

1
Entering edit mode

Not by running samtools two separate times :-)

1
Entering edit mode

This would work :) But if OP wants to expand to solve the general problem of counting N regions of a BAM file, it becomes messy quickly. Better to use a different approach (like using pysam/hts-python).

Also, um, I remember trying to do this myself via samtools and having a bad time. I think, either it doesn't count reads which end in the window but start outside - or it does - i can't remember. Regardless, it's ambiguous and the OP should look into it (if they haven't already). If my memory serves me correctly and samtools doesn't count reads which start and end in the region, and the OP wants that, they'll have to use something like python.

If they are doing Paired End sequencing, he/she also needs to think about the DNA between reads (not counting read/mate twice - counting stuff in between even if both read and mate aren't within the region, etc) then they'll have to pile that stuff up and count it. Good times.

1
Entering edit mode

BTW, samtools view -c will include alignments starting/ending within a region.

0
Entering edit mode

It will include alignments that span over, but don't directly align in, a bin as well. For example:

$samtools view -h foo.bam @SQ SN:1 LN:1000000 spliced_read 0 1 1 40 5M1000N5M * 0 0 ACGTACGTAC FFFFFFFFFF NM:i:0$ samtools view -c foo.bam 1:100-200
1


This is handled in things like mpileup (and depth), but not the default -c option code. Of course this only matters with spliced data.

0
Entering edit mode

Ahh, ok got yah :) Thanks for clearing that up for me.
I think I was doing it for P.E. ChIP data, which I guess can be thought of as a single spliced fragment ;)

Hm. Come to think of it, it would be kind of neat if there was a way to turn P.E. into S.E. (modifying the CIGAR to treat it as a splice), because then tools which read the data don't have to match reads up in-memory. A lot of my run-through-a-bam-and-calculate-statistics tools in pysam have to store the read in memory and wait until it comes across the mate before it can calculate the stat. About 80-90% of my memory usage is spent keeping track of reads without mates (because if the mate doesn't map, it's at the end of the BAM when sorted on POS). Single-Ending the reads would be one nice way to tidy that problem up... kind of off-topic though.

1
Entering edit mode

You have the TLEN field, just ignore the downstream mate and you needn't blow up the memory. Note that this won't work if you're fetching reads near a region first, since reads that would be extended into a region wouldn't get fetched (there's no good way around that, this is due to how the BAM format works).

0
Entering edit mode

Yeah, that's true - you'd have to pile up everything in the chromosome I guess. But sometimes there are issues where from the read you cant infer everything about the mate, which was why I was thinking about PE -> SE. I made this little diagram to wrap my head around what values you can/cant get from the mate:

The result is that some statistics like mate-length, or mate 3' end, depending on the read type, can't be inferred without knowing the CIGAR/SEQ of the mate (I call PSEQ, but that doesnt exist). Now that I think about it, you could add PSEQ in as a tag and fix this gap. Meh.