Question

Is there a way to output the single counts for every region with samtools?

1

Entering edit mode

8.2 years ago

schelarina ▴ 50

Hello

Is there a way to output the single counts for every region with samtools? For example I use samtools -c file.bam chrom1:10-1000 chrom1:50-800 and it gives the sum of the counts of the two regions. How to get the counts for the two regions separately?

RNA-Seq • 1.8k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by schelarina ▴ 50

Ram · Answer 1 · 2016-02-01

1

Entering edit mode

8.2 years ago

Devon Ryan 104k

There's no way to do that directly with samtools, but a few lines of python with pysam could do it.

ADD COMMENT • link 8.2 years ago by Devon Ryan 104k

1

Entering edit mode

Not by running samtools two separate times :-)

ADD REPLY • link 8.2 years ago by GenoMax 141k

1

Entering edit mode

This would work :) But if OP wants to expand to solve the general problem of counting N regions of a BAM file, it becomes messy quickly. Better to use a different approach (like using pysam/hts-python).

Also, um, I remember trying to do this myself via samtools and having a bad time. I think, either it doesn't count reads which end in the window but start outside - or it does - i can't remember. Regardless, it's ambiguous and the OP should look into it (if they haven't already). If my memory serves me correctly and samtools doesn't count reads which start and end in the region, and the OP wants that, they'll have to use something like python.

If they are doing Paired End sequencing, he/she also needs to think about the DNA between reads (not counting read/mate twice - counting stuff in between even if both read and mate aren't within the region, etc) then they'll have to pile that stuff up and count it. Good times.

ADD REPLY • link 8.2 years ago by John 13k

1

Entering edit mode

BTW, samtools view -c will include alignments starting/ending within a region.

ADD REPLY • link 8.2 years ago by Devon Ryan 104k

0

Entering edit mode

It will include alignments that span over, but don't directly align in, a bin as well. For example:

$ samtools view -h foo.bam
@SQ	SN:1	LN:1000000
spliced_read	0	1	1	40	5M1000N5M	*	0	0	ACGTACGTAC	FFFFFFFFFF	NM:i:0
$ samtools view -c foo.bam 1:100-200
1

This is handled in things like mpileup (and depth), but not the default -c option code. Of course this only matters with spliced data.

ADD REPLY • link 8.2 years ago by Devon Ryan 104k

0

Entering edit mode

Ahh, ok got yah :) Thanks for clearing that up for me.
I think I was doing it for P.E. ChIP data, which I guess can be thought of as a single spliced fragment ;)

Hm. Come to think of it, it would be kind of neat if there was a way to turn P.E. into S.E. (modifying the CIGAR to treat it as a splice), because then tools which read the data don't have to match reads up in-memory. A lot of my run-through-a-bam-and-calculate-statistics tools in pysam have to store the read in memory and wait until it comes across the mate before it can calculate the stat. About 80-90% of my memory usage is spent keeping track of reads without mates (because if the mate doesn't map, it's at the end of the BAM when sorted on POS). Single-Ending the reads would be one nice way to tidy that problem up... kind of off-topic though.

ADD REPLY • link 8.2 years ago by John 13k

1

Entering edit mode

You have the TLEN field, just ignore the downstream mate and you needn't blow up the memory. Note that this won't work if you're fetching reads near a region first, since reads that would be extended into a region wouldn't get fetched (there's no good way around that, this is due to how the BAM format works).

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by Devon Ryan 104k

0

Entering edit mode

Yeah, that's true - you'd have to pile up everything in the chromosome I guess. But sometimes there are issues where from the read you cant infer everything about the mate, which was why I was thinking about PE -> SE. I made this little diagram to wrap my head around what values you can/cant get from the mate:

The result is that some statistics like mate-length, or mate 3' end, depending on the read type, can't be inferred without knowing the CIGAR/SEQ of the mate (I call PSEQ, but that doesnt exist). Now that I think about it, you could add PSEQ in as a tag and fix this gap. Meh.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by John 13k