Counting soft clipped bases and reads
2
0
Entering edit mode
6.0 years ago

Hello,

I'm looking for a way to count the number of soft clipped bases and reads in my bam file and calculate their fraction. What's the best way to do it?

Thanks a lot.

fin swimmer

bam • 4.1k views
ADD COMMENT
3
Entering edit mode
6.0 years ago

using bioalcidaejdk : http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

and the following code

final long counts[]=new long[4];
stream().forEach(R->{
    counts[0]++;
    if(R.getReadUnmappedFlag()) return;
    Cigar cigar = R.getCigar();
    if(cigar==null || cigar.isEmpty()) return;
    counts[1]++;
    counts[2] += cigar.getReadLength();
    counts[3] += cigar.
        getCigarElements().
        stream().
        filter(C->C.getOperator().isClipping()).
        mapToInt(C->C.getLength()).
        sum();
    });

println("NUM-READS:"+counts[0]);
println("NUM-MAPPED-READS:"+counts[1]);
println("SUM-MAPPED-READS-LENGTH:"+counts[2]);
println("SUM-CLIPPING:"+counts[3]);

usage/example:

$ java -jar dist/bioalcidaejdk.jar --nocode -f biostar.code src/test/resources/S1.bam 

NUM-READS:1998
NUM-MAPPED-READS:1998
SUM-MAPPED-READS-LENGTH:139860
SUM-CLIPPING:123
ADD COMMENT
0
Entering edit mode

Hello Pierre,

"SUM-CLIPPING" is the sum of clipped bases? If so, how can I modify the code that I get the sum of clipped reads in addition?

fin swimmer

ADD REPLY
0
Entering edit mode

that I get the sum of clipped reads in addition?

final long counts[]=new long[5];
(...)
if(cigar.isClipped()) counts[4]++;
(...)
println("NUM-READS-CLIPPED:"+counts[4]);
ADD REPLY
1
Entering edit mode

Works very well.

Thanks a lot.

ADD REPLY
3
Entering edit mode
6.0 years ago
d-cameron ★ 2.9k

The GRIDSS package contains a picard-like tool gridss.CollectCigarMetrics which will calculate and summarise the the frequency of each CIGAR element length and operation. Your fraction can then be trivially calculated using read count, and number of reads with 0 soft clip CIGAR elements.

ADD COMMENT
0
Entering edit mode

Nice package. Is there a way to run just CollectCigarMetrics instead of the whole CallVariants pipeline?

fin swimmer

ADD REPLY
1
Entering edit mode

I've designed it so every step can be run as a stand-alone program. Just run java -cp gridss.jar gridss.analysis.CollectCigarMetrics. example_pipeline.sh has an example of using CollectGridssMetrics but CollectCigarMetrics works the same way.

ADD REPLY

Login before adding your answer.

Traffic: 2901 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6