If A Read Is Clipped, What Is The Preferred Way To Make Tag Counts?
11.2 years ago
KCC ★ 4.1k

I want to write a program that converts SAM files to genome coverage (so wiggle or bedgraph format). So, my question is about processing the output of the aligner. My program would work a little bit like the genomeCoverageBed function in bedtools:

genomeCoverageBed -bg -d -ibam reads.bam -g genome.csv

However, I wouldn't have to do the extra step of translating from SAM to BAM.

Now, it's reasonably straightforward to scan through a SAM file and pick out the strand and location of a tag. The length of the read can be inferred. Of course, one will often know the length of the reads anyway.
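As a rough illustration of that scanning step, here is a minimal Python sketch. It assumes standard tab-separated SAM records, where the second field is the bitwise FLAG (0x4 = unmapped, 0x10 = reverse strand), the third is the reference name, the fourth is the 1-based leftmost position, and the sixth is the CIGAR string. The function name `parse_sam_line` is made up for this example.

```python
def parse_sam_line(line):
    """Return (rname, pos, strand, cigar) for a mapped record, else None.

    Hypothetical helper for illustration; it only looks at the
    mandatory SAM fields FLAG, RNAME, POS and CIGAR.
    """
    if line.startswith("@"):          # header line
        return None
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:                    # read is unmapped
        return None
    strand = "-" if flag & 0x10 else "+"
    return fields[2], int(fields[3]), strand, fields[5]


record = "r1\t16\tchr1\t100\t60\t5S45M\t*\t0\t0\t*\t*"
print(parse_sam_line(record))
```

Note that POS refers to the first aligned base, so soft-clipped bases at the start of the read are not included in it.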

My question is how to handle the hard/soft clipping in terms of the length of the tag. Presumably, taking the clipping into account would mean dropping a few bases at the start or the end, thus having a shorter read. This would affect the tag count totals in the output to my function.
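To make the effect on length concrete, the following sketch computes how many reference bases an alignment spans from its CIGAR string. It assumes the SAM convention that M, D, N, = and X consume the reference, while I and S consume only the read and H and P consume neither. The names `reference_span` and `alignment_end` are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance along the reference


def reference_span(cigar):
    """Number of reference bases spanned; clipped and inserted bases are excluded."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)


def alignment_end(pos, cigar):
    """1-based inclusive end coordinate, given the 1-based SAM POS."""
    return pos + reference_span(cigar) - 1


print(reference_span("5S45M"))      # a 50-base read with 5 soft-clipped bases
print(alignment_end(100, "5S45M"))
```

With this approach a 50-base read with CIGAR 5S45M contributes 45 bases, starting at POS, rather than 50.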

In DNA-seq, it seems like it doesn't make much sense to take clipping into account, because the location of the read is what matters. Any feedback would be appreciated.

genome-coverage sam • 3.2k views

I think of read clipping as something that is done by the aligner. Perhaps you are talking about read trimming (prior to alignment)? Could you clarify?

11.2 years ago

IMO if the read is clipped then the section that was clipped did not cover the genome, so it should not be accounted for in the coverage or in any other manner. I would treat it as if that particular read was shorter.
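A minimal sketch of that idea in Python, under a few assumptions: input is plain SAM text, only M, = and X bases count as coverage (clipped bases, insertions, deletions and skipped regions do not), and output is bedGraph-style 0-based half-open intervals. Whether deletions and skipped regions should count is a separate design choice. The function names are hypothetical.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def covered_blocks(pos, cigar):
    """Yield 0-based half-open (start, end) reference blocks of aligned bases."""
    start = pos - 1                       # SAM POS is 1-based
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":
            yield start, start + length
            start += length
        elif op in "DN":
            start += length               # advance without counting coverage
        # I, S, H and P do not advance along the reference


def bedgraph_from_sam(lines):
    """Return sorted (chrom, start, end, depth) tuples for depth > 0."""
    deltas = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        if int(f[1]) & 0x4 or f[5] == "*":    # unmapped or no CIGAR
            continue
        for s, e in covered_blocks(int(f[3]), f[5]):
            deltas[f[2]][s] += 1              # depth rises at block start
            deltas[f[2]][e] -= 1              # and falls at block end
    out = []
    for chrom in sorted(deltas):
        depth, prev = 0, None
        for p in sorted(deltas[chrom]):
            if deltas[chrom][p] == 0:         # no change in depth here
                continue
            if depth > 0 and prev is not None:
                out.append((chrom, prev, p, depth))
            depth += deltas[chrom][p]
            prev = p
    return out


sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t1\t60\t2S4M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t3\t60\t4M2S\t*\t0\t0\t*\t*",
]
for row in bedgraph_from_sam(sam):
    print("\t".join(map(str, row)))
```

In the toy input, each read has six bases but only the four aligned ones add to the depth, so each clipped read is effectively treated as shorter.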


I was thinking that at least in DNA-seq, we want to place the fragment. What mechanisms would cause edges of the read not to map? If this mechanism is a corruption of these bases then we could still use the number of bases to figure out how far the edge of the fragment extends. If these bases are bases appended to the edge of the read, then the number of bases is useless information.


Genomic structural variations would be the simplest and most likely explanation.

But even if the clipping were caused by incorrectly called bases or other errors, you should not extend the reads, because doing so generates data that you cannot later distinguish from actually measured values.
