Question: annotation of duplicate reads
ifreecell wrote, 6.4 years ago:

Hi there,

I have a lot of duplicate reads in my RNA-seq data, which is fine from my point of view. My question: is there any way to get the coordinates of these duplicated reads (> 10 reads) from a BAM/SAM file? Ideally I would annotate these regions with overlapping or flanking genes and output the results in BED format. Any suggestions are appreciated; solutions using R packages are preferred.

Here are some artificial example datasets for testing (they might be deleted after one year).


If you are only interested in retrieving the list of genes that have many duplicate reads, just compare the read-count files generated from the BAM file before and after removing or flagging duplicate reads. Read-counting software such as HTSeq will count overlapping reads toward the totals. You can then compare the two count files and extract genes whose counts drop by more than 10 between them. Retrieving intergenic regions that show duplication is not as straightforward, but it can be accomplished with bedtools.

— Ashutosh Pandey, 6.4 years ago
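A minimal sketch of this comparison, assuming two HTSeq-count-style two-column tab-separated files (gene, count) produced before and after duplicate removal; the file contents, gene names, and the drop threshold of 10 here are illustrative:

```python
# Compare htseq-count outputs before and after duplicate removal and
# report genes whose count drops by more than a threshold.
# The two-column TAB-separated format matches htseq-count's usual
# output; the toy data below stands in for your real count files.

def load_counts(lines):
    """Parse htseq-count style 'gene<TAB>count' lines into a dict."""
    counts = {}
    for line in lines:
        gene, count = line.rstrip("\n").split("\t")
        if not gene.startswith("__"):      # skip htseq summary rows
            counts[gene] = int(count)
    return counts

def genes_with_duplicates(before, after, min_drop=10):
    """Genes whose count fell by more than min_drop after dedup."""
    return sorted(g for g in before
                  if before[g] - after.get(g, 0) > min_drop)

# toy example in place of real count files
before = load_counts(["GeneA\t120", "GeneB\t15", "__no_feature\t999"])
after  = load_counts(["GeneA\t95",  "GeneB\t12", "__no_feature\t800"])
print(genes_with_duplicates(before, after))   # ['GeneA']
```

With real data you would read each file with `open(...)` and pass the file handle to `load_counts`.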

What have you tried? This would be possible in any language with convenient access to BAM files (C, Python, R, Java, etc.).

— Devon Ryan, 6.4 years ago

Could you give me a clue?

— ifreecell, 6.4 years ago

The general structure would be: read in a coordinate-sorted BAM file and create a buffer, in a language of your choosing, to hold one alignment. Then iterate through the alignments, checking whether each alignment matches the one in the buffer. If it does, increment a counter. If it doesn't, check the counter to see whether there are enough duplicates to record the location, then reset the buffer and repeat. This is easy enough with single-end reads, but it can be a bit more complicated with paired-end reads, depending on how you want to define duplicates.

— Devon Ryan, 6.4 years ago
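The buffer-and-counter loop described above can be sketched in pure Python. Plain (chrom, start, end, strand) tuples stand in for real BAM records here; with an actual coordinate-sorted BAM you would iterate with pysam's `AlignmentFile` and build the same key per alignment. The function name and the threshold of 10 are illustrative:

```python
# Single-pass duplicate-region detection over a coordinate-sorted
# stream of alignments, simplified to (chrom, start, end, strand)
# tuples. Runs of identical keys longer than min_dups are reported
# as BED-like (chrom, start, end, count) records.

def duplicate_regions(alignments, min_dups=10):
    """Return (chrom, start, end, count) for alignment keys that
    repeat more than min_dups times in a coordinate-sorted stream."""
    regions = []
    buffer_key, count = None, 0
    for aln in alignments:
        key = aln  # single-end: chrom/start/end/strand define a duplicate
        if key == buffer_key:
            count += 1
        else:
            if buffer_key is not None and count > min_dups:
                regions.append((*buffer_key[:3], count))
            buffer_key, count = key, 1
    if buffer_key is not None and count > min_dups:   # flush last run
        regions.append((*buffer_key[:3], count))
    return regions

# toy stream: 12 identical reads at chr1:100-150, then a unique read
reads = [("chr1", 100, 150, "+")] * 12 + [("chr1", 500, 550, "+")]
for chrom, start, end, n in duplicate_regions(reads):
    print(f"{chrom}\t{start}\t{end}\tdup_count={n}")   # BED-like output
```

The resulting intervals could then be intersected with a gene annotation (e.g. via bedtools, or GenomicRanges in R) to attach overlapping or flanking genes.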

