Question

annotation of duplicate reads

1

Entering edit mode

9.7 years ago

ifreecell ▴ 220

Hi there,

I have a lot of duplicate reads in my RNA-seq, which is great from my view. My question is is there any way to get the coordinates of these duplicate reads (> 10 reads) from a BAM/SAM file? It will be perfect to be able to annotate these regions with overlapping or flanking genes and output the results in a bed format. Any suggestions are appreciated, but solutions using R packages are more preferred.

Here are some artificial example datasets (might be deleted after one year) to be tested.

bed RNA-Seq BAM annotation duplicate-reads • 3.0k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by ifreecell ▴ 220

1

Entering edit mode

If you are only interested in retrieving list of genes which have lots of duplicate reads then just compare the read count files generated using bam file before and after removal or flagging of duplicate reads. Read count softwares like HTseq will consider overlapping reads towards the counts. You can then compare these two count files and extract genes that show reduction of more than 10 counts between these two files. Retrieving the intergenic regions that show duplication is not that straightforward but can be accomplished using bedtools.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by Ashutosh Pandey 12k

0

Entering edit mode

What have you tried? It'd be possible to do this in any language with convenient access to BAM files (C, python, R, java, etc.).

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Could you just give some clue?

ADD REPLY • link 9.7 years ago by ifreecell ▴ 220

1

Entering edit mode

The general structure would be to read in a coordinate sorted BAM file and then create a buffer in a language of your choosing to hold one alignment. You'd then iterate through the alignments checking if a given alignment matches that in the buffer. If it does, increment a counter. If the alignment doesn't match that in the buffer, then look at the counter and see if there are enough duplicates to note the location. This would then be repeated. This is easy enough with single-end reads, but can be a bit more complicated with paired-end reads, depending on how you want to define duplicates.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k