Question

How to generate coverage counts for duplicate reads

0

Entering edit mode

5.9 years ago

axolotl • 0

Greetings,

I have an RNA-seq dataset (HiSeq 3000, poly-A selected, 100 bp PE reads) with very high duplication, and I want to create coverage plots of the duplicates to see if they are evenly distributed or represent only a few genes. After aligning to the genome with HISAT2, I've flagged the duplicates with Picard and used Samtools depth to get the counts (of non-duplicates). I then plot them in R.

My problem is that I cannot get counts for just the duplicate reads. Here is what I've tried so far:

Extract duplicates from Picard MarkDuplicates output.

samtools view -h -f 0x400 input_markdups.bam > dups.sam

Calculate counts. Produces empty file with no errors.

samtools depth dups.sam > dups_depth.txt

Thinking that depth may be ignoring the duplicates (i.e. all reads), I removed the flags.

picard RevertSam I=dups.sam O=dups_revsam.sam REMOVE_DUPLICATE_INFORMATION=true

Once again, samtools depth runs for a minute or two and then ends with a blank file and no errors.

What am I missing? Is there an easier way to get counts for duplicates? Any help will be greatly appreciated.

Note: This is my first post here, so please let me know if I have deviated from standard procedures. Thank you!

RNA-Seq • 1.9k views

ADD COMMENT • link 5.9 years ago by axolotl • 0