I have an RNA-seq dataset (HiSeq 3000, poly-A selected, 100 bp PE reads) with very high duplication, and I want to create coverage plots of the duplicates to see if they are evenly distributed or represent only a few genes. After aligning to the genome with HISAT2, I've flagged the duplicates with Picard and used Samtools depth to get the counts (of non-duplicates). I then plot them in R.
My problem is that I cannot get counts for just the duplicate reads. Here is what I've tried so far:
Extract duplicates from Picard MarkDuplicates output.
samtools view -h -f 0x400 input_markdups.bam > dups.sam
Calculate counts. Produces empty file with no errors.
samtools depth dups.sam > dups_depth.txt
Thinking that depth may be ignoring the duplicates (i.e. all reads), I removed the flags.
picard RevertSam I=dups.sam O=dups_revsam.sam REMOVE_DUPLICATE_INFORMATION=true
Once again, samtools depth runs for a minute or two and then ends with a blank file and no errors.
What am I missing? Is there an easier way to get counts for duplicates? Any help will be greatly appreciated.
Note: This is my first post here, so please let me know if I have deviated from standard procedures. Thank you!