Question

High number of duplicates and low percentage properly paired

0

Entering edit mode

2.2 years ago

Sam • 0

I have some paired end sequencing data that I have trimmed using cutadapt. It was sequenced on an illumina novaseq 6000 and is low coverage RADseq data (2-3x). My cutadapt script used forward and reverse adapters from illumina :

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o cutadapt_${c1}_RA_${c2}.fq.gz -p cutadapt_${c1}_RB_${c2}.fq.gz ${c1}_RA_${c2}.fastq ${c1}_RB_${c2}.fastq

I got these general illumnia adapters from another post but not sure they are what I need.

I use the trimmed files with bwa mem and samtools to remove duplicates and align to my reference genome. However, many times the number of unique reads are removed as duplicates and the number of properly paired reads is low (40-75%). Here is a samtools flagstat output before and after removing duplicates.

666269 + 0 in total (QC-passed reads + QC-failed reads)
643462 + 0 primary
0 + 0 secondary
22807 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
651549 + 0 mapped (97.79% : N/A)
628742 + 0 primary mapped (97.71% : N/A)
643462 + 0 paired in sequencing
321731 + 0 read1
321731 + 0 read2
564166 + 0 properly paired (87.68% : N/A)
623336 + 0 with itself and mate mapped
5406 + 0 singletons (0.84% : N/A)
50048 + 0 with mate mapped to a different chr
21605 + 0 with mate mapped to a different chr (mapQ>=5)

ZZZZZZZZ markdup remove duplicates ZZZZZZZZZZZZ
103847 + 0 in total (QC-passed reads + QC-failed reads)
98158 + 0 primary
0 + 0 secondary
5689 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
93711 + 0 mapped (90.24% : N/A)
88022 + 0 primary mapped (89.67% : N/A)
98158 + 0 paired in sequencing
49079 + 0 read1
49079 + 0 read2
69934 + 0 properly paired (71.25% : N/A)
87200 + 0 with itself and mate mapped
822 + 0 singletons (0.84% : N/A)
15512 + 0 with mate mapped to a different chr
5276 + 0 with mate mapped to a different chr (mapQ>=5)

Any suggestions or thoughts on how to improve the pairing or how I ended up with so many duplicates would be greatly appreciated. I am trying to troubleshoot to figure out if this is a library prep, sequencing, or bioinformatics issue. Hoping it's the later!

alignment sequencing samtools • 694 views

ADD COMMENT • link updated 2.2 years ago by GenoMax 152k • written 2.2 years ago by Sam • 0

0

Entering edit mode

Can you try stacks (LINK) which is a pipeline meant for RADseq data analysis.

ADD REPLY • link 2.2 years ago by GenoMax 152k