High number of duplicates and low percentage properly paired
0
0
Entering edit mode
11 months ago
Sam • 0

I have some paired end sequencing data that I have trimmed using cutadapt. It was sequenced on an illumina novaseq 6000 and is low coverage RADseq data (2-3x). My cutadapt script used forward and reverse adapters from illumina :

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o cutadapt_${c1}_RA_${c2}.fq.gz -p cutadapt_${c1}_RB_${c2}.fq.gz ${c1}_RA_${c2}.fastq ${c1}_RB_${c2}.fastq

I got these general illumnia adapters from another post but not sure they are what I need.

I use the trimmed files with bwa mem and samtools to remove duplicates and align to my reference genome. However, many times the number of unique reads are removed as duplicates and the number of properly paired reads is low (40-75%). Here is a samtools flagstat output before and after removing duplicates.

666269 + 0 in total (QC-passed reads + QC-failed reads)
643462 + 0 primary
0 + 0 secondary
22807 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
651549 + 0 mapped (97.79% : N/A)
628742 + 0 primary mapped (97.71% : N/A)
643462 + 0 paired in sequencing
321731 + 0 read1
321731 + 0 read2
564166 + 0 properly paired (87.68% : N/A)
623336 + 0 with itself and mate mapped
5406 + 0 singletons (0.84% : N/A)
50048 + 0 with mate mapped to a different chr
21605 + 0 with mate mapped to a different chr (mapQ>=5)

ZZZZZZZZ markdup remove duplicates ZZZZZZZZZZZZ
103847 + 0 in total (QC-passed reads + QC-failed reads)
98158 + 0 primary
0 + 0 secondary
5689 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
93711 + 0 mapped (90.24% : N/A)
88022 + 0 primary mapped (89.67% : N/A)
98158 + 0 paired in sequencing
49079 + 0 read1
49079 + 0 read2
69934 + 0 properly paired (71.25% : N/A)
87200 + 0 with itself and mate mapped
822 + 0 singletons (0.84% : N/A)
15512 + 0 with mate mapped to a different chr
5276 + 0 with mate mapped to a different chr (mapQ>=5)

Any suggestions or thoughts on how to improve the pairing or how I ended up with so many duplicates would be greatly appreciated. I am trying to troubleshoot to figure out if this is a library prep, sequencing, or bioinformatics issue. Hoping it's the later!

alignment sequencing samtools • 437 views
ADD COMMENT
0
Entering edit mode

Can you try stacks (LINK) which is a pipeline meant for RADseq data analysis.

ADD REPLY

Login before adding your answer.

Traffic: 1686 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6