I have some paired end sequencing data that I have trimmed using cutadapt. It was sequenced on an illumina novaseq 6000 and is low coverage RADseq data (2-3x). My cutadapt script used forward and reverse adapters from illumina :
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o cutadapt_${c1}_RA_${c2}.fq.gz -p cutadapt_${c1}_RB_${c2}.fq.gz ${c1}_RA_${c2}.fastq ${c1}_RB_${c2}.fastq
I got these general illumnia adapters from another post but not sure they are what I need.
I use the trimmed files with bwa mem and samtools to remove duplicates and align to my reference genome. However, many times the number of unique reads are removed as duplicates and the number of properly paired reads is low (40-75%). Here is a samtools flagstat output before and after removing duplicates.
666269 + 0 in total (QC-passed reads + QC-failed reads)
643462 + 0 primary
0 + 0 secondary
22807 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
651549 + 0 mapped (97.79% : N/A)
628742 + 0 primary mapped (97.71% : N/A)
643462 + 0 paired in sequencing
321731 + 0 read1
321731 + 0 read2
564166 + 0 properly paired (87.68% : N/A)
623336 + 0 with itself and mate mapped
5406 + 0 singletons (0.84% : N/A)
50048 + 0 with mate mapped to a different chr
21605 + 0 with mate mapped to a different chr (mapQ>=5)
ZZZZZZZZ markdup remove duplicates ZZZZZZZZZZZZ
103847 + 0 in total (QC-passed reads + QC-failed reads)
98158 + 0 primary
0 + 0 secondary
5689 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
93711 + 0 mapped (90.24% : N/A)
88022 + 0 primary mapped (89.67% : N/A)
98158 + 0 paired in sequencing
49079 + 0 read1
49079 + 0 read2
69934 + 0 properly paired (71.25% : N/A)
87200 + 0 with itself and mate mapped
822 + 0 singletons (0.84% : N/A)
15512 + 0 with mate mapped to a different chr
5276 + 0 with mate mapped to a different chr (mapQ>=5)
Any suggestions or thoughts on how to improve the pairing or how I ended up with so many duplicates would be greatly appreciated. I am trying to troubleshoot to figure out if this is a library prep, sequencing, or bioinformatics issue. Hoping it's the later!
Can you try
stacks
(LINK) which is a pipeline meant for RADseq data analysis.