Question: Very high percentage of reads are PCR duplicates - Ion Torrent
Davy wrote:

Hi all, I have recently been given some targeted Ion Torrent sequencing data to play with. It's not a large amount of data, only ~18,000 unpaired reads. I aligned the reads with BWA, pretty much the same way I have always done with Illumina FASTQ files (about 80% aligned, which seemed a bit low, but whatever, I pushed on).
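
For context, a standard single-end bwa aln run of this kind looks roughly like the sketch below; the reference and FASTQ names are placeholders, not the exact command used:

# placeholder file names; bwa 0.6.x aln/samse workflow for single-end reads
bwa index ref.fa
bwa aln -t 4 ref.fa sample002.fastq > sample002.sai
bwa samse ref.fa sample002.sai sample002.fastq > sample002.sam
# older samtools sort takes an output prefix; newer versions use -o instead
samtools view -bS sample002.sam | samtools sort - sample002.s
samtools index sample002.s.bam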

I then went on to mark the PCR duplicates with Picard. After looking at the metrics file and running flagstat on the resulting BAM file, a large portion (>70%) of the reads are marked as duplicates. This doesn't seem quite right to me, and I was wondering whether anyone has come across this before or has any suggestions as to what to do next. (Surely I can't use the data after removing over 70% of it, can I?)

Here is the output of flagstat before and after marking the duplicates:

>samtools flagstat sample002.s.bam
17795 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14258 + 0 mapped (80.12%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

After marking with Picard

java -Xmx8g -jar MarkDuplicates.jar I=sample002.s.bam O=sample002.ds.bam M=./metrics/sample002.markdups_metrics.txt AS=true VALIDATION_STRINGENCY=LENIENT

>samtools flagstat sample002.ds.bam
17795 + 0 in total (QC-passed reads + QC-failed reads)
13064 + 0 duplicates
14258 + 0 mapped (80.12%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
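
(For scale, that is 13,064 / 14,258 ≈ 92% of the mapped reads flagged as duplicates, or 13,064 / 17,795 ≈ 73% of all reads.)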

Cheers, Davy

written 6.0 years ago by Davy

I haven't used Ion Torrent myself, but I would be curious to see what the FastQC report looks like (data quality). The 80% mapping rate may be down to poor-quality data, and the reads may need some trimming to map better, although BWA-MEM soft-clips unalignable read ends automatically. Which BWA algorithm did you use?

written 6.0 years ago by rob234king

I used the standard bwa aln, version 0.6.2. The FastQC reports showed the tails of the reads to be of quite low quality, so I will try BWA-MEM to see if the alignment improves. Cheers.
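
A minimal sketch of that BWA-MEM run, assuming bwa 0.7+ and placeholder file names:

# bwa mem soft-clips unalignable read ends, so untrimmed reads are fine as input
bwa mem -t 4 ref.fa sample002.fastq > sample002.mem.sam
samtools view -bS sample002.mem.sam | samtools sort - sample002.mem.s
samtools flagstat sample002.mem.s.bam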

written 6.0 years ago by Davy
arno.guille wrote:

Your library has probably been constructed with AmpliSeq, which is a PCR-based amplicon design. In other words, your result is normal. Don't remove the duplicates with Picard in this case.

written 6.0 years ago by arno.guille

Can you explain why I wouldn't need to remove the duplicates? If there is an error early in the PCR cycles, won't it propagate and cause spurious SNP calls, in addition to artificially inflating the read depth?

written 6.0 years ago by Davy

With Ion Torrent, and especially with targeted sequencing, it's normal to have a lot of duplicates. That's why the mark-duplicates step removes so many reads; if you run it, you will lose too many true SNPs and indels. By contrast, with whole-exome sequencing you expect very few duplicates, and in that case it is appropriate to remove them. For the alignment I suggest you use bwasw, which is designed for longer reads.
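
For reference, bwasw is a single command in the same bwa binary (placeholder file names again):

bwa bwasw -t 4 ref.fa sample002.fastq > sample002.bwasw.sam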

written 6.0 years ago by arno.guille

I agree that the high number of "PCR duplicates" is probably normal if you have high coverage over a small region (just compute the probability of two reads starting and ending at the same exact position...). The decision to keep or remove them is hard and depends on the experimental design. Keeping them can cause false-positive SNPs in the case you describe (an early PCR error; I have observed one such instance). If your coverage is high and you have individual data (not pooled), I don't think removing them should cause a loss of SNPs, but I am not 100% sure. The best thing would be to check...
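
As a rough illustration of why an amplicon panel behaves this way (the panel size below is an assumed number, purely for illustration): for single-end reads, Picard flags reads that share the same unclipped 5' start and strand, and in an amplicon design every read from a given amplicon and strand starts at essentially the same base, so:

~200 amplicons x 2 strands        -> ~400 possible start positions
14,258 mapped reads / ~400 starts -> ~36 reads per start position
Picard keeps 1 read per start     -> ~35/36 ≈ 97% flagged as duplicates

which is in the same ballpark as the ~92% of mapped reads flagged here.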

written 6.0 years ago by Fabio Marroni

Thanks Fabio and Arno. I will continue to seek opinions, but this does make sense to me, so I will continue on with the pipeline for now. Cheers!

written 6.0 years ago by Davy