Question

FastQC / Picard MarkDuplicates

22 months ago

Bioinf • 0

Hello, can someone explain to me the difference between how FastQC and Picard MarkDuplicates identify duplicates? I got different duplication rates for the same samples.

fastqc duplicate • 591 views

updated 22 months ago by GenoMax 141k • written 22 months ago by Bioinf • 0

score 0 · Answer 1 · 2022-06-30

See: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level. To cut down on the amount of information in the final plot any sequences with more than 10 duplicates are placed into grouped bins to give a clear impression of the overall duplication level without having to show each individual duplication value.

Because the duplication detection requires an exact sequence match over the whole length of the sequence, any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Even so, longer reads are more likely to contain sequencing errors which will artificially increase the observed diversity and will tend to underrepresent highly duplicated sequences.
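The sampling and truncation rules quoted above can be illustrated with a rough Python sketch. This is not FastQC's actual code: the function name `estimate_duplication` and its parameters are hypothetical, and the returned value (the share of tracked reads that are repeats) is a simplification of the statistic FastQC reports.

```python
from collections import Counter


def estimate_duplication(seqs, track_limit=100_000, long_read=75, trunc_len=50):
    """Percentage of tracked reads that repeat an earlier sequence.

    Hypothetical helper that mimics the rules quoted from the FastQC help
    page; it is a simplification, not FastQC's implementation.
    """
    counts = Counter()
    for i, seq in enumerate(seqs):
        # Reads longer than 75 bp are compared on their first 50 bp only.
        key = seq[:trunc_len] if len(seq) > long_read else seq
        if key in counts:
            # Already tracked: keep counting to the end of the file.
            counts[key] += 1
        elif i < track_limit:
            # New sequences are only tracked if first seen early in the file.
            counts[key] = 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total


reads = ["ACGT", "ACGT", "TTTT", "ACGT"]
print(estimate_duplication(reads))
print(estimate_duplication(reads, track_limit=1))
```

With `track_limit=1` the sequence `TTTT` is never tracked, so the estimate changes even though the input is identical. Sampling therefore affects the reported rate.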

So FastQC duplicate detection is not looking at the entire dataset and should only be used for qualitative QC.

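Picard MarkDuplicates examines every aligned read in the BAM file and identifies duplicates by mapping position rather than by sequence identity, so its estimate should be the more accurate of the two. That difference in method is also why the two tools report different rates for the same samples.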