Hi, I have some exome sequencing data of mouse. I am supposed to find differences between normal and tumor samples.
As part of my processing I run cutadapt quality and adaptor trimming. Then align using bwa mem. And then it would be nice to remove duplicates. But I have some problems with that:
*Since I used cutadapt to alter read sequences then deduplication algorithms may not work as expected since their input are trimmed sequences that might be trimmed differently for the same PCR artefact. I used -q 20 when running cutadapt. Shell I be concerned? Should I rather skip cutadapt when performing deduplication afterwards.
*There are numerous algorithms out there. I have read about samtools rmdup, samblaster or biobambam2. Which is the best and which one do you actually use and why? I do not like to use GATK tools due to the licensing model. I prefer algorithms released under GPL or less restrictive license.
Thanks, Vojtech.
Deduplication is usually done by just looking at the 5' mapping position of the read (or read pair). This is incredibly simple, because otherwise it would get really complicated really quickly. If it was based on actual DNA sequences, you'd have to distinguish between multimapping reads and duplicates. You'd have to keep the DNA of the first read in memory whilst you find the mate. It would be really difficult to do, and then you have all these issues of adapter/quality trimming. Since quality usually starts high and ends low, that's why the 5' end is used rather than the 5' and 3' end. You can cut off all the low-quality bases you like from the 3' end, and it shouldn't effect deduplication. You can cut off adapter sequences from the 5' end (or better, mark them as low-quality bases, as per the GATKBP) and so long as the quality is high and the adapters get cut the same way for all PCR duplicates, which they should. So basically, to answer your question, you probably don't have to worry -- but check on your data first, thats the only real way to know.
You shouldn't ever need to trim off low quality bases in the first place by the way. bwa-mem will take quality into account, map what it can, and softclip bases which did not map at the ends of the read. Leave the low-quality stuff in there, it will only improve the mapping.
I only have experience with Picard and samtools, and I use Picard since it does the XT tagging. I appreciate your concerns with GATK's licensing model - but lets get real, here is the licence:
There aren't many options for receiving funding for software projects. You have grants, which are the one-night-stand of software funding, and charity, which is great but hardly reliable. A well-crafted licence targeted at commercial players is really the only tool a developer has. In my humble opinion, we should on principle be supporting such licences.
I don't think the licensing model of GATK is a good option. I think universities and big edu organizations and companies should take care about the software projects financing (like Google or even Microsoft does). If I'd like to fork samtools to try some hacks then I can easily in GIThub but it is not the case with GATK. Thus at this point I have to praise (and use) projects that use MIT (or at least GPL) license over those with more restrictive licensing. I don't like the licensing model of GATK and I don't like to use them.
I also don't share the view I can let raw FASTQ as is and let the aligner take care about all. Unfortunately adapter sequences and low quality reads can be matching actual sequence and can get mapped. It might further affect complex variant discovery.
If you read through samtools rmdup they state
I did not go through the actual code but "external coordinates" in my opinion should mean both ends of the two reads.
Thus I am now really considering to do mapping then samtools rmdup then bam to fastq conversion then quality and adapter trimming and then second mapping of these data.
Vojtech.
Now
I agree that big organizations and companies should take care about the software projects they use -- but they don't do that freely - they have to be forced, with a licence. I'm not talking about pocket money for hosting, I mean paying people a living wage to develop high-quality software.
Whilst there is some OpenSource activity in the Bioinformatic space it's nothing like it is for the general tech industry. Even big general-purpose scientific projects like Numpy have a hard time getting developers. I totally appreciate your ideals and I support them where I can - but pragmatically, and i'm going to be quite blunt here, if software never gets developed, and Biology never gets discovered (or its discovery is delayed), people die from diseases that could have been prevented. Bottom line. We can argue about the pros and cons of permissive licences all day, but right now the funding structure for Bioinformatics is terrible. In fact, it's probably the root cause of your question in the first place - how long do you think it would take, with all the money and expertise Google has, to solve the problem of marking duplicates in a BAM file? Then write some good documentation, user-interfaces, etc etc, so people know exactly what its doing.
Anyway, regarding samtools rmdup, it looks like it does the same thing MarkDuplicates does, except it uses ISIZE (which i guess is TLEN? maybe TLEN without adapters...) to find the mate end. The "external coordinates" here is for the read pair, not each read - so it probably checks the TLEN, if it's negative, skip the read, if it's positive, add the POS and the POS + TLEN to a list. Then duplicates are marked if the same 5'-3' positions are found more than once. Since quality deteriorates as you read into the template, your quality trimming should also not be effected by this calculation.
But like yourself, I don't know for sure because I haven't/can't read the code, and whomever's job it was to write the samtools rmdup userguide that day had other more important stuff to do, like apply for grant funding, or work on any of the 101 other things on their to-do list with the limited staff they have. Upvote for a good debate though. There's certainly something to admire in your dedication to opensource licences :)
Actually, even BWA developers suggest removing duplicates. See bwakit: https://github.com/lh3/bwa/tree/master/bwakit
Of course :) I'm not suggesting you don't remove duplicates. I'm suggesting you shouldn't trim out low-quality bases before mapping - the mapper knows what to do with them.
EDIT: and that asking big companies to pay for a licence isn't a bad thing