Question

Remove duplicates from exome sequencing data

2

Entering edit mode

8.2 years ago

kulvait ▴ 270

Hi, I have some exome sequencing data of mouse. I am supposed to find differences between normal and tumor samples.

As part of my processing I run cutadapt quality and adaptor trimming. Then align using bwa mem. And then it would be nice to remove duplicates. But I have some problems with that:

*Since I used cutadapt to alter read sequences then deduplication algorithms may not work as expected since their input are trimmed sequences that might be trimmed differently for the same PCR artefact. I used -q 20 when running cutadapt. Shell I be concerned? Should I rather skip cutadapt when performing deduplication afterwards.

*There are numerous algorithms out there. I have read about samtools rmdup, samblaster or biobambam2. Which is the best and which one do you actually use and why? I do not like to use GATK tools due to the licensing model. I prefer algorithms released under GPL or less restrictive license.

Thanks, Vojtech.

variants sequencing exome-seq • 4.1k views

ADD COMMENT • link 8.2 years ago by kulvait ▴ 270

1

Entering edit mode

Deduplication is usually done by just looking at the 5' mapping position of the read (or read pair). This is incredibly simple, because otherwise it would get really complicated really quickly. If it was based on actual DNA sequences, you'd have to distinguish between multimapping reads and duplicates. You'd have to keep the DNA of the first read in memory whilst you find the mate. It would be really difficult to do, and then you have all these issues of adapter/quality trimming. Since quality usually starts high and ends low, that's why the 5' end is used rather than the 5' and 3' end. You can cut off all the low-quality bases you like from the 3' end, and it shouldn't effect deduplication. You can cut off adapter sequences from the 5' end (or better, mark them as low-quality bases, as per the GATKBP) and so long as the quality is high and the adapters get cut the same way for all PCR duplicates, which they should. So basically, to answer your question, you probably don't have to worry -- but check on your data first, thats the only real way to know.
You shouldn't ever need to trim off low quality bases in the first place by the way. bwa-mem will take quality into account, map what it can, and softclip bases which did not map at the ends of the read. Leave the low-quality stuff in there, it will only improve the mapping.
I only have experience with Picard and samtools, and I use Picard since it does the XT tagging. I appreciate your concerns with GATK's licensing model - but lets get real, here is the licence:

...researchers at academic and non-profit organizations can access the tools and source code for free, while for-profit organizations are required to purchase a license. The revenue generated by this model is then used to fund and build out our support team and infrastructure to accommodate the demand for support in the community, as well as invest more resources to improve development speed, functionality and stability overall.

There aren't many options for receiving funding for software projects. You have grants, which are the one-night-stand of software funding, and charity, which is great but hardly reliable. A well-crafted licence targeted at commercial players is really the only tool a developer has. In my humble opinion, we should on principle be supporting such licences.

ADD REPLY • link 8.2 years ago by John 13k

1

Entering edit mode

I don't think the licensing model of GATK is a good option. I think universities and big edu organizations and companies should take care about the software projects financing (like Google or even Microsoft does). If I'd like to fork samtools to try some hacks then I can easily in GIThub but it is not the case with GATK. Thus at this point I have to praise (and use) projects that use MIT (or at least GPL) license over those with more restrictive licensing. I don't like the licensing model of GATK and I don't like to use them.

I also don't share the view I can let raw FASTQ as is and let the aligner take care about all. Unfortunately adapter sequences and low quality reads can be matching actual sequence and can get mapped. It might further affect complex variant discovery.

If you read through samtools rmdup they state

 Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).

I did not go through the actual code but "external coordinates" in my opinion should mean both ends of the two reads.

Thus I am now really considering to do mapping then samtools rmdup then bam to fastq conversion then quality and adapter trimming and then second mapping of these data.

Vojtech.

Now

ADD REPLY • link 8.2 years ago by kulvait ▴ 270

0

Entering edit mode

I agree that big organizations and companies should take care about the software projects they use -- but they don't do that freely - they have to be forced, with a licence. I'm not talking about pocket money for hosting, I mean paying people a living wage to develop high-quality software.

Whilst there is some OpenSource activity in the Bioinformatic space it's nothing like it is for the general tech industry. Even big general-purpose scientific projects like Numpy have a hard time getting developers. I totally appreciate your ideals and I support them where I can - but pragmatically, and i'm going to be quite blunt here, if software never gets developed, and Biology never gets discovered (or its discovery is delayed), people die from diseases that could have been prevented. Bottom line. We can argue about the pros and cons of permissive licences all day, but right now the funding structure for Bioinformatics is terrible. In fact, it's probably the root cause of your question in the first place - how long do you think it would take, with all the money and expertise Google has, to solve the problem of marking duplicates in a BAM file? Then write some good documentation, user-interfaces, etc etc, so people know exactly what its doing.

Anyway, regarding samtools rmdup, it looks like it does the same thing MarkDuplicates does, except it uses ISIZE (which i guess is TLEN? maybe TLEN without adapters...) to find the mate end. The "external coordinates" here is for the read pair, not each read - so it probably checks the TLEN, if it's negative, skip the read, if it's positive, add the POS and the POS + TLEN to a list. Then duplicates are marked if the same 5'-3' positions are found more than once. Since quality deteriorates as you read into the template, your quality trimming should also not be effected by this calculation.

But like yourself, I don't know for sure because I haven't/can't read the code, and whomever's job it was to write the samtools rmdup userguide that day had other more important stuff to do, like apply for grant funding, or work on any of the 101 other things on their to-do list with the limited staff they have. Upvote for a good debate though. There's certainly something to admire in your dedication to opensource licences :)

ADD REPLY • link 8.2 years ago by John 13k

0

Entering edit mode

Actually, even BWA developers suggest removing duplicates. See bwakit: https://github.com/lh3/bwa/tree/master/bwakit

ADD REPLY • link 8.2 years ago by igor 13k

1

Entering edit mode

Of course :) I'm not suggesting you don't remove duplicates. I'm suggesting you shouldn't trim out low-quality bases before mapping - the mapper knows what to do with them.

EDIT: and that asking big companies to pay for a licence isn't a bad thing

ADD REPLY • link 8.2 years ago by John 13k

score 2 · Answer 1 · 2016-05-03

2

Entering edit mode

8.2 years ago

kulvait ▴ 270

I am just pointing out my discussion with samtools developers regarding samtools rmdup which is considered deprecated and they suggest using biobambam or picard markduplicates. https://github.com/samtools/samtools/issues/500

ADD COMMENT • link 8.2 years ago by kulvait ▴ 270

2

Entering edit mode

This is some free as in beer advice Kulvait, but I really think you're being a little too strong on the licence issue things. You said to the htslib guys "You probably did not get the whole idea of github and writing code for community" which are quite strong words to write to another developer who's spending their free time writing code for the community. Their response about the deprecated rmdup issue was, and I quote James Bonfield "We simply haven't had the time to reimplement the better techniques offered in other tools".

That should be an indicator of how bioinformatic software development goes. It doesn't play by the same dynamics as large general-purpose projects like web-browsers, email clients, linux kernels, etc, where you have hundreds or even tens of contributors. Put simply, if you want to do the best possible research, you're going to have to put up with things you don't like. Bad documentation. Show-stopping bugs. Inefficient algorithms. Restrictive licences. But it's ok if the results are ok right? Never go full FSF.

ADD REPLY • link 8.2 years ago by John 13k

0

Entering edit mode

You are absolutely right, sometimes when something does not work as expected I overreact. I really appreciate samtools developers work. But hopefully that piece of code will be even better after my comments.

On the other hand I disagree with your opinion bioinformatics projects are something special and should be treated differently than linux kernel development in terms of tolerating errors. I really think that software development for life sciences is something that should be in public domain since it deals with our pure essence as human beings. And anyway these projects gets funging from public a lot. When you are telling "do not blame bio software for its faults" it is the same as saying "do not write bad reviews and be happy with any bio research just to exist". I think in the long run there should be switch from bioinformaticans who write scripts to support their research that eventually become used software tools to the software engineers and professionals who develop code for life science with a intention for this code doing its stuff right not necessarily being the best fit for particular data set. Because I don't think (bio) research is something better than writing (bio) software.

ADD REPLY • link 8.2 years ago by kulvait ▴ 270

score 0 · Answer 2 · 2016-04-30

0

Entering edit mode

8.2 years ago

chen ★ 2.5k

if cutadapt only removes adapters, it doesn't change the real DNA template, so it will not affect deduplication process.

If you do some trimming more than adapter, you can use same trim-length for the front and tail, then the trimmed dupes will be also aligned.

ADD COMMENT • link 8.2 years ago by chen ★ 2.5k