7.2 years ago by
University Park, USA
There are a few ways to go about it. There are tools that
- look for exact matches via an associative array (hash, dictionary): for example the
fastx_collapser in the fastx toolkit.
- look for exact matches by sorting the sequences and removing consecutive identical sequences; for that you could use a combination of command line tools such as
- look for reads that align over the same region; for this to work the data needs to be aligned against a reference genome:
samtools rmdup works this way
- cluster the reads and merge reads that are very similar to one another using a tool like
Ideally, duplicates are best removed after alignment, but depending on the problem that may not be feasible.
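To make the two exact-match approaches from the list concrete, here is a minimal Python sketch. It is only an illustration of the ideas, not how fastx_collapser or any other tool is actually implemented; the function names `collapse_by_hash` and `collapse_by_sort` and the toy reads are made up for this example.

```python
from collections import Counter


def collapse_by_hash(seqs):
    """Count exact duplicates with a dictionary (hash) keyed on the sequence.

    Returns (sequence, count) pairs in the order each sequence was first seen.
    """
    counts = Counter()
    for seq in seqs:
        counts[seq] += 1
    return list(counts.items())


def collapse_by_sort(seqs):
    """Sort the sequences, then keep one copy of each run of identical neighbours."""
    unique = []
    for seq in sorted(seqs):
        # After sorting, duplicates are adjacent, so comparing with the
        # last kept sequence is enough to detect them.
        if not unique or unique[-1] != seq:
            unique.append(seq)
    return unique


# Hypothetical toy reads, for illustration only.
reads = ["ACGT", "TTGA", "ACGT", "GGCC", "TTGA", "ACGT"]

print(collapse_by_hash(reads))  # [('ACGT', 3), ('TTGA', 2), ('GGCC', 1)]
print(collapse_by_sort(reads))  # ['ACGT', 'GGCC', 'TTGA']
```

The dictionary version holds every distinct sequence in memory but keeps the counts, while the sort-based version only ever compares neighbours, which is why it maps well onto sorting tools that work on files larger than memory. Neither catches duplicates that differ by a sequencing error; that is what the alignment-based and clustering approaches are for.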
For more details search this site for "remove duplicates" to find good posts on various tools and techniques.