umitools facilitates the processing of sequencing data in which each read carries a unique molecular identifier (UMI). It assumes the UMI is incorporated as part of the read sequence.
Given the UMI design as an IUPAC sequence, strip the UMI from the 5' end of each read in the fastq:
umitools trim --end 5 unprocessed_fastq.gz NNNNNV > out.fq
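To illustrate what trimming against an IUPAC design involves, here is a minimal Python sketch. It is not the umitools implementation: the helper names, the choice to let N match only A, C, G, or T, and the `:UMI_` read-name tag are all assumptions made for the example.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
# Assumption: N matches only A, C, G, or T (not an ambiguous base call).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def umi_regex(pattern):
    """Compile an IUPAC design such as NNNNNV into a regular expression."""
    return re.compile("".join(IUPAC[code] for code in pattern.upper()))


def trim_5prime(name, seq, qual, pattern):
    """Remove the UMI from the 5' end of one read.

    Returns (name, seq, qual) with the UMI appended to the read name, or
    None when the first bases of the read do not fit the design.
    The ':UMI_' tag is a hypothetical naming scheme for this sketch.
    """
    n = len(pattern)
    umi = seq[:n]
    if not umi_regex(pattern).fullmatch(umi):
        return None
    return f"{name}:UMI_{umi}", seq[n:], qual[n:]
```

With the design NNNNNV, a read starting with ACGTAG is trimmed and tagged, while a read starting with ACGTAT is rejected because V excludes T.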
The UMI sequence of each read is appended to the read name and used again after the reads are mapped. Reads with duplicate UMIs at any given start site need to be removed:
umitools rmdup unprocessed.bam out.bam > before_after.bed
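The idea behind duplicate removal can be sketched in a few lines of Python. This is an illustration only, not how umitools reads or writes BAM files: reads are represented as plain dictionaries with hypothetical `chrom`, `start`, and `umi` keys, and strand handling is left out.

```python
def remove_duplicates(reads):
    """Keep the first read seen for each (chrom, start, umi) combination.

    `reads` is an iterable of dicts with hypothetical keys 'chrom',
    'start' and 'umi'; a real tool would take these from BAM records.
    """
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["umi"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept
```

Two reads with the same UMI at the same start site collapse to one, while the same UMI at a different start site is kept as a separate molecule.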
I've updated this to account for mismatches among the set of UMI sequences observed at a given start site. This lets the user merge very similar UMIs into fewer representative sequences:
umitools rmdup --mismatches 1 unprocessed.bam out.bam > before_after.bed
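One way to picture merging similar UMIs is a greedy pass over the UMIs at a single start site, most abundant first, folding each UMI into the first representative within the allowed Hamming distance. The sketch below is an assumption about how such merging could work, not a description of the algorithm umitools uses; the function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def merge_umis(umis, mismatches=1):
    """Greedily merge the UMIs observed at one start site.

    Returns a dict mapping each representative UMI to the number of
    reads assigned to it. UMIs are visited from most to least abundant
    (ties broken alphabetically), so abundant UMIs become representatives.
    """
    counts = Counter(umis)
    representatives = {}
    for umi, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in representatives:
            if hamming(umi, rep) <= mismatches:
                representatives[rep] += n
                break
        else:
            # No representative is close enough, so this UMI starts a new group.
            representatives[umi] = n
    return representatives
```

For example, with one allowed mismatch, three reads tagged AAAAAC and one tagged AAAAAG end up under the single representative AAAAAC, while TTTTTC remains its own group.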