fastp (https://github.com/OpenGene/fastp) is an open-source tool designed to provide fast, all-in-one preprocessing for FASTQ files. It is developed in C++ with multithreading support for high performance. Unique molecular identifier (UMI) preprocessing is one of its features.
UMIs are useful for duplicate elimination and error correction, both of which rely on generating a consensus from reads that originate from the same DNA fragment. They are typically used in deep sequencing applications such as ctDNA sequencing.
On Illumina platforms, UMIs are commonly integrated in one of two places: the index or the head of the read. To enable UMI processing, pass the --umi option on the command line and use --umi_loc to specify the UMI location, which can be one of:
index1: the first index is used as the UMI. If the data is PE, this UMI will be used for both read1 and read2.
index2: the second index is used as the UMI. PE data only; this UMI will be used for both read1 and read2.
read1: the head of read1 is used as the UMI. If the data is PE, this UMI will be used for both read1 and read2.
read2: the head of read2 is used as the UMI. PE data only; this UMI will be used for both read1 and read2.
per_index: read1 will use the UMI extracted from index1, and read2 will use the UMI extracted from index2.
per_read: read1 will use the UMI extracted from the head of read1, and read2 will use the UMI extracted from the head of read2.
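The list above can be read as a mapping from the --umi_loc value to the UMI each mate receives. The following is a minimal Python sketch of that mapping exactly as the list states it. It is not fastp's implementation, and the function name `assign_umis` and its parameters are hypothetical. It assumes the index sequences and the read heads have already been extracted; actual behavior may differ between fastp versions.

```python
def assign_umis(umi_loc, index1="", index2="", head1="", head2=""):
    """Return (UMI for read1, UMI for read2) for a given --umi_loc value.

    index1/index2 are the index sequences; head1/head2 are the leading
    bases of read1/read2. All four are assumed to be extracted already.
    """
    shared = {"index1": index1, "index2": index2, "read1": head1, "read2": head2}
    if umi_loc in shared:
        # A single UMI source is shared by both mates.
        return shared[umi_loc], shared[umi_loc]
    if umi_loc == "per_index":
        return index1, index2
    if umi_loc == "per_read":
        return head1, head2
    raise ValueError(f"unknown umi_loc: {umi_loc}")


shared_pair = assign_umis("index1", index1="TATAGCCT", index2="GACCCCCA")
per_index_pair = assign_umis("per_index", index1="TATAGCCT", index2="GACCCCCA")
print(shared_pair)
print(per_index_pair)
```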
If --umi_loc is specified as read1, read2, or per_read, the length of the UMI should be specified with --umi_len.
fastp extracts the UMIs and appends them to the first part of the read names, so the UMIs will also be present in SAM/BAM records. If the UMI is in the reads, it is trimmed from the read, so the read becomes shorter. If the UMI is in the index, the index is kept as is.
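For the index case, the sketch below shows one way the UMI could be taken from the read name while the index itself stays in place. It assumes an Illumina-style header in which the index is the last colon-separated field and dual indexes are joined with `+`. The helper names `split_indexes` and `tag_name_with_umi` are hypothetical and are not part of fastp.

```python
def split_indexes(header):
    """Split the index field of an Illumina-style FASTQ header.

    Assumes the index is the last colon-separated field and that dual
    indexes are joined with '+', as in '... 1:N:0:TATAGCCT+GACCCCCA'.
    """
    index_field = header.rsplit(":", 1)[-1]
    index1, _, index2 = index_field.partition("+")
    return index1, index2


def tag_name_with_umi(header, umi):
    """Append ':<umi>' to the first whitespace-delimited part of the name."""
    name, sep, comment = header.partition(" ")
    return f"{name}:{umi}{sep}{comment}"


header = "@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA"
index1, index2 = split_indexes(header)
# The index stays in the comment field; only the name gains the UMI.
tagged = tag_name_with_umi(header, index1)
print(tagged)
```

Because the UMI lands in the first part of the name, tools that keep only that part when writing SAM/BAM records still carry it through.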
A prefix can be specified with --umi_prefix. If a prefix is specified, an underscore is used to join it to the UMI. For example, with prefix=UMI, the final string presented in the name will be UMI_ followed by the UMI sequence (UMI_AAAAAAAA for the UMI AAAAAAAA).
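The prefix rule can be sketched in a few lines of Python. The function name `umi_label` is hypothetical and only illustrates the joining rule described above: prefix, underscore, UMI.

```python
def umi_label(umi, prefix=""):
    """Join an optional prefix and the UMI with an underscore."""
    return f"{prefix}_{umi}" if prefix else umi


with_prefix = umi_label("AAAAAAAA", prefix="UMI")
without_prefix = umi_label("AAAAAAAA")
print(with_prefix)
print(without_prefix)
```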
The original read:
@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
The same read after processing with the command:
fastp -i testdata/R1.fq -o testdata/out.R1.fq -U --umi_loc=read1 --umi_len=8
@NS500713:64:HFKJJBGXY:1:11101:1675:1101:AAAAAAAA 1:N:0:TATAGCCT+GACCCCCA
GCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
EEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
Note that AAAAAAAA has been trimmed from the read (along with its quality scores), and the UMI label :AAAAAAAA has been added to the read name.
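The transformation in this example can be reproduced with a short Python sketch. It is an illustration of the read1 case with a UMI length of 8, not fastp's actual code. The function name `extract_umi_from_read` and the tuple layout of a record are assumptions made for this example.

```python
def extract_umi_from_read(record, umi_len):
    """Move the first umi_len bases of a read into its name.

    record is a (header, sequence, plus, quality) tuple for one FASTQ read.
    """
    header, seq, plus, qual = record
    umi = seq[:umi_len]
    name, sep, comment = header.partition(" ")
    # Trim the UMI bases and their quality scores together so the
    # sequence and quality strings stay the same length.
    return (f"{name}:{umi}{sep}{comment}", seq[umi_len:], plus, qual[umi_len:])


original = (
    "@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA",
    "AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA",
    "+",
    "6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA",
)
processed = extract_umi_from_read(original, umi_len=8)
print("\n".join(processed))
```

The printed record should match the processed read shown above: the name gains `:AAAAAAAA`, and both the sequence and the quality string lose their first 8 characters.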