Tutorial: Use fastp to preprocess FASTQ data with integrated unique molecular identifiers (UMIs)
3.9 years ago
chen ★ 2.3k

fastp (https://github.com/OpenGene/fastp) is an open-source tool that provides fast, all-in-one preprocessing for FASTQ files. It is written in C++ and supports multithreading for high performance. Unique molecular identifier (UMI) preprocessing is one of its features.

UMIs are useful for removing duplicates and correcting errors by building a consensus from reads that originate from the same DNA fragment. They are typically used in deep sequencing applications such as ctDNA sequencing.

On Illumina platforms, UMIs are commonly placed in one of two locations: the index or the head of the read. To enable UMI processing, pass the -U or --umi option on the command line and set --umi_loc to the UMI location, which can be one of:

1. index1: the first index is used as the UMI. If the data is paired-end (PE), this UMI is used for both read1 and read2.
2. index2: the second index is used as the UMI. PE data only; this UMI is used for both read1 and read2.
3. read1: the head of read1 is used as the UMI. If the data is PE, this UMI is used for both read1 and read2.
4. read2: the head of read2 is used as the UMI. PE data only; this UMI is used for both read1 and read2.
5. per_index: read1 uses the UMI extracted from index1, and read2 uses the UMI extracted from index2.
6. per_read: read1 uses the UMI extracted from the head of read1, and read2 uses the UMI extracted from the head of read2.

If --umi_loc is read1, read2 or per_read, the UMI length must be specified with --umi_len.
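The option rules above can be captured in a small helper that assembles the fastp argument list. This is only a sketch: the function name `fastp_umi_args` and its parameters are hypothetical, and it uses only the flags mentioned in this tutorial plus fastp's -I/-O options for the second paired-end input and output file.

```python
# Hypothetical helper (not part of fastp): builds a fastp command line for UMI processing.
UMI_LOCS = {"index1", "index2", "read1", "read2", "per_index", "per_read"}
NEEDS_LEN = {"read1", "read2", "per_read"}  # UMI sits in the read, so --umi_len is required
PE_ONLY = {"index2", "read2"}               # listed above as paired-end only


def fastp_umi_args(in1, out1, umi_loc, umi_len=None, umi_prefix=None,
                   in2=None, out2=None):
    """Return the fastp command as a list of arguments, validating the UMI options."""
    if umi_loc not in UMI_LOCS:
        raise ValueError(f"unknown --umi_loc value: {umi_loc}")
    if umi_loc in NEEDS_LEN and not umi_len:
        raise ValueError(f"--umi_len is required when --umi_loc={umi_loc}")
    if (in2 is None) != (out2 is None):
        raise ValueError("in2 and out2 must be given together")
    if umi_loc in PE_ONLY and in2 is None:
        raise ValueError(f"--umi_loc={umi_loc} needs paired-end input")

    args = ["fastp", "-i", in1, "-o", out1]
    if in2 is not None:
        args += ["-I", in2, "-O", out2]
    args += ["-U", f"--umi_loc={umi_loc}"]
    if umi_len:
        args.append(f"--umi_len={umi_len}")
    if umi_prefix:
        args.append(f"--umi_prefix={umi_prefix}")
    return args


print(" ".join(fastp_umi_args("testdata/R1.fq", "testdata/out.R1.fq",
                              "read1", umi_len=8)))
```

If fastp is on your PATH, the returned list can be handed to `subprocess.run(args, check=True)`.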

fastp extracts the UMIs and appends them to the first part of the read names, so the UMIs also appear in SAM/BAM records. If the UMI is in the read, it is trimmed off, so the read becomes shorter. If the UMI is in the index, it is left in place.

A prefix can be specified with --umi_prefix. If a prefix is specified, an underscore is used to join it to the UMI. For example, with UMI=AATTCCGG and prefix=UMI, the final string in the read name is UMI_AATTCCGG.
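To make the renaming and trimming concrete, here is a rough Python imitation of what the text above describes for --umi_loc=read1. It is not fastp's implementation: the function `extract_umi_read1` is hypothetical, the short sequence and quality strings are made up for illustration, and joining the prefixed UMI to the name with a colon is an assumption carried over from the unprefixed example in this post.

```python
def extract_umi_read1(name, seq, qual, umi_len, prefix=""):
    """Imitate the read1 UMI handling described above for one FASTQ record."""
    umi = seq[:umi_len]
    # With a prefix, the prefix and UMI are joined by an underscore.
    tag = f"{prefix}_{umi}" if prefix else umi
    # Append ":<tag>" to the first whitespace-delimited part of the read name.
    first, sep, rest = name.partition(" ")
    new_name = f"{first}:{tag}{sep}{rest}"
    # The UMI bases and their quality values are removed from the read.
    return new_name, seq[umi_len:], qual[umi_len:]


# Made-up short record; only the read name comes from the example in this post.
name, seq, qual = extract_umi_read1(
    "@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA",
    "AATTCCGGGCTACTTGGAGTACC",
    "6AAAAAEEEEE/E/EA/E/AEA6",
    umi_len=8,
    prefix="UMI",
)
print(name)
print(seq)
print(qual)
```

The printed name ends its first field with :UMI_AATTCCGG, and the sequence and quality strings are each 8 characters shorter.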

# UMI example

@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA


The same read after processing with the command: fastp -i testdata/R1.fq -o testdata/out.R1.fq -U --umi_loc=read1 --umi_len=8

@NS500713:64:HFKJJBGXY:1:11101:1675:1101:AAAAAAAA 1:N:0:TATAGCCT+GACCCCCA
GCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
EEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA


You can see that AAAAAAAA has been trimmed from the read, along with its 8 quality values, and the UMI label :AAAAAAAA has been added to the read name.
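Because the UMI ends up as the last colon-separated field of the first part of the read name, downstream scripts can recover it from a FASTQ header or a SAM/BAM read name. The sketch below shows one way to do that and to bucket reads by UMI. The function names are hypothetical and the second and third read names are invented for illustration. Real deduplication tools usually also take the mapping position into account, which this sketch does not.

```python
from collections import defaultdict


def umi_from_name(name, prefix=""):
    """Return the UMI stored at the end of the first field of a read name."""
    first = name.split()[0]           # drop the FASTQ comment, if any
    tag = first.rsplit(":", 1)[1]     # last colon-separated field
    if prefix:
        expected = prefix + "_"
        if not tag.startswith(expected):
            raise ValueError(f"read name lacks the '{expected}' UMI prefix: {name}")
        tag = tag[len(expected):]
    return tag


def group_by_umi(names, prefix=""):
    """Bucket read names by their UMI string."""
    groups = defaultdict(list)
    for n in names:
        groups[umi_from_name(n, prefix)].append(n)
    return dict(groups)


names = [
    "@NS500713:64:HFKJJBGXY:1:11101:1675:1101:AAAAAAAA 1:N:0:TATAGCCT+GACCCCCA",
    "@NS500713:64:HFKJJBGXY:1:11101:2000:1200:AAAAAAAA 1:N:0:TATAGCCT+GACCCCCA",  # invented
    "@NS500713:64:HFKJJBGXY:1:11101:3000:1300:CCGGTTAA 1:N:0:TATAGCCT+GACCCCCA",  # invented
]
for umi, members in group_by_umi(names).items():
    print(umi, len(members))
```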


Nice!

Did you add a cross-reference to this tutorial in the fastp tool thread?
