Question

Umi dedup with fastp

9 months ago

samuel ▴ 240

Can anyone help?

I am trying to use fastp to deduplicate reads using the UMI. There is an explanation in "Tutorial:Use fastp to preprocess FASTQ data with unique molecular identifer (UMI) integrated", but it only gives an example where the UMI is at the head of read1.

What does the manual mean by 'the first/second index is used as UMI'? Could someone give an example of how to use the index1/index2 setting with --umi?

Additionally, does anyone know how to extract the UMI if it is in a separate 'index' file?

fastp • 970 views

'the first/second index is used as UMI'

That refers to using the Illumina indexes as the source of the UMI. Extracting the UMIs and deduplicating with them are going to be two separate operations.

Reply by GenoMax 141k, 9 months ago

Do you mean a separate file for the index read?

Reply by samuel ▴ 240, 9 months ago

No, I think this simply means that the UMI will be taken from the index sequence. I am not sure whether fastp takes the index sequence from the FASTQ header or requires a separate file with index reads. You should be able to test that easily with a small dataset.

Reply by GenoMax 141k, 9 months ago

score 0 · Answer 1 · 2023-07-24

9 months ago

i.sudbery 19k

I can't answer the first question, about dedup with fastp, but UMI-tools can add UMI sequences from one file to the read headers of another.

Ian, I have been using umi_tools, which is great. I have a request to dedup at the FASTQ level, so I have been looking around for options. I saw your post here: "De-duplicate UMI at FASTQ level". In your opinion, does it still hold true that we shouldn't be doing dedup at the FASTQ level with UMIs?

Reply by samuel ▴ 240, 9 months ago

It depends on what you are doing. There are times when it is okay and times when it is not. It depends on how many reads you are looking at and on the total number of available UMIs. If you have a 10-mer UMI, then there are about 1 million possible sequences (4^10 = 1,048,576). The statistics of the situation mean that you are okay doing UMI-only dedup if you have fewer than around 300,000 unique molecules. More than that and you are going to start getting collisions, where two different molecules carry the same UMI. You can use the sequence of the read itself to distinguish them, but only if you account for the possibility of sequencing errors.

However, irrespective of whether you use umi-tools to do the dedup, you can use umi-tools to extract the UMIs, which another tool can then use to do the dedup.
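To get a feel for the numbers, here is a rough Python sketch of the collision arithmetic. It assumes UMIs are drawn uniformly at random from all 4^L sequences and ignores sequencing errors, which real libraries will not satisfy exactly. The function names are made up for this illustration. It does not try to reproduce any particular threshold; it just lets you see how the expected loss grows with the number of molecules.

```python
def umi_space(length, alphabet_size=4):
    """Number of possible UMI sequences of a given length (A, C, G, T)."""
    return alphabet_size ** length


def expected_distinct_umis(n_molecules, length):
    """Expected number of distinct UMIs observed among n_molecules,
    assuming each molecule gets a UMI uniformly at random (an assumption)."""
    n_umis = umi_space(length)
    return n_umis * (1.0 - (1.0 - 1.0 / n_umis) ** n_molecules)


def expected_collision_fraction(n_molecules, length):
    """Expected fraction of molecules that UMI-only dedup would wrongly
    merge into another molecule because they share its UMI."""
    return 1.0 - expected_distinct_umis(n_molecules, length) / n_molecules


print("10-mer UMI space:", umi_space(10))
for n in (10_000, 100_000, 300_000, 1_000_000):
    frac = expected_collision_fraction(n, 10)
    print(f"{n:>9} molecules: ~{frac:.1%} expected to collide")
```

Changing the UMI length in the calls shows how quickly a longer UMI pushes the problem away.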

My reading of the fastp tutorial that you link is that fastp doesn't dedup the reads, but rather just moves the UMI from the head of read1 or from the index read to the end of the read name.

umi_tools extract does exactly this but, as far as I can tell, is more flexible.

Either way, you will still need a separate tool to do the actual deduplication.
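For anyone unsure what "moving the UMI to the read name" means in practice, here is a minimal plain-Python sketch of the operation for the case where the UMI sits in a separate index FASTQ. It is not how fastp or umi_tools implement it. The function names and the "_" separator are assumptions for illustration, and it assumes both files list the same reads in the same order.

```python
from itertools import islice


def fastq_records(lines):
    """Yield (name, seq, plus, qual) tuples from an iterable of FASTQ lines."""
    it = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def add_umi_to_names(read_lines, index_lines, sep="_"):
    """Append each index-read sequence (taken as the UMI) to the ID of the
    matching read. The separator is an assumption, not a tool default."""
    pairs = zip(fastq_records(read_lines), fastq_records(index_lines))
    for (name, seq, plus, qual), (idx_name, umi, _, _) in pairs:
        read_id, _, comment = name.partition(" ")
        if read_id != idx_name.partition(" ")[0]:
            raise ValueError(f"read/index mismatch: {name} vs {idx_name}")
        new_name = read_id + sep + umi
        if comment:
            new_name += " " + comment
        yield (new_name, seq, plus, qual)


# Tiny made-up example: two reads and their index reads.
reads = ["@r1 1:N:0:1", "ACGTACGT", "+", "IIIIIIII",
         "@r2 1:N:0:1", "TTTTCCCC", "+", "IIIIIIII"]
index = ["@r1 2:N:0:1", "AACCGGTTAA", "+", "IIIIIIIIII",
         "@r2 2:N:0:1", "GGTTAACCGG", "+", "IIIIIIIIII"]

tagged = list(add_umi_to_names(reads, index))
for record in tagged:
    print("\n".join(record))
```

With real data you would pass open file handles instead of lists. Once the UMI is in the read name, it survives alignment, and a downstream tool can use it for the dedup step.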

Reply by i.sudbery 19k, 9 months ago