Question

UMI Tools Dedup

0

6.3 years ago

Hyper_Odin ▴ 320

I am using the dedup function in UMI-tools, but there still seems to be an issue with excessive memory usage while the output is being generated. Kindly let me know if there is an alternative tool or way to remove the duplicates and count the UMIs.

https://github.com/CGATOxford/UMI-tools/issues/173

RNA-Seq umitools next-gen • 6.1k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 6.3 years ago by Hyper_Odin ▴ 320

1

Your post lacks the details needed to reproduce the error or problem (which I assume is what this is). What are the command lines, which errors or warnings came up, how much memory do you have, how much memory was consumed, and what are the input files? Please edit your question accordingly. See: Brief Reminder On How To Ask A Good Question

ADD REPLY • link 6.3 years ago by ATpoint 88k

0

I have added the link where many people have reported the same issue.

ADD REPLY • link 6.3 years ago by Hyper_Odin ▴ 320

1

It may still help to note what your criterion for "excessive" is.

This may be an alternate option to try.

ADD REPLY • link 6.3 years ago by GenoMax 152k

0

And the conclusion of the linked issue seemed to be that, for the most part, that's just how the software is. It has to remember all the reads and their indices that it comes across, so it is going to be memory intensive.

ADD REPLY • link 6.3 years ago by swbarnes2 15k

GenoMax · Answer 1 · 2019-03-14

3

6.3 years ago

i.sudbery 21k

If you are doing Drop-seq or 10x Chromium, I highly recommend alevin. Other tools that can handle UMI deduplication are STARsolo, umis and picard MarkDuplicatesWithCigar. Note that the last two do not do error correction.

Three things might cause excessive memory usage:

Many reads whose mates map to a different contig. There is no solution here unless you are willing to drop these reads; no other tool is going to do any better.
Analysing single-cell RNA-seq without using the --per-cell option.
Extreme read depth, with an appreciable % saturation of the space of possible UMIs.

The general advice for reducing memory usage is not to use the --output-stats option.

If you are struggling with high read depth and UMI space saturation, you can switch to --method=unique. The downside is that you lose UMI-tools' error-correcting dedup, and we have shown that skipping error correction introduces bias, especially at high depth. The upside is that this makes UMI-tools function effectively the same as any other UMI-aware tool: only umi_tools, alevin and STARsolo really use error correction on UMIs. Otherwise, this is just an intrinsically memory-hungry task.
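
To make the trade-off concrete, here is a toy Python sketch of what is lost when error correction is switched off. It is not UMI-tools' actual algorithm: the greedy Hamming-distance collapse, the function names and the example UMIs are all assumptions made for illustration, and it only considers UMIs observed at a single mapping position.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def count_unique(umis):
    """'Unique'-style counting: every distinct UMI sequence is one molecule."""
    return len(set(umis))


def count_with_error_correction(umis, max_dist=1):
    """Simplified error-aware counting (an illustrative stand-in only).

    UMIs are visited from most to least abundant; a UMI within max_dist
    mismatches of an already accepted UMI is treated as a sequencing
    error of that UMI and is not counted again.
    """
    counts = Counter(umis)
    accepted = []
    for umi, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if not any(hamming(umi, kept) <= max_dist for kept in accepted):
            accepted.append(umi)
    return len(accepted)


# Hypothetical UMIs seen at one position: two real molecules plus
# a few low-count reads carrying a single-base error.
umis = ["ACGT"] * 50 + ["ACGA"] * 2 + ["TTGC"] * 30 + ["TTGG"]

print(count_unique(umis))                 # 4
print(count_with_error_correction(umis))  # 2
```

In this made-up example the unique count reports four molecules where there are plausibly two, and the more reads there are at a position, the more such error UMIs tend to appear.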

ADD COMMENT • link 6.3 years ago by i.sudbery 21k

0

My sequences are not from single cells, so I guess I have to stick with either umis or picard. I have tried running samples without --output-stats, but the problem appears to be the same. The --method=unique option works for me. I have observed that the duplicate UMIs have been removed, but I am not sure how to see the number of UMIs extracted.

Before dedup :

@SN526:357:CCAUDACXX:1:1104:2387:1878_CCAAGACCAACC 1:N:0:ATCACG
NCATTGGTGGTTCAGTGGTAGAATTCTCGC
+
#4=DFFFFHHHHHJJIJIIEHHJJJJJJJJ
@SN526:357:CCAUDACXX:1:1104:2744:1836_ACTATGTCAACT 1:N:0:ATCACG
NCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT
+
#4=DFFFFHHHHHJJJJJJJHIJJJJJJJJJJJJIHJJJGIIII
@SN526:357:CCAUDACXX:1:1104:2683:1842_GCCTCCGCGGGG 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTG
+
#1=DDFFFHDHHHGHIIHGHHG
@SN526:357:CCAUDACXX:1:1104:2676:1990_GTGCTACTTGGG 1:N:0:ATCACG
NGACCTATGAATTGACAGCCAG

After dedup:

@SN526:357:CCAUDACXX:2:2111:20673:42502_TATCTATACGTT
CGGCGATCTATTGAAAGTCAGCCCTCGAAACAAGGGTTTGT
+
+14BDDD<DF?A:EF>FG<AEGDGD81?)?6:DFAD9BFFF
@SN526:357:CCAUDACXX:2:2116:12201:59402_CCTTCCGCCCGG
CACCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT
+
1++=BDDDDDDEDEEDIIIEFEIIIIIIIIIIIDDI?D=DB
@SN526:357:CCAUDACXX:2:2114:3337:83332_TGGACATTTATC
GTGCGATCTATTGAAAGTCAGCCCTCGAGACAAGGGTTTGTC
+
1=;A;@?B?4<DC<:EC+AFBF?DGFFE):??)?@BF31??F
@SN526:357:CCAUDACXX:2:1212:11302:95510_CTTCGATCCCCG
CTGCGGTCTATTGAAAGTCAGCCCTCGACGCAAGGGTTTGT
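
One rough way to see how many UMIs are present is to tally them straight from the read names. The Python sketch below assumes standard four-line FASTQ records with the UMI appended after the last underscore of the read name, as in the records above; the function name count_umis is made up for this example, and the input is the first two deduplicated records shown above.

```python
from collections import Counter
import io


def count_umis(fastq_handle):
    """Tally UMIs taken from the text after the last '_' in each read name.

    Assumes strict 4-line FASTQ records, so header lines are found by
    position rather than by a leading '@' (quality lines may start with '@').
    """
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 0 and line.startswith("@"):
            read_name = line[1:].split()[0]  # drop '@' and any comment field
            counts[read_name.rsplit("_", 1)[-1]] += 1
    return counts


# First two "after dedup" records from the post, used as example input.
records = [
    "@SN526:357:CCAUDACXX:2:2111:20673:42502_TATCTATACGTT",
    "CGGCGATCTATTGAAAGTCAGCCCTCGAAACAAGGGTTTGT",
    "+",
    "+14BDDD<DF?A:EF>FG<AEGDGD81?)?6:DFAD9BFFF",
    "@SN526:357:CCAUDACXX:2:2116:12201:59402_CCTTCCGCCCGG",
    "CACCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT",
    "+",
    "1++=BDDDDDDEDEEDIIIEFEIIIIIIIIIIIDDI?D=DB",
]
counts = count_umis(io.StringIO("\n".join(records) + "\n"))
print(len(counts), "distinct UMIs in", sum(counts.values()), "reads")
```

For a real file, the same function can be given an open file handle instead of the in-memory example. Note that this counts distinct UMI sequences across the whole file, not per gene or per position.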

ADD REPLY • link updated 6.3 years ago by GenoMax 152k • written 6.3 years ago by Hyper_Odin ▴ 320

score 1 · Answer 2 · 2019-06-28

1

6.0 years ago

Lior Pachter ▴ 720

The kallisto | bustools workflow has a very low memory footprint.

ADD COMMENT • link 6.0 years ago by Lior Pachter ▴ 720