umi_tools dedup taking too much time and RAM
2.2 years ago
lluc.cabus ▴ 20

I have miRNA RNA-seq data that I aligned to miRBase with Bowtie2. When deduplicating with umi_tools dedup, run time and memory usage vary enormously between files: some finish in 3-4 minutes using 4-5 GB of RAM, while others take more than 2 hours and more than 100 GB of RAM. The BAM files are very similar in size to one another, both before and after deduplication.

Do you know what could be the reason for this? Thank you very much in advance, Lluc

Here is the log of a sample that took more than 2 hours.

# assigned_tag                            : None
# cell_tag                                : None
# cell_tag_delim                          : None
# cell_tag_split                          : -
# chimeric_pairs                          : use
# chrom                                   : None
# compresslevel                           : 6
# detection_method                        : None
# gene_tag                                : None
# gene_transcript_map                     : None
# get_umi_method                          : read_id
# ignore_umi                              : False
# in_sam                                  : False
# log2stderr                              : False
# loglevel                                : 1
# mapping_quality                         : 0
# method                                  : directional
# no_sort_output                          : False
# out_sam                                 : False
# output_unmapped                         : False
# paired                                  : False
# per_cell                                : False
# per_contig                              : False
# per_gene                                : False
# random_seed                             : None
# read_length                             : False
# short_help                              : None
# skip_regex                              : ^(__|Unassigned)
# soft_clip_threshold                     : 4
# spliced                                 : False
# stats                                   : False
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin                                   : <_io.TextIOWrapper name='CA015.bam.sorted.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='CA015.bam_dedup.log' mode='a' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='CA015.bam.dedup.bam' mode='w' encoding='UTF-8'>
# subset                                  : None
# threshold                               : 1
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_sep                                 : _
# umi_tag                                 : RX
# umi_tag_delim                           : None
# umi_tag_split                           : None
# unmapped_reads                          : discard
# unpaired_reads                          : use
# whole_contig                            : False 
2022-01-26 13:58:28,207 INFO command: dedup --stdin=CA015.bam.sorted.bam --log=CA015.bam_dedup.log --stdout=CA015.bam.dedup.bam
2022-01-26 13:59:01,275 INFO Written out 100000 reads
2022-01-26 13:59:50,099 INFO Written out 200000 reads
2022-01-26 14:00:17,556 INFO Written out 300000 reads
2022-01-26 14:00:24,747 INFO Parsed 1000000 input reads
2022-01-26 15:43:09,464 INFO Written out 400000 reads
2022-01-26 15:43:09,478 INFO Written out 500000 reads
2022-01-26 15:43:34,766 INFO Written out 600000 reads
2022-01-26 15:44:08,201 INFO Written out 700000 reads
2022-01-26 15:45:06,353 INFO Written out 800000 reads
2022-01-26 15:47:24,894 INFO Written out 900000 reads
2022-01-26 15:47:31,984 INFO Parsed 2000000 input reads
2022-01-26 15:47:34,439 INFO Written out 1000000 reads
2022-01-26 15:48:22,124 INFO Written out 1100000 reads
2022-01-26 15:49:38,812 INFO Written out 1200000 reads
2022-01-26 15:56:26,068 INFO Written out 1300000 reads
2022-01-26 15:56:28,755 INFO Parsed 3000000 input reads
2022-01-26 16:03:26,343 INFO Written out 1400000 reads
2022-01-26 16:18:47,601 INFO Written out 1500000 reads
2022-01-26 16:18:47,605 INFO Written out 1600000 reads
2022-01-26 16:19:53,921 INFO Written out 1700000 reads
2022-01-26 16:21:05,581 INFO Written out 1800000 reads
2022-01-26 16:22:14,632 INFO Written out 1900000 reads
2022-01-26 16:22:15,241 INFO Parsed 4000000 input reads
2022-01-26 16:22:28,080 INFO Reads: Input Reads: 4005923
2022-01-26 16:22:28,080 INFO Number of reads out: 1940951
2022-01-26 16:22:28,080 INFO Total number of positions deduplicated: 1352
2022-01-26 16:22:28,080 INFO Mean number of unique UMIs per position: 1836.91
2022-01-26 16:22:28,080 INFO Max. number of unique UMIs per position: 479850
# job finished in 8639 seconds at Wed Jan 26 16:22:28 2022 -- 8575.97 62.63  0.00  0.00 -- 36bc9d11-7b8f-4d0e-a7bb-dd6495d8027f
UMI-tools bam RNA-seq • 1.1k views
2.2 years ago

I don't have an answer, just a couple of guesses. If you are concerned about time and memory usage, you could change the --method option to a simpler duplicate-detection strategy. The default, directional, builds a network connecting similar UMIs at each position, and this can take a lot of time and memory.
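
To illustrate why a network-based method can get expensive, here is a minimal sketch of a directional-style network built with a naive all-pairs comparison. It is not the UMI-tools implementation, which may organise the search differently. The function names are hypothetical, the UMIs are assumed to be of equal length, and the edge rule (Hamming distance within the threshold, and count of A at least 2 × count of B − 1) is my understanding of the directional method, so treat it as an assumption.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching bases between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_edges(umi_counts, threshold=1):
    """Naive sketch: compare every pair of UMIs seen at one position.

    An edge a -> b is added when the UMIs are within `threshold`
    mismatches and count(a) >= 2 * count(b) - 1 (assumed rule).
    Returns the adjacency lists and the number of pairs compared.
    """
    edges = {umi: [] for umi in umi_counts}
    comparisons = 0
    for a, b in combinations(umi_counts, 2):
        comparisons += 1
        if hamming(a, b) > threshold:
            continue
        if umi_counts[a] >= 2 * umi_counts[b] - 1:
            edges[a].append(b)
        if umi_counts[b] >= 2 * umi_counts[a] - 1:
            edges[b].append(a)
    return edges, comparisons


# Toy data for one mapping position (made up for illustration).
counts = Counter({"ATAT": 100, "ATAA": 3, "GGGG": 50, "GGGC": 40})
edges, n_comparisons = directional_edges(counts)
print(edges)
print(n_comparisons)
```

With this naive approach the number of comparisons grows with the square of the number of distinct UMIs at a position, so a single crowded position can dominate the whole run.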

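As for the discrepancy you see between libraries, maybe the demanding ones have a very skewed read distribution, so that a few genes absorb most of the reads, which makes the computation heavy. Your log is consistent with this: one position has 479850 unique UMIs, against a mean of 1836.91 per position.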
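
A rough back-of-the-envelope check, using only the three summary figures from the log in the question, shows how lopsided the work could be. The all-pairs count is a naive upper bound that I am assuming for illustration, not a measurement of what umi_tools actually does, and the helper name is hypothetical.

```python
def pairwise_comparisons(n_umis):
    """Number of distinct UMI pairs at one position: n * (n - 1) / 2."""
    return n_umis * (n_umis - 1) // 2


# Figures copied from the dedup log in the question.
max_umis = 479850      # Max. number of unique UMIs per position
mean_umis = 1836.91    # Mean number of unique UMIs per position
positions = 1352       # Total number of positions deduplicated

worst = pairwise_comparisons(max_umis)
typical = pairwise_comparisons(round(mean_umis))

# Approximate share of all (position, UMI) combinations at the busiest position.
total_umis = mean_umis * positions
share = max_umis / total_umis

print(f"pairs at busiest position: {worst:.3e}")
print(f"pairs at a mean-sized position: {typical:.3e}")
print(f"share of unique UMIs at busiest position: {share:.1%}")
```

Under these assumptions the busiest position alone implies on the order of 10^11 UMI pairs, several orders of magnitude more than a mean-sized position. Note that the mean is itself inflated by that one position, so most positions are probably much smaller still.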


Thank you very much for your answer.

I suppose that in my case I should use --method unique, since the other methods rely on clustering the reads, right?
