UMI-tools dedup taking too much time and RAM
2.8 years ago
lluc.cabus ▴ 20

I have some RNA-seq data from miRNAs that I aligned to miRBase with Bowtie2. When deduplicating with umi_tools dedup, I find that the time and RAM needed vary enormously between files: some take around 3-4 minutes and 4-5 GB of RAM, while others take more than 2 hours and more than 100 GB of RAM. The BAM files are very similar in size before deduplication, and also very similar in size after it.

Do you know what could be the reason for this? Thank you very much in advance, Lluc

Here is the log of a sample that took more than 2 hours.

# assigned_tag                            : None
# cell_tag                                : None
# cell_tag_delim                          : None
# cell_tag_split                          : -
# chimeric_pairs                          : use
# chrom                                   : None
# compresslevel                           : 6
# detection_method                        : None
# gene_tag                                : None
# gene_transcript_map                     : None
# get_umi_method                          : read_id
# ignore_umi                              : False
# in_sam                                  : False
# log2stderr                              : False
# loglevel                                : 1
# mapping_quality                         : 0
# method                                  : directional
# no_sort_output                          : False
# out_sam                                 : False
# output_unmapped                         : False
# paired                                  : False
# per_cell                                : False
# per_contig                              : False
# per_gene                                : False
# random_seed                             : None
# read_length                             : False
# short_help                              : None
# skip_regex                              : ^(__|Unassigned)
# soft_clip_threshold                     : 4
# spliced                                 : False
# stats                                   : False
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin                                   : <_io.TextIOWrapper name='CA015.bam.sorted.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='CA015.bam_dedup.log' mode='a' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='CA015.bam.dedup.bam' mode='w' encoding='UTF-8'>
# subset                                  : None
# threshold                               : 1
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_sep                                 : _
# umi_tag                                 : RX
# umi_tag_delim                           : None
# umi_tag_split                           : None
# unmapped_reads                          : discard
# unpaired_reads                          : use
# whole_contig                            : False 
2022-01-26 13:58:28,207 INFO command: dedup --stdin=CA015.bam.sorted.bam --log=CA015.bam_dedup.log --stdout=CA015.bam.dedup.bam
2022-01-26 13:59:01,275 INFO Written out 100000 reads
2022-01-26 13:59:50,099 INFO Written out 200000 reads
2022-01-26 14:00:17,556 INFO Written out 300000 reads
2022-01-26 14:00:24,747 INFO Parsed 1000000 input reads
2022-01-26 15:43:09,464 INFO Written out 400000 reads
2022-01-26 15:43:09,478 INFO Written out 500000 reads
2022-01-26 15:43:34,766 INFO Written out 600000 reads
2022-01-26 15:44:08,201 INFO Written out 700000 reads
2022-01-26 15:45:06,353 INFO Written out 800000 reads
2022-01-26 15:47:24,894 INFO Written out 900000 reads
2022-01-26 15:47:31,984 INFO Parsed 2000000 input reads
2022-01-26 15:47:34,439 INFO Written out 1000000 reads
2022-01-26 15:48:22,124 INFO Written out 1100000 reads
2022-01-26 15:49:38,812 INFO Written out 1200000 reads
2022-01-26 15:56:26,068 INFO Written out 1300000 reads
2022-01-26 15:56:28,755 INFO Parsed 3000000 input reads
2022-01-26 16:03:26,343 INFO Written out 1400000 reads
2022-01-26 16:18:47,601 INFO Written out 1500000 reads
2022-01-26 16:18:47,605 INFO Written out 1600000 reads
2022-01-26 16:19:53,921 INFO Written out 1700000 reads
2022-01-26 16:21:05,581 INFO Written out 1800000 reads
2022-01-26 16:22:14,632 INFO Written out 1900000 reads
2022-01-26 16:22:15,241 INFO Parsed 4000000 input reads
2022-01-26 16:22:28,080 INFO Reads: Input Reads: 4005923
2022-01-26 16:22:28,080 INFO Number of reads out: 1940951
2022-01-26 16:22:28,080 INFO Total number of positions deduplicated: 1352
2022-01-26 16:22:28,080 INFO Mean number of unique UMIs per position: 1836.91
2022-01-26 16:22:28,080 INFO Max. number of unique UMIs per position: 479850
# job finished in 8639 seconds at Wed Jan 26 16:22:28 2022 -- 8575.97 62.63  0.00  0.00 -- 36bc9d11-7b8f-4d0e-a7bb-dd6495d8027f
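
One way to see where the time goes in a log like the one above is to look for the longest gap between consecutive timestamped lines. The following is a minimal sketch that assumes the `YYYY-MM-DD HH:MM:SS,mmm` prefix shown in this log; the function names are hypothetical and not part of UMI-tools.

```python
from datetime import datetime

def parse_timestamp(line):
    """Return the datetime at the start of a log line, or None if there is none."""
    try:
        # The first 23 characters look like "2022-01-26 13:58:28,207".
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None  # e.g. the "# option : value" header lines

def longest_gap(lines):
    """Return (gap, line_before, line_after) for the largest time gap in the log."""
    stamped = [(parse_timestamp(l), l) for l in lines]
    stamped = [(t, l) for t, l in stamped if t is not None]
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

# Three lines copied from the log above.
log_lines = [
    "2022-01-26 14:00:17,556 INFO Written out 300000 reads",
    "2022-01-26 14:00:24,747 INFO Parsed 1000000 input reads",
    "2022-01-26 15:43:09,464 INFO Written out 400000 reads",
]
gap, before, after = longest_gap(log_lines)
print(gap, "|", before, "->", after)
```

In this log the largest gap (about 1 hour 43 minutes) falls between "Parsed 1000000 input reads" and "Written out 400000 reads", which suggests that most of the run time was spent on a small stretch of the file.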
UMI-tools bam RNA-seq • 1.6k views

I am having the same trouble with 6 (out of 76) of my samples. While all the others ran fine, these 6 samples do not finish even after 16 hours. I also tried changing to --method=unique.

umi_tools dedup -I ${sample}.Aligned.sortedByCoord.out.bam --paired --output-stats=$OUTPUT_PATH/05_Deduplicated_files/${sample}_deduplicated -S $OUTPUT_PATH/05_Deduplicated_files/${sample}_deduplicated.bam

Here is the output for a sample that worked: [screenshot]

The output of the 6 samples that didn't work looks like this: [screenshot]


Please don't post screenshots of text content. They are difficult to read. Please copy and paste the relevant parts of the log output and format them as code using the 101010 button in the edit window.


Your sample that worked finished after reading 4.3 million reads. The one that didn't finish was still reading in reads after 42 million reads. It's not really surprising that it's taking longer.

2.8 years ago

I don't have an answer, just a couple of guesses. If you are concerned about time and memory usage, you could try changing the --method option to a simpler strategy for detecting duplicates. The default, directional, builds a network of the UMIs found at each position, and this can take up a lot of time and memory.
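
To make the cost of the directional method concrete, here is a minimal sketch of the kind of network it builds at a single position. It assumes the rule described in the UMI-tools documentation (connect UMI a to UMI b when they differ by at most `threshold` mismatches and count(a) >= 2 * count(b) - 1) and uses a naive all-pairs scan. The function names and toy counts are made up for illustration, and UMI-tools itself may use a faster search than this.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def directional_edges(umi_counts, threshold=1):
    """Naive sketch: compare every pair of UMIs seen at one position.

    Returns (edges, comparisons), where edges[a] lists the UMIs that a
    points to under the directional rule.
    """
    edges = {umi: [] for umi in umi_counts}
    comparisons = 0
    for a, b in combinations(umi_counts, 2):
        comparisons += 1
        if hamming(a, b) > threshold:
            continue
        if umi_counts[a] >= 2 * umi_counts[b] - 1:
            edges[a].append(b)
        if umi_counts[b] >= 2 * umi_counts[a] - 1:
            edges[b].append(a)
    return edges, comparisons

def n_pairs(n):
    """Number of UMI pairs an all-pairs scan has to look at."""
    return n * (n - 1) // 2

# Toy counts for four UMIs at one position (illustrative only).
counts = {"ACGT": 100, "ACGA": 3, "TTTT": 50, "TTTA": 40}
edges, comparisons = directional_edges(counts)
print(edges, comparisons)
print(n_pairs(479850))  # the "Max. number of unique UMIs per position" in the log
```

Under this naive scheme, the position with 479850 unique UMIs reported in the log would need on the order of 1.15e11 pairwise comparisons, which would be consistent with a single position dominating the run time. With --method unique no such comparison is needed, because every distinct UMI is kept as its own group.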

As for the discrepancy you see between libraries, maybe the demanding ones have a very skewed distribution of reads, so that a few genes absorb most of the reads, and that makes the computation heavy.
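
The log in the question is consistent with this guess: the mean number of unique UMIs per position is 1836.91, but the maximum is 479850. A quick way to check other samples is to count unique UMIs per mapping position yourself. The sketch below is a simplification under stated assumptions: it reads plain SAM text lines, takes the UMI from the read name after the last `_` (the `umi_sep` shown in the log), and groups by reference name, POS, and strand without the soft-clipping adjustments that UMI-tools applies. The function names and toy records are hypothetical.

```python
from collections import defaultdict

def umis_per_position(sam_lines, umi_sep="_"):
    """Count unique UMIs per (reference, position, strand) from SAM text lines."""
    umis = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:
            continue  # read is unmapped
        strand = "-" if flag & 0x10 else "+"
        umi = qname.rsplit(umi_sep, 1)[-1]  # UMI appended to the read name
        umis[(rname, pos, strand)].add(umi)
    return {key: len(value) for key, value in umis.items()}

def top_positions(counts, n=5):
    """Return the n positions with the most unique UMIs."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy records with only the first four SAM columns (illustrative only).
records = [
    "r1_AAAA\t0\tmirA\t5",
    "r2_AAAC\t0\tmirA\t5",
    "r3_AAAA\t0\tmirA\t5",
    "r4_GGGG\t16\tmirB\t9",
    "r5_TTTT\t4\t*\t0",
]
counts = umis_per_position(records)
print(top_positions(counts, n=1))
```

The input lines could come from `samtools view` run on the sorted BAM file. If one or two positions hold hundreds of thousands of UMIs in the slow samples but not in the fast ones, that would support the skew explanation.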


Thank you very much for your answer.

I suppose that in my case I should use --method unique, since the other methods rely on clustering the reads, right?
