Why does umi_tools dedup not write anything in the deduplicated.bam file?
0
0
Entering edit mode
2.4 years ago

Hi, after getting a SAM file from Bowtie2, I converted it to a sorted BAM and indexed it according to the UMI-tools documentation. Everything seemed fine, but when I run umi_tools dedup with the following command, nothing is written to the output file. I had removed the --output-stats option to make it faster.

command: umi_tools dedup -I example.bam -S deduplicated.bam

However, before running my final dataset, I had tested with a sample dataset, and in that case it worked. Do you have any idea, please? Thanks.

genome next-gen alignment • 1.4k views
1
Entering edit mode

Have you saved the log output anywhere, either with -L logfile.log or by redirection? You need to know whether dedup finished successfully. The log should end with a "Job finished" message.
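For reference, a rerun that keeps the log might look like this (the log filename is an arbitrary choice):

```shell
# Re-run dedup, writing the log to a file so completion can be checked afterwards
umi_tools dedup -I example.bam -S deduplicated.bam -L dedup.log

# A successful run should end with a "Job finished" line in the log
tail -n 3 dedup.log
```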

0
Entering edit mode

Hi, I hadn't saved the logfile. However, you are right: my job was killed because I requested too few resources (ncpselect=1:ncpus=2:mem=250gb,walltime=50:00:00). I have now requested 160 hours, but the resources still may not be sufficient, though I could add more walltime. My UMI length is 18 nt and the total input is 103,861,189 single-end reads. I have access to an HPC cluster. What resources should I request, any idea, please?

1
Entering edit mode

So the 16S is only about 1.5 kb long, and you've sequenced it with over 100M reads. If coverage is even, that means you will have around 74,000 reads at every position. This means either that you have a very large number of UMIs, or a very high duplication rate. It would be important to know which: if you really have 74,000 different UMIs at every position, it's probably hopeless.
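The coverage arithmetic can be sanity-checked in a couple of lines of shell (the read count is the one reported above; ~1,400 mappable positions across a ~1.5 kb 16S is an assumption):

```shell
reads=103861189   # total single-end input reads reported in the thread
positions=1400    # rough number of positions across a ~1.5 kb 16S (assumption)
echo $((reads / positions))   # roughly 74,000 reads per position if coverage is even
```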

We've only benchmarked as far as 16,384 UMIs at one position, and in that case it takes around 30 minutes to compute. You have 1,350 positions!

It's unlikely that you have 74,000 UMIs at every position, however. If you look at the output of extract, it should list every UMI identified. You can get a rough estimate of the size of the problem by dividing the number of UMIs by the length of the RNA you are sequencing. Our benchmarks suggest:

#UMIs/pos    sec/pos    Memory (total)
16           4          0.064 GB
256          4          0.064 GB
1,024        8          0.064 GB
4,096        64         0.3 GB
8,192        256        8 GB
16,384       2,048      64 GB
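Extrapolating from the last row of that table, a worst case for this dataset is easy to estimate (1,350 positions is the figure used above; treating every position as a 16,384-UMI position is a deliberately pessimistic assumption):

```shell
positions=1350     # number of positions across the 16S, per the thread
sec_per_pos=2048   # benchmark figure for 16,384 UMIs at a single position
total=$((positions * sec_per_pos))
echo "$total s = $((total / 3600)) h"   # on the order of 768 hours, i.e. about a month
```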


I'm not sure what to suggest if you are in a regime where this sort of analysis isn't really feasible. You could divide up your BAM file by position (say, 200 bp blocks) and run each block in a separate process.
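A sketch of that split-and-parallelise idea, assuming a coordinate-sorted, indexed BAM and standard samtools; the reference name and length here are placeholders, and note that reads overlapping a window boundary will be fetched by both windows:

```shell
ref=mychrom   # placeholder reference name in the BAM header
len=1542      # placeholder reference length

for start in $(seq 1 200 "$len"); do
    end=$((start + 199))
    # Extract one 200 bp window into its own BAM and index it
    samtools view -b example.bam "${ref}:${start}-${end}" > "win_${start}.bam"
    samtools index "win_${start}.bam"
    # Dedup each window as a background job
    umi_tools dedup -I "win_${start}.bam" -S "win_${start}.dedup.bam" &
done
wait   # wait for all windows to finish
```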

One possibility, though it seems a shame since you have a good UMI length, would be to disable error correction during deduplication with --method=unique. You will find this much faster, but the results won't be as good.
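That is just a one-flag change to the command from the question:

```shell
# --method=unique collapses reads only on exact UMI matches (no error correction)
umi_tools dedup -I example.bam -S deduplicated.bam --method=unique -L dedup.log
```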

Alternatively, Daniel Liu's UMICollapse (https://github.com/Daniel-Liu-c0deb0t/UMICollapse) is much faster than umi_tools. It is less flexible, but it should work for your particular analysis.
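The usual invocation, per the UMICollapse README (the jar name may differ depending on how you obtained or built it), is roughly:

```shell
# Deduplicate an indexed, coordinate-sorted BAM with UMICollapse
java -jar umicollapse.jar bam -i example.bam -o deduplicated.bam
```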

0
Entering edit mode

Thank you very much. The UMICollapse worked for me. I am really grateful to you.

0
Entering edit mode

Resource usage depends more on the level of duplication than on the total number of reads. Memory usage in particular depends on the maximum number of UMIs at any one position. What sort of data is this? Is it scRNA-seq? Amplicon sequencing? Exome sequencing?

0
Entering edit mode

My sample was E. coli ribosome. I used the Illumina HiSeq platform for sequencing. Thanks.

0
Entering edit mode

Oops! My job has been killed again after running for 107 hours, even though my walltime was 160 hours. In the error file I found the following line: 5457450.pbsserver.SC: line 15: 12808 Killed