Question: Umi tools dedup output
gravatar for linjc.xmu
11 months ago by
linjc.xmu10 wrote:

Hi. Does anyone use UMItools for remove reads duplication? I ran dedup scripts and output 3 files, xxx_edit_distance.tsv, xxx_per_umi.tsv and xxx_per_position.tsv. If I want to calculate how many unique read I got (after removing duplication) and how many UMI/gene, which file and which column should I use? Thank you very much.

umitools • 590 views
ADD COMMENTlink modified 11 months ago by i.sudbery8.1k • written 11 months ago by linjc.xmu10
gravatar for i.sudbery
11 months ago by
Sheffield, UK
i.sudbery8.1k wrote:

The total number of reads output after deduplication is most easily retrieved from the log file, where one of the last lines of the log should be something like

# INFO 10:30:00 30-07-2019: Number of reads output (after deduplcation): XXXX

You could also get this from the BAM index of course as well

samtools idxstats my_bam.bam | awk '{total += $3} END{print total}'

If you want the total number of reads per gene, then you should be using umi_tools count rather than umi_tools dedup. It will require either that your reads have been aligned to the transcriptome rather than the genome (so that contig is now gene or transcript, rather than chromosome), or that yoru reads have been assigned to genes using featureCounts. See the umi_tools manual:

ADD COMMENTlink written 11 months ago by i.sudbery8.1k

My data is single end. When I use SAM for dedup (without sorting), it produces almost the same size as input. When I use sorted and index BAM, the output is about 10% of input BAM. But during BAM file dedup, RAMs are collapsed (I have 4X8GB RAMs, now only left 1 after 2 times running BAM dedup). My questions are: 1) if use SAM for dedup, does it requires for sorting? 2) I used samtools view -bS to convert SAM to BAM, and then sorted and indexed it. Is that right? 3) UMI-tools dedup requires a lot memory module? Thank you.

ADD REPLYlink written 11 months ago by linjc.xmu10

1) Looks like you've hit a bug with your SAM dedupping - UMI-Tools shouldn't really allow you to use an unsorted SAM file. I'll see about getting that fixed. Yes, you should use a sorted SAM file. Note that you can't use SAM input for paired-end data (although you should be okay for this single-end dataset).

2) Yes, this is the correct way to convert yoru SAM to BAM

3) UMI-tools dedup can use a lot of memory if you have a very large or complex sample. However, you can greatly reduce the memory usage by not generating the stats files. See the FAQ

ADD REPLYlink written 11 months ago by i.sudbery8.1k

I used sorted and indexed BAM for dedup on cluster, but it stopped at some point.

2019-07-31 19:40:04,781 INFO Written out 6400000 reads

2019-07-31 19:41:47,464 INFO Written out 6500000 reads

2019-07-31 19:44:46,081 INFO Written out 6600000 reads

2019-07-31 19:45:06,065 INFO Parsed 11000000 input reads

2019-07-31 19:54:55,580 INFO Written out 6700000 reads

No output msg on screen anymore. Do you have any suggestion on the file size that not larger than what on a i5 4-core and 32GB RAM desktop? Thank you very much.

ADD REPLYlink modified 11 months ago • written 11 months ago by linjc.xmu10

Can you show me the exact command you are using?

ADD REPLYlink written 11 months ago by i.sudbery8.1k

umi_tools dedup -I my.sort.bam -S my.dedup.bam

ADD REPLYlink written 11 months ago by linjc.xmu10

Hmmmm..... how long are your UMIs? What is your sequence depth. It feels like you must have a particularly complex locus if umi_tools is using more than 32GB on a single-end run with no stats.

ADD REPLYlink written 11 months ago by i.sudbery8.1k

UMI is 10 nt long. Each sample contains ~20M 75 bp reads (include UMI) Arabidopsis sample. This was generated by 18 cycles PCR. I got dedup result for the other data (also ~20M reads), which was generated by 30 cycles PCR.

ADD REPLYlink written 11 months ago by linjc.xmu10

So your UMI is plenty long enough and your read depth not too high..... It looks like you do just have a few very complex locations. You might look in the non-deduped data if you've got some genes/locations with off the chart read numbers. If this is RNAseq, perhaps you have some ribosomal contamination in this sample for example. If you find something like this, you might exclude reads from these regions (assuming you don't care about quantifying those). I would start by looking at levels of ribosomal, mitochondrial and chloroplast. Obviously mito and chloro are the easiest to exclude, but I guess you might be interested in them...

ADD REPLYlink written 11 months ago by i.sudbery8.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1016 users visited in the last hour