Question

UMI-Tools knee-method has great influence on the results of white list

0

Entering edit mode

6 months ago

Assa Yeroslaviz ★ 1.9k

I have a single-cell RNAseq data set with a very long barcode structure, containing three BC (N, each 10nt long) separated by a fixed spacer (each 10nt long). This is then followed directly by the UMI (U) of 8nt long

The pattern should be like that: NNNNNNNNNNCAGCTACTGCNNNNNNNNNNCGAGTACCCTNNNNNNNNNNUUUUUUUU.

When I am running this whitelist command

umi_tools whitelist --stdin R1_001.fastq.gz \
                    --extract-method=regex \
                    --bc-pattern='(?P<cell_1>.{10})(?P<discard_1>CAGCTACTGC)(?P<cell_2>.{10})(?P<discard_1>CGAGTACCCT)(?P<cell_3>.{10})(?P<umi_1>.{8}).*' \
                    --knee-method=density --plot-prefix '11_mRNA' \
                    --ed-above-threshold=correct \
                    --log2stderr -L 1_mRNA.log > 1_mRNA_whitelist.txt;

With and without the --knee-method=density I get completely different results with only 581 barcodes in the whitelist with but over 1700 barcodes are written to the tsv file with the knee method. Is there a reason for this big difference?

I was wondering how I can figure out why I have such aa low number of barcodes. My fastq file has almost 20Mil reads and I expected more barcodes (these list is of over 18K unique barcodes).

thanks Assa

single-cell whitelist UMI RNA UMI-Tools • 1.1k views

ADD COMMENT • link updated 5 months ago by i.sudbery 20k • written 6 months ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

If you can post the knee plot, that might be helpful. But I will add that the whitelist command in UMI-tools was primarily designed to be a convenience, and never claimed to be state-of-the-art.

ADD REPLY • link 6 months ago by i.sudbery 20k

0

Entering edit mode

thanks for the answer. I wasn't complaining about the tool (or the results for that matter). I'm just trying to understand why the difference is so big.

I'm attaching both the one with the density and without the density parameter.

Is it possible to do the extract without the whitelist parameter? Also for single-cell fastq pairs?

density method knee method

ADD REPLY • link 6 months ago by Assa Yeroslaviz ★ 1.9k

1

Entering edit mode

I didn't mean to sound like I was taking it as a criticism, I genuinely meant that we never intended to be a foolproof whitelisting tool. That said, those plots look very strange. If you look at the y axis, you'll see that you never have more than 10 UMIs for a cell barcode. A kneeplot should look more like:

enter image description here

Note that "counts" goes all the way up to 10^5.

Is it possible you've specified your read geometry slightly wrong and most reads are being rejected? You could try allowing some errors in your fixed discard sequentes by adding, eg. {e<=2} after the groups.

Its also perfectly possible to run extract without a whitelist.

ADD REPLY • link 6 months ago by i.sudbery 20k

0

Entering edit mode

Thank you for the explanation. I can see the plots look strange, but i truly don't know why this is. I have a feeling that the sequencing run didn't capture what we want or there was something wrong during the library preparation.

I also attach below the barcode structure. Maybe you can see something wrong in the pattern I'm using.

what is the difference between {e<=2} and {s<=2}?

barcode structure

ADD REPLY • link 6 months ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

I can't immediately see anything wrong. The log file should contain lines saying how many reads were parsed, how many reads matched the barcode pattern, and how many unique cell barcodes were seen in total.

{e<=2} is an edit distance of two. It would include insertions and deletions. {s<=2} is a subsitution limit of two, that would only include substitions, not indels.

ADD REPLY • link 6 months ago by i.sudbery 20k

0

Entering edit mode

I think I might know my problem. The barcodes are in this case on read2 and not on read1. Does umi-tools recognise it automatically, or do I need to switch read1 and read2 in the command?

ADD REPLY • link 6 months ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

No, UMI-tools won't work this out automatically, you'll need to either switch read1 and read2 or use the new --read2-only which is currently only available from the version on the master branch on github.

ADD REPLY • link 6 months ago by i.sudbery 20k

0

Entering edit mode

and if I use the command as such:

umi_tools extract --extract-method=regex --error-correct-cell \
                  --bc-pattern2='(?P<cell_1>.{10})(?P<discard_1>CAGCTACTGC){e<=2}(?P<cell_2>.{10})(?P<discard_1>CGAGTACCCT){e<=2}(?P<cell_3>.{10})(?P<umi_1>.{8}).*' \
                  --stdin xxx.fastq.gz \
                  --stdout xxx.extracted_R1.fastq.gz \
                  --read2-in xxx.fastq.gz \
                  --read2-out=xxx.extracted_R2.fastq.gz \
                  -L xxx.log

Where I only give the bc-pattern2 as search parameter. Would this works as well?

```

ADD REPLY • link 6 months ago by Assa Yeroslaviz ★ 1.9k

score 3 · Accepted Answer · 2024-05-10

3

Entering edit mode

6 months ago

i.sudbery 20k

With the --read2-only option (which isn't available yet on the bioconda version), this will complain about a lack of --bc-pattern.

However:

umi_tools extract --extract-method=regex --error-correct-cell \
                  --bc-pattern='(?P<cell_1>.{10})(?P<discard_1>CAGCTACTGC){e<=2}(?P<cell_2>.{10})(?P<discard_1>CGAGTACCCT){e<=2}(?P<cell_3>.{10})(?P<umi_1>.{8}).*' \
                  --stdin xxx_R2.fastq.gz \
                  --stdout xxx.extracted_R2.fastq.gz \
                  --read2-in xxx_R1.fastq.gz \
                  --read2-out=xxx.extracted_R1.fastq.gz \
                  -L xxx.log

Would work.

ADD COMMENT • link 6 months ago by i.sudbery 20k

0

Entering edit mode

Can you pls make it an answer so that I can accept it? thx

ADD REPLY • link 5 months ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Hi i.sudbery, I have seen you're recommending to use the Alevin-fry pipeline now instead of umi_tools. But I couldn't find out, how to use such a complicated regular expression as a pattern for my barcode structure. Can you please tell me if this is possible at all?

Or, would it only apply for the count matrix after mapping?

thanks again for all the help.

ADD REPLY • link 5 months ago by Assa Yeroslaviz ★ 1.9k

1

Entering edit mode

We recommend alecin where it supports the kit you are using. As far as I'm aware it doesn't support custom read geometry yet, but does support a range of single cell kits.

ADD REPLY • link 5 months ago by i.sudbery 20k