I have a single-cell RNAseq data set with a very long barcode structure, containing three BC (N
, each 10nt long) separated by a fixed spacer (each 10nt long). This is then followed directly by the UMI (U
) of 8nt long
The pattern should be like that: NNNNNNNNNNCAGCTACTGCNNNNNNNNNNCGAGTACCCTNNNNNNNNNNUUUUUUUU
.
When I am running this whitelist command
umi_tools whitelist --stdin R1_001.fastq.gz \
--extract-method=regex \
--bc-pattern='(?P<cell_1>.{10})(?P<discard_1>CAGCTACTGC)(?P<cell_2>.{10})(?P<discard_1>CGAGTACCCT)(?P<cell_3>.{10})(?P<umi_1>.{8}).*' \
--knee-method=density --plot-prefix '11_mRNA' \
--ed-above-threshold=correct \
--log2stderr -L 1_mRNA.log > 1_mRNA_whitelist.txt;
With and without the --knee-method=density
I get completely different results with only 581 barcodes in the whitelist with but over 1700 barcodes are written to the tsv
file with the knee method.
Is there a reason for this big difference?
I was wondering how I can figure out why I have such aa low number of barcodes. My fastq file has almost 20Mil reads and I expected more barcodes (these list is of over 18K unique barcodes).
thanks Assa
If you can post the knee plot, that might be helpful. But I will add that the whitelist command in UMI-tools was primarily designed to be a convenience, and never claimed to be state-of-the-art.
thanks for the answer. I wasn't complaining about the tool (or the results for that matter). I'm just trying to understand why the difference is so big.
I'm attaching both the one with the
density
and without thedensity
parameter.Is it possible to do the
extract
without the whitelist parameter? Also for single-cell fastq pairs?I didn't mean to sound like I was taking it as a criticism, I genuinely meant that we never intended to be a foolproof whitelisting tool. That said, those plots look very strange. If you look at the y axis, you'll see that you never have more than 10 UMIs for a cell barcode. A kneeplot should look more like:
Note that "counts" goes all the way up to 10^5.
Is it possible you've specified your read geometry slightly wrong and most reads are being rejected? You could try allowing some errors in your fixed discard sequentes by adding, eg.
{e<=2}
after the groups.Its also perfectly possible to run extract without a whitelist.
Thank you for the explanation. I can see the plots look strange, but i truly don't know why this is. I have a feeling that the sequencing run didn't capture what we want or there was something wrong during the library preparation.
I also attach below the barcode structure. Maybe you can see something wrong in the pattern I'm using.
what is the difference between
{e<=2}
and{s<=2}
?I can't immediately see anything wrong. The log file should contain lines saying how many reads were parsed, how many reads matched the barcode pattern, and how many unique cell barcodes were seen in total.
{e<=2}
is an edit distance of two. It would include insertions and deletions.{s<=2}
is a subsitution limit of two, that would only include substitions, not indels.I think I might know my problem. The barcodes are in this case on read2 and not on read1. Does umi-tools recognise it automatically, or do I need to switch read1 and read2 in the command?
No, UMI-tools won't work this out automatically, you'll need to either switch read1 and read2 or use the new
--read2-only
which is currently only available from the version on the master branch on github.and if I use the command as such:
Where I only give the
bc-pattern2
as search parameter. Would this works as well?```