Question

umi_tools using regex greedy quantifiers only returning 1 character

3.1 years ago

williamtmills ▴ 20

I am trying to use umi_tools to remove the UMIs and cell barcode from each read and keep the remaining sequence. Unfortunately, after correctly removing the UMIs and cell barcode, it returns only a single base of the DNA sequence, when I know there should be more. Here is the command I am using:

umi_tools extract --extract-method=regex \
--bc-pattern='(?P<umi_1>.{4})(?P<cell_1>.{15,27}).+(?P<umi_2>.{2})$' \
--stdin input.fastq \
--stdout output.fastq \
--whitelist=whitelist.txt

The output fastq file looks like this for every read:

@K00124:571:HC7V2BBXX:1:2224:2047:53557_TTCAAGTAATCCAGGATAGGCT_ACTTCG 1:N:0:ATCACG
T
+
J

Where TTCAAGTAATCCAGGATAGGCT is the cell barcode and ACTTCG is the two UMIs joined together.

Why is the .+ greedy quantifier not returning every base between the cell barcode and the second UMI?

regex umi_tools • 1.6k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 3.1 years ago by williamtmills ▴ 20

That is indeed a puzzle. Can you post the sequence of the read pre-trimming?

ADD REPLY • link 3.1 years ago by i.sudbery 19k

Your question made me look closer at the sequences. What I believe is happening is that umi_tools is ONLY trimming and returning reads that have exactly one base between the cell barcode and the 3' UMI, rather than returning just one base from every read. For example, the image below shows the read length distribution of the input fastq file (2 to 147 bases). Many reads should contain more than one base between the cell barcode and the 3' UMI, yet they do not appear to be trimmed and returned. Any idea why it is only returning reads that have one base between the cell barcode and the 3' UMI?

Input fastq read length distribution

ADD REPLY • link updated 3.1 years ago by i.sudbery 19k • written 3.1 years ago by williamtmills ▴ 20

Can you try forcing it to match the start of the string? So:

'^(?P<umi_1>.{4})(?P<cell_1>.{15,27}).+(?P<umi_2>.{2})$'

Also, what are you trying to achieve by setting the cell barcode to between 15 and 27 characters? When do you hope it will capture 15 and when 27?
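One way to see what the `{15,27}` quantifier does on its own is to run the pattern outside umi_tools. The sketch below uses Python's standard `re` module, on the assumption that umi_tools' regex matching behaves the same way for this pattern. The two reads and the `split_read` helper are made up for illustration; only the pattern and the barcode come from this thread.

```python
import re

# Pattern from the question, anchored at the start of the read.
PATTERN = re.compile(r"^(?P<umi_1>.{4})(?P<cell_1>.{15,27}).+(?P<umi_2>.{2})$")


def split_read(seq):
    """Return (umi_1, cell_1, remainder, umi_2), or None if there is no match."""
    m = PATTERN.match(seq)
    if m is None:
        return None
    # The unnamed `.+` matched whatever lies between cell_1 and umi_2.
    remainder = seq[m.end("cell_1"):m.start("umi_2")]
    return m.group("umi_1"), m.group("cell_1"), remainder, m.group("umi_2")


# Hypothetical reads: 4 nt UMI + 22 nt barcode + insert + 2 nt UMI.
BARCODE = "TTCAAGTAATCCAGGATAGGCT"
short_read = "ACTT" + BARCODE + "T" + "CG"           # 1-base insert
long_read = "ACTT" + BARCODE + "GATTACAGAT" + "CG"   # 10-base insert

for name, read in [("short", short_read), ("long", long_read)]:
    umi_1, cell, remainder, umi_2 = split_read(read)
    print(name, len(cell), len(remainder))
# short 22 1
# long 27 5
```

In this sketch the greedy `{15,27}` takes as many bases as it can while still leaving one base for `.+` and two for `umi_2`. The cell group stops at 22 bases only when the insert is a single base; with a longer insert it grows to 27 and swallows part of the insert.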

ADD REPLY • link 3.1 years ago by i.sudbery 19k

I had actually tried forcing it to match the start of the string with ^ but unfortunately I still get the same result of only processing reads with exactly 1 base between the cell barcode and the 3' UMI.

The 15,27 corresponds to the variable-length cell barcodes listed in the whitelist.txt file (minimum length 15 bases, maximum length 27 bases). That part at least appears to be working correctly, in that it is only processing reads in which the cell barcode following the first UMI exactly matches one of those listed in whitelist.txt.

ADD REPLY • link 3.1 years ago by williamtmills ▴ 20

Hmmm....

Can you try it without the whitelist?

UMI-tools will extract the barcodes first, and then compare them to the whitelist. That is, the whitelist won't be used while the UMIs and cell barcodes are being extracted.
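To make that order concrete, here is a minimal sketch of "extract first, then compare to the whitelist". It is a simplification written with Python's standard `re` module, not UMI-tools' actual code. The `extract_then_filter` helper, the one-entry whitelist and the two reads are hypothetical; the pattern and the barcode are the ones from the question.

```python
import re

PATTERN = re.compile(r"^(?P<umi_1>.{4})(?P<cell_1>.{15,27}).+(?P<umi_2>.{2})$")

# Hypothetical one-entry whitelist; the barcode is taken from the read header above.
WHITELIST = {"TTCAAGTAATCCAGGATAGGCT"}


def extract_then_filter(seq, whitelist):
    """Extract barcodes with the regex, then check cell_1 against the whitelist.

    Returns the trimmed sequence, or None if the read is dropped.
    """
    m = PATTERN.match(seq)
    if m is None:
        return None  # pattern did not match at all
    if m.group("cell_1") not in whitelist:
        return None  # extracted barcode is not whitelisted
    return seq[m.end("cell_1"):m.start("umi_2")]


# Hypothetical reads carrying the same whitelisted barcode.
one_base = "ACTT" + "TTCAAGTAATCCAGGATAGGCT" + "T" + "CG"
ten_bases = "ACTT" + "TTCAAGTAATCCAGGATAGGCT" + "GATTACAGAT" + "CG"

print(extract_then_filter(one_base, WHITELIST))   # T
print(extract_then_filter(ten_bases, WHITELIST))  # None
```

Under these assumptions the second read is dropped even though it starts with a whitelisted barcode, because the regex has already captured 27 bases for `cell_1` before the whitelist is consulted. That would be consistent with only one-base reads surviving, but it is only one possible explanation, which is why running without the whitelist is a useful check.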

Failing that, I think I'm going to need a sample of reads to work with. It only needs to be 10 or so reads. Can you either open an issue on the umi_tools GitHub (http://www.github.com/CGATOxford/UMI-tools), upload a link here, or, if you really don't want them public, email me? You can find my email address on the UMI-tools paper.

ADD REPLY • link 3.1 years ago by i.sudbery 19k