Question

UMIextract giving ValueError: Read sequence: NNNNN is shorter than pattern: NNNNNNNN

4.6 years ago

diwasri • 0

Hi,

I am trying to extract UMIs from raw FASTQ files, before adapter trimming, using umi_tools. The command I used is:

for i in *.fastq; do umi_tools extract --extract-method=string --stdin=$i --bc-pattern=NNNNNNNN -L extract.log --stdout=./UMI/$i; done

I also ran this with the paired-end options:

$ for i in *_1.fastq; do echo starting $i; umi_tools extract -I $i --bc-pattern=NNNNNNNN --read2-in=${i%%_1.fastq}_2.fastq --stdout=./UMI/$i --read2-out=./UMI/${i%%_1.fastq}_2.fastq; done

However, in both cases I get the following error:

ValueError: Read sequence: NNNNN is shorter than pattern: NNNNNNNN

If I look at the extracted file, then the UMI length looks alright. So, I am not sure why I am getting this error.

$ head 86063_S11_1.UMI.fastq
@A00767:92:HVMCHDRXX:1:2101:12698:1000_TCNCCAGA 1:N:0:TATCGGCT+CATCTCGT
TCCATTCCACACACATCCCTTTTCCCTTCAGAAAAGAACAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF
@A00767:92:HVMCHDRXX:1:2101:14036:1000_ACNGCTTC 1:N:0:TATCGGCT+CATCTCGT
TTTCCCATCCAAGTACTAACCAGGCCCGACCCTGCTTAGCTTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00767:92:HVMCHDRXX:1:2101:14145:1000_GGNGCAGA 1:N:0:TATCGGCT+CATCTCGT
CGTTCGAATGGGTCGTCGCCGCCACGGGGGGCGTGCGATCGGC

Also, the extracted files are very small.

Here is the extract.log from one of the samples when run with only read1. Paired end also looks the same.

# UMI-tools version: 1.1.1
# output generated by extract --extract-method=string --stdin=86063_S11_1.fastq --bc-pattern=NNNNNNNN -L extract.log --stdout=./UMI/86063_S11_1.UMI.fastq
# job started at Mon Nov 23 14:57:22 2020 on 049ebd6653d4 -- d3635cd4-d565-40fe-a882-90eda8ac4f26
# pid: 30801, system: Linux 4.4.0-1065-aws #75-Ubuntu SMP Fri Aug 10 11:14:32 UTC 2018 x86_64
# blacklist                               : None
# compresslevel                           : 6
# correct_umi_threshold                   : 0
# either_read                             : False
# either_read_resolve                     : discard
# error_correct_cell                      : False
# extract_method                          : string
# filter_cell_barcode                     : None
# filter_cell_barcodes                    : False
# filter_umi                              : None
# filtered_out                            : None
# filtered_out2                           : None
# ignore_suffix                           : False
# log2stderr                              : False
# loglevel                                : 1
# pattern                                 : NNNNNNNN
# pattern2                                : None
# prime3                                  : None
# quality_encoding                        : None
# quality_filter_mask                     : None
# quality_filter_threshold                : None
# random_seed                             : None
# read2_in                                : None
# read2_out                               : False
# read2_stdout                            : False
# reads_subset                            : None
# reconcile                               : False
# retain_umi                              : None
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='86063_S11_1.fastq' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='extract.log' mode='a' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='./UMI/86063_S11_1.UMI.fastq' mode='w' encoding='UTF-8'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_correct_log                         : None
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# whitelist                               : None
2020-11-23 14:57:22,495 INFO Starting barcode extraction

Thank you in advance for your help

score 1 · Answer 1 · 2020-11-24

4.6 years ago

i.sudbery 21k

You are getting this error because some of your reads are shorter than the 8nt you specified for the barcode length.
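One way to check this on your own data is to count the reads that are shorter than the barcode pattern. The following is only a sketch: the function name `count_short_reads` is made up for illustration, and it assumes an uncompressed FASTQ file with strict 4-line records.

```shell
# Count reads shorter than a minimum length in an uncompressed FASTQ file.
# Assumes strict 4-line records, so the sequence is line 2 of every record.
count_short_reads() {
    # $1 = FASTQ path, $2 = minimum read length (8 for --bc-pattern=NNNNNNNN)
    awk -v min="$2" 'NR % 4 == 2 && length($0) < min { n++ } END { print n + 0 }' "$1"
}

# Example (file name taken from the question):
# count_short_reads 86063_S11_1.fastq 8
```

Any non-zero count means the string extraction method will hit a read it cannot process.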

If you switch to regex style extraction, then reads that don't match the pattern will be discarded.

umi_tools extract --extract-method=regex --stdin=86063_S11_1.fastq --bc-pattern='(?P<umi_1>.{8})' -L extract.log --stdout=./UMI/86063_S11_1.UMI.fastq --read2-in=read2.fastq.gz --read2-out=read2.out.fastq.gz
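To run the same thing over all samples, the command above can be wrapped in a loop like the one in the question. This is an untested sketch: the function name `extract_pairs`, the `DRY_RUN` switch and the per-sample log file name are my own additions, and it assumes the pairs are named `<sample>_1.fastq` / `<sample>_2.fastq`.

```shell
# Apply regex-based UMI extraction to every read pair in the current directory.
# Assumes pairs are named <sample>_1.fastq and <sample>_2.fastq.
# Set DRY_RUN=1 to print the commands instead of running them.
extract_pairs() {
    mkdir -p ./UMI
    for r1 in *_1.fastq; do
        [ -e "$r1" ] || continue    # nothing matched the glob: skip
        sample=${r1%_1.fastq}
        # Build the command as positional parameters so it can be printed or run.
        set -- umi_tools extract --extract-method=regex \
            --bc-pattern='(?P<umi_1>.{8})' \
            -I "$r1" --read2-in="${sample}_2.fastq" \
            --stdout="./UMI/$r1" --read2-out="./UMI/${sample}_2.fastq" \
            -L "./UMI/${sample}.extract.log"
        if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi
    done
}

# Example: DRY_RUN=1 extract_pairs
```

Printing the commands first with `DRY_RUN=1` makes it easy to confirm the read 1 / read 2 file names pair up as expected before anything is written.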

Your extracted files will be small because generation of the file will have stopped when it encountered the short read.
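You can confirm the truncation by comparing read counts between input and output, and by finding the first short read in the input. Again a sketch only: `count_reads` and `first_short_read` are hypothetical helper names, and both assume uncompressed FASTQ with 4 lines per record.

```shell
# Number of records in a FASTQ file (4 lines per record).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Header of the first read shorter than a minimum length, if there is one.
# $1 = FASTQ path, $2 = minimum read length
first_short_read() {
    awk -v min="$2" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 && length($0) < min { print header; exit }
    ' "$1"
}

# Example (file names taken from the question):
# count_reads 86063_S11_1.fastq
# count_reads ./UMI/86063_S11_1.UMI.fastq
# first_short_read 86063_S11_1.fastq 8
```

If the output holds far fewer reads than the input, and the last reads written sit just before the first short read, that matches the explanation above.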

Thank you. That looks like it is working. However, I have a question: why would the reads be shorter than the specified length? These are the raw files from the sequencer after just bcl2fastq conversion, i.e., I haven't done any trimming yet.

ADD REPLY • link 4.6 years ago by diwasri • 0