Question: UMIextract giving ValueError: Read sequence: NNNNN is shorter than pattern: NNNNNNNN
diwasri wrote (7 weeks ago):

Hi,

I am trying to extract UMIs from raw FASTQ files, before adapter trimming, using umi_tools. The command I used is:

for i in *.fastq; do umi_tools extract --extract-method=string --stdin=$i --bc-pattern=NNNNNNNN -L extract.log --stdout=./UMI/$i; done

I also ran this with the paired-end option:

$ for i in *_1.fastq; do echo starting $i; umi_tools extract -I $i --bc-pattern=NNNNNNNN --read2-in=${i%%_1.fastq}_2.fastq --stdout=./UMI/$i --read2-out=./UMI/${i%%_1.fastq}_2.fastq; done

However, I get the following error:

ValueError: Read sequence: NNNNN is shorter than pattern: NNNNNNNN

If I look at the extracted file, the UMI length looks all right, so I am not sure why I am getting this error.

$ head 86063_S11_1.UMI.fastq
@A00767:92:HVMCHDRXX:1:2101:12698:1000_TCNCCAGA 1:N:0:TATCGGCT+CATCTCGT
TCCATTCCACACACATCCCTTTTCCCTTCAGAAAAGAACAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF
@A00767:92:HVMCHDRXX:1:2101:14036:1000_ACNGCTTC 1:N:0:TATCGGCT+CATCTCGT
TTTCCCATCCAAGTACTAACCAGGCCCGACCCTGCTTAGCTTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00767:92:HVMCHDRXX:1:2101:14145:1000_GGNGCAGA 1:N:0:TATCGGCT+CATCTCGT
CGTTCGAATGGGTCGTCGCCGCCACGGGGGGCGTGCGATCGGC

Also, the extracted files are very small.

Here is the extract.log from one of the samples when run with read 1 only. The paired-end log looks the same.

# UMI-tools version: 1.1.1
# output generated by extract --extract-method=string --stdin=86063_S11_1.fastq --bc-pattern=NNNNNNNN -L extract.log --stdout=./UMI/86063_S11_1.UMI.fastq
# job started at Mon Nov 23 14:57:22 2020 on 049ebd6653d4 -- d3635cd4-d565-40fe-a882-90eda8ac4f26
# pid: 30801, system: Linux 4.4.0-1065-aws #75-Ubuntu SMP Fri Aug 10 11:14:32 UTC 2018 x86_64
# blacklist                               : None
# compresslevel                           : 6
# correct_umi_threshold                   : 0
# either_read                             : False
# either_read_resolve                     : discard
# error_correct_cell                      : False
# extract_method                          : string
# filter_cell_barcode                     : None
# filter_cell_barcodes                    : False
# filter_umi                              : None
# filtered_out                            : None
# filtered_out2                           : None
# ignore_suffix                           : False
# log2stderr                              : False
# loglevel                                : 1
# pattern                                 : NNNNNNNN
# pattern2                                : None
# prime3                                  : None
# quality_encoding                        : None
# quality_filter_mask                     : None
# quality_filter_threshold                : None
# random_seed                             : None
# read2_in                                : None
# read2_out                               : False
# read2_stdout                            : False
# reads_subset                            : None
# reconcile                               : False
# retain_umi                              : None
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='86063_S11_1.fastq' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='extract.log' mode='a' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='./UMI/86063_S11_1.UMI.fastq' mode='w' encoding='UTF-8'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_correct_log                         : None
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# whitelist                               : None
2020-11-23 14:57:22,495 INFO Starting barcode extraction

Thank you in advance for your help.

Tags: rna-seq, umi

i.sudbery (Sheffield, UK) wrote (7 weeks ago):

You are getting this error because some of your reads are shorter than the 8 nt you specified for the barcode length.
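One way to confirm this is to count how many reads in the input are shorter than the barcode pattern. The following is only a minimal sketch, not part of umi_tools: the function name `count_short_reads` is made up for illustration, and it assumes an uncompressed FASTQ with exactly four lines per record.

```python
import io


def count_short_reads(handle, min_len=8):
    """Return (total_reads, short_reads) for a FASTQ stream.

    Assumes exactly four lines per record, so the sequence is
    always the second line of each record.
    """
    total = 0
    short = 0
    for line_no, line in enumerate(handle):
        if line_no % 4 == 1:  # sequence line of the record
            total += 1
            if len(line.rstrip("\r\n")) < min_len:
                short += 1
    return total, short


# Tiny in-memory example with made-up reads: one long read, one 5 nt read.
example = io.StringIO(
    "@r1\nACGTACGTAC\n+\nFFFFFFFFFF\n"
    "@r2\nNNNNN\n+\nFFFFF\n"
)
print(count_short_reads(example))  # (2, 1)
```

For a real file, pass an open handle instead, e.g. `with open("86063_S11_1.fastq") as fh: print(count_short_reads(fh))`. A non-zero second number would mean there are reads that string-style extraction cannot handle.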

If you switch to regex style extraction, then reads that don't match the pattern will be discarded.

umi_tools extract --extract-method=regex --stdin=86063_S11_1.fastq --bc-pattern='(?P<umi_1>.{8})' -L extract.log --stdout=./UMI/86063_S11_1.UMI.fastq --read2-in=read2.fastq.gz --read2-out=read2.out.fastq.gz
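To see why the regex pattern skips short reads rather than raising an error, here is a rough illustration using Python's standard `re` module. It is a sketch only: umi_tools itself does considerably more (quality strings, read names, paired reads), the helper `split_umi` is hypothetical, and I am assuming the pattern is applied from the start of the read.

```python
import re

# Same pattern as in the command above: a named group capturing 8 characters.
UMI_PATTERN = re.compile(r"(?P<umi_1>.{8})")


def split_umi(read):
    """Return (umi, remaining_sequence), or None if the read cannot match."""
    match = UMI_PATTERN.match(read)
    if match is None:
        # Fewer than 8 bases: no match, so the read would simply be dropped.
        return None
    return match.group("umi_1"), read[match.end():]


print(split_umi("TCNCCAGATCCATTCC"))  # ('TCNCCAGA', 'TCCATTCC')
print(split_umi("NNNNN"))             # None
```

A 5 nt read such as NNNNN cannot satisfy `.{8}`, so there is no match and nothing to fail on, whereas the string method treats the same read as an error.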

Your extracted files are small because writing of the output stopped when it encountered the short read.


Thank you. That looks like it is working. However, I have a question: why would the reads be shorter than the specified length? These are the raw files from the sequencer after just bcl2fastq conversion, i.e., I haven't done any trimming yet.

(Reply written 7 weeks ago by diwasri)