Hi,
I am trying to extract UMIs from raw FASTQ files before adapter trimming using umi_tools. The command I used is:
for i in *.fastq; do umi_tools extract --extract-method=string --stdin="$i" --bc-pattern=NNNNNNNN -L extract.log --stdout="./UMI/$i"; done
I also ran this with the paired-end option:
$ for i in *_1.fastq; do echo "starting $i"; umi_tools extract -I "$i" --bc-pattern=NNNNNNNN --read2-in="${i%%_1.fastq}_2.fastq" --stdout="./UMI/$i" --read2-out="./UMI/${i%%_1.fastq}_2.fastq"; done
However, I get the following error:
ValueError: Read sequence: NNNNN is shorter than pattern: NNNNNNNN
If I look at the extracted file, the UMI length looks right, so I am not sure why I am getting this error.
$ head 86063_S11_1.UMI.fastq
@A00767:92:HVMCHDRXX:1:2101:12698:1000_TCNCCAGA 1:N:0:TATCGGCT+CATCTCGT
TCCATTCCACACACATCCCTTTTCCCTTCAGAAAAGAACAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF
@A00767:92:HVMCHDRXX:1:2101:14036:1000_ACNGCTTC 1:N:0:TATCGGCT+CATCTCGT
TTTCCCATCCAAGTACTAACCAGGCCCGACCCTGCTTAGCTTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00767:92:HVMCHDRXX:1:2101:14145:1000_GGNGCAGA 1:N:0:TATCGGCT+CATCTCGT
CGTTCGAATGGGTCGTCGCCGCCACGGGGGGCGTGCGATCGGC
Also, the extracted files are very small.
Here is the extract.log from one of the samples when run with read 1 only. The paired-end log looks the same.
# UMI-tools version: 1.1.1
# output generated by extract --extract-method=string --stdin=86063_S11_1.fastq --bc-pattern=NNNNNNNN -L extract.log --stdout=./UMI/86063_S11_1.UMI.fastq
# job started at Mon Nov 23 14:57:22 2020 on 049ebd6653d4 -- d3635cd4-d565-40fe-a882-90eda8ac4f26
# pid: 30801, system: Linux 4.4.0-1065-aws #75-Ubuntu SMP Fri Aug 10 11:14:32 UTC 2018 x86_64
# blacklist : None
# compresslevel : 6
# correct_umi_threshold : 0
# either_read : False
# either_read_resolve : discard
# error_correct_cell : False
# extract_method : string
# filter_cell_barcode : None
# filter_cell_barcodes : False
# filter_umi : None
# filtered_out : None
# filtered_out2 : None
# ignore_suffix : False
# log2stderr : False
# loglevel : 1
# pattern : NNNNNNNN
# pattern2 : None
# prime3 : None
# quality_encoding : None
# quality_filter_mask : None
# quality_filter_threshold : None
# random_seed : None
# read2_in : None
# read2_out : False
# read2_stdout : False
# reads_subset : None
# reconcile : False
# retain_umi : None
# short_help : None
# stderr : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin : <_io.TextIOWrapper name='86063_S11_1.fastq' mode='r' encoding='UTF-8'>
# stdlog : <_io.TextIOWrapper name='extract.log' mode='a' encoding='UTF-8'>
# stdout : <_io.TextIOWrapper name='./UMI/86063_S11_1.UMI.fastq' mode='w' encoding='UTF-8'>
# timeit_file : None
# timeit_header : None
# timeit_name : all
# tmpdir : None
# umi_correct_log : None
# umi_whitelist : None
# umi_whitelist_paired : None
# whitelist : None
2020-11-23 14:57:22,495 INFO Starting barcode extraction
Thank you in advance for your help.
Thank you. That looks like it is working. However, I have a question: why would the reads be shorter than the specified length? These are the raw files from the sequencer after only bcl2fastq conversion, i.e., I haven't done any trimming yet.
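To see how widespread the short reads are, one option is to count the reads in each input file whose sequence is shorter than the 8 nt pattern. The sketch below is only an illustration: the helper name count_short_reads is hypothetical (it is not part of umi_tools), and it assumes uncompressed FASTQ files with strict 4-line records.

```shell
# Hypothetical helper (not part of umi_tools): print the number of reads
# whose sequence is shorter than a minimum length.
# Assumes an uncompressed FASTQ file with exactly 4 lines per record,
# so the sequence is always line 2 of each group of 4.
#   $1 = FASTQ file, $2 = minimum length (defaults to 8, the UMI pattern length)
count_short_reads() {
    awk -v n="${2:-8}" '
        NR % 4 == 2 && length($0) < n { count++ }
        END { print count + 0 }
    ' "$1"
}

# Example usage over all input files (commented out so the block is safe to source):
# for f in *.fastq; do echo "$f: $(count_short_reads "$f" 8) reads shorter than 8 nt"; done
```

A non-zero count for a file would be consistent with the ValueError above, since the error message reports a 5 nt read against an 8 nt pattern.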