UMI-tools no tsv files at deduplication
1
0
Entering edit mode
3 months ago
Sarah • 0

Hi all,

I'm very new to bioinformatics and am currently using UMI-tools to look at my recently sequencing libraries. I have a few steps I'm struggling with:

1st question- Using the UMI-tools guide, I extracted the UMIs from my R1 file, mapped, indexed and sorted my file. Now, when I deduplicate, I only get a resulting bam file and no tsv files. When I did this using the UMI-tools example files, I got a bam file + 3 tsv files. How do I go about generating this? The command line I am using is:

$ umi_tools dedup -I Sorted.bam --output-stats=deduplicated -S deduplicated.bam

2nd question- I have seven libraries in total. I have a R1 and R2 fastq file generated from each library so 14 fastq files in total. Can I process all 14 at once, or at least the R1 and R2 of each library together? I used the command from the UMI-tools guide for paired reads but was unsuccessful. Their website provides the command below:

$ umi_tools extract -I pair.1.fastq.gz --bc-pattern=NNNXXXXNN \ 
  --read2-in=pair.2.fastq.gz --stdout=processed.1.fastq.gz \
  --read2-out=processed.2.fastq.gz

Here is how I used it:

$ umi_tools extract -I FilenameR1.fastq.gz --bc-pattern=NNNXXXXNN \ 
  --read2-in=FilenameR2.fastq.gz --stdout=processed.FilenameR1.fastq.gz \
  --read2-out=processed.FilenameR2.fastq.gz

Have I made a mistake here?

Thanks in advance!!

deduplication NGS UMI DNA-seq UMI-tools • 644 views
ADD COMMENT
0
Entering edit mode

When you say it was unsucessful, what do you mean? Was there an error message?

You are also going to want to use a --bc-pattern that matches your read geometry. And unless you are analysing an iCLIP experiment from circa 2012, I doubt its NNNXXXXNN

ADD REPLY
0
Entering edit mode

Also, for the dedup, check that the job is finishing. The logging output should end with # job finished. It is doesn't, that suggests that umi_tools isn't finishing before it produces the stats files. Could be due to being killed by an OOM killer, or by a cluster resource manager.

ADD REPLY
0
Entering edit mode

In regards to the dedup step for my unpaired file- yes, it appears that umi-tools isn't finsihing as I get the following error at the end of a message with mostly the file parameters:

ValueError: file does not have a valid header (mode='rb') - is it BAM/CRAM format?

My input file is in bam format and is not empty. I'm not sure why it wouldn't have a header if I followed the previous commands from the guideline correctly.

ADD REPLY
0
Entering edit mode

Hmmm... that is starange. Have you run samtools quickcheck on the BAM file? And if that passes, how about samtools view -H this should output the header if there is one.

ADD REPLY
0
Entering edit mode

Sorry, by unsuccessful, I meant that the first fastq file text appeared in my terminal as a response. The start of the message was:

    # UMI-tools version: 1.1.4
    # output generated by extract --stdin=NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --bc-pattern=NNNNNNNNNN  

And the end was:

## 2024-01-17 16:00:14,317 INFO Reads output: 629590
# job finished in 11 seconds at Wed Jan 17 16:00:14 2024 --  7.48  1.97  0.00  0.00 -- ee347034-0ee5-4cff-b8a6-cefb72b040a8
zsh: command not found: --read2-in=

Also, yes, sorry, I hadn't put the correct pattern into the command above as I was using the example command. What I used was:

umi_tools extract -I NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --bc-pattern=NNNNNNNNNN \ 
  --read2-in= NPM1A-OCI-100PCT-CP01_S22_L001_R2_001.fastq.gz --stdout= Processed_NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz \
  --read2-out=Processed_NPM1A-OCI-100PCT-CP01_S22_L001_R2_001.2.fastq.gz

Today, I tried the following command which only generated a log file and a fastq.gz file for the R1 file:

umi_tools extract --stdin=NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --read2-in=NPM1A-OCI-100PCT-CP01_S22_L001_R2_001.fastq.gz --bc-pattern= NNNNNNNNNN -L processed.log --stdout=processed_NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --read2-out=Processed_NPM1A-OCI-100PCT-CP01_S22_L001_R2_001.2.fastq.gz

I was expecting two output files but that might be incorrect as when I get to paired de-duplication, the command only includes one input file. Sorry if I've confused you!

ADD REPLY
0
Entering edit mode

Sarah why did you delete this post?

ADD REPLY
0
Entering edit mode
3 months ago

Your error with extract is being caused by the whole command not being passed to UMI-tools.

You can tell this in two ways:

The first is the second line of the log file, which, according to your comment, reads:

# output generated by extract --stdin=NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --bc-pattern=NNNNNNNNNN

Not that the command stops after the bc-pattern option.

The second is the zsh error at the end of the output:

zsh: command not found: --read2-in=

This suggests that the second line of your statenent is being treated as a seperate command.

This usually happens if you have used a multi-line statement with line continuations (as the quick-start guide uses), but spaces after the \. That is, you have "\ " at the end of the line, rather than "\".

ADD COMMENT

Login before adding your answer.

Traffic: 1690 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6