UMI deduplication after common sequence (QIAseq miRNA Library Kit)
3.0 years ago
lluc.cabus ▴ 20

Hi everyone,

I have the FASTQ files for some miRNA libraries prepared with the QIAseq miRNA Library Kit. I need to extract the UMIs, but the problem is that the UMI comes after a sequence that is common to all the reads, like this:

NNNNNNNNNNNNNNNNNNNAACTGTAGGCACCATCAATXXXXXXXXXXXXNNNNNNNNN

Here the Ns are the miRNA sequences, AACTGTAGGCACCATCAAT is the common sequence shared by all the reads, and the run of Xs is the UMI.

How could I remove the common sequence and append the UMI to the header of each read in the FASTQ file? I have seen that around 3-5% of the reads don't contain the exact common sequence. I suppose sequencing errors have altered part of it in those reads, but I don't know how to allow for a one-letter change in the common part.

Thank you very much!

RNA-seq miRNA UMI • 3.1k views

For future visitors: while this question has been solved, QIAGEN also makes a set of web-based tools called GeneGlobe available (LINK); they appear to be free as of this writing.

If you cannot use umi-tools on the command line, you can try GeneGlobe to analyse QIAseq miRNA data. The handbook for the QIAseq library kit explains how to use it.

You've got two sets of Ns here - one at the start and one at the end. Are they both miRNA sequences? If not, is it the 3' or the 5' Ns that are the miRNA sequence?

3.0 years ago

You should be able to do this with UMI-tools, using the regex UMI extraction mode.

You can do something like:

umi_tools extract --extract-method=regex \
                  --bc-pattern=".+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12}).+" \
                  -I input.fastq.gz \
                  -S processed.fastq.gz

The {s<=2} means "allow up to two substitutions (mismatches) in the common sequence". Note that this will leave both the Ns at the start of the sequence and the Ns at the end of the sequence intact. If you wish to remove the Ns at the end, then use the regex: .+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.{9})

See more details here: https://umi-tools.readthedocs.io/en/latest/regex.html#regex-regular-expression-mode
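
To make the mismatch-tolerant matching more concrete, here is a minimal standard-library Python sketch of the same idea. It is not how UMI-tools is implemented: the helper names `hamming`, `split_read` and `process_record` are hypothetical, it only tolerates substitutions, it picks the adapter position with the fewest mismatches, and it assumes the UMI is appended to the read name after an underscore.

```python
ADAPTER = "AACTGTAGGCACCATCAAT"  # common sequence from the question
UMI_LEN = 12                     # UMI length from the question


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def split_read(seq, adapter=ADAPTER, umi_len=UMI_LEN, max_mismatches=2):
    """Return (insert, umi), or None if no acceptable adapter match exists.

    Hypothetical helper: scans every window, keeps the one with the fewest
    substitutions (earliest on ties), and requires a full-length UMI after it.
    """
    n = len(adapter)
    best = None  # (mismatches, start)
    # Start at 1 so that at least one base of insert precedes the adapter.
    for start in range(1, len(seq) - n - umi_len + 1):
        d = hamming(seq[start:start + n], adapter)
        if d <= max_mismatches and (best is None or d < best[0]):
            best = (d, start)
    if best is None:
        return None
    start = best[1]
    umi_start = start + n
    return seq[:start], seq[umi_start:umi_start + umi_len]


def process_record(header, seq, qual, max_mismatches=2):
    """Trim one FASTQ record to its insert and move the UMI into the read name.

    Everything after the UMI is dropped. Returns None when no adapter is found.
    """
    hit = split_read(seq, max_mismatches=max_mismatches)
    if hit is None:
        return None
    insert, umi = hit
    name, _, rest = header.partition(" ")
    new_header = f"{name}_{umi}" + (f" {rest}" if rest else "")
    return new_header, insert, qual[:len(insert)]
```

For example, calling `process_record` on a read built as insert + common sequence + 12-base UMI + trailing bases returns the header with the UMI appended to the read name, the insert alone, and the matching stretch of the quality string. Reads with more than `max_mismatches` substitutions in the common sequence return `None`, so they can be counted or written to a separate file.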

Sorry for the confusion: the Ns after the UMI are junk/adapter sequence, so they should be discarded. Thank you very much for the answer!
