Question: Demultiplexing of the Illumina PE data
Denis (30) wrote, 4 months ago:

I'm looking for a convenient tool to demultiplex my Illumina PE data, specifically to extract read pairs that have one particular sequence in the forward read and another particular sequence in the reverse read. Could you advise me, please? For example, we start with two FASTQ files containing the forward and reverse reads.

Forward read sequences:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNAGAGCGTATATGCCGAGNNNNNNNN
NNNNNNAGTCCGTATATGGGGAGNNNNNNNN

Reverse read sequences:

NNNNNNNNNGAGATGGACTACTCACNNNNNN
NNNNNNNNNGAGATGGATTACTCACNNNNNN
NNNNNNNNNGAGAAGGACTACTCACNNNNNN

So I'd like to extract only this pair for further analysis:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNNNNGAGATGGACTACTCACNNNNNN

This is because the forward read contains the tag AGTCCGTATATGCCGAG and the reverse read contains the tag GAGATGGACTACTCAC. For now I need only 100% matches.

sequencing next-gen • 532 views
modified 4 months ago by genomax (58k) • written 4 months ago by Denis (30)

Hi Denis,

It is always useful to provide examples of the input and the desired output to clarify exactly what you are trying to achieve. Are you looking to select a subset of reads containing a certain string? Have you looked at related posts on this forum?

Extract specific reads from FASTQ files based on subsequence

Count and location of strings in fastq file reads

modified 4 months ago • written 4 months ago by Sej Modha (3.8k)

Hi Sej,

I've updated my post to address your points. Thanks!

modified 4 months ago • written 4 months ago by Denis (30)

You can use the prinseq tool with the -custom_params option and the specific string that you are looking for.

modified 4 months ago • written 4 months ago by Sej Modha (3.8k)

Hello Denis,

Thanks for adding an example. However, your example doesn't look like your real input, as it is neither FASTA nor FASTQ. Furthermore, what does the task you are trying to solve have to do with demultiplexing?

From your description, I gather that you're trying to remove duplicate sequences. This can be done with seqkit, for example:

$ zcat input.fa.gz | seqkit rmdup -s -o output.fa.gz

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

fin swimmer

modified 4 months ago • written 4 months ago by finswimmer (6.7k)

Hi fin swimmer!

Many thanks for your reply and for editing my post. No, I'm working with Illumina amplicon data, so I'd like to extract the read pairs that contain the PCR primers and discard all the other read pairs.

written 4 months ago by Denis (30)

Are these Illumina barcodes or internal barcodes/sequences?

written 4 months ago by Devon Ryan (86k)

They are custom internal PCR primers.

written 4 months ago by Denis (30)

Are the primers more or less always in the same place? I'm wondering whether you could use something like umi_tools, or a variant of our demultiplexing script for RELACS data, to handle this.

written 4 months ago by Devon Ryan (86k)

Yes, sure. The primers are at the 5' end of the forward and reverse reads.

written 4 months ago by Denis (30)

Then the options I mentioned should work (possibly with some tweaks) too.
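
Since the primers sit near the 5' end of each read, one possible tweak is to accept a primer only when it starts within the first few bases of the read. The following is a minimal Python sketch of that check. The function name `primer_near_5prime` and the default `max_offset` of 10 are illustrative assumptions (they are not part of umi_tools or the RELACS script), and matching is exact:

```python
def primer_near_5prime(read_seq, primer, max_offset=10):
    """Return True if `primer` starts at position <= `max_offset` (0-based) in `read_seq`.

    Hypothetical helper: exact matching only, no mismatches or IUPAC codes.
    """
    # str.find only reports a hit that lies entirely inside [0, end),
    # so a reported hit can start no later than max_offset.
    end = max_offset + len(primer)
    return read_seq.find(primer, 0, end) != -1


# Sequences taken from the question above.
forward = "NNNNNNAGTCCGTATATGCCGAGNNNNNNNN"
reverse = "NNNNNNNNNGAGATGGACTACTCACNNNNNN"

# A pair is kept only if both reads carry their primer near the 5' end.
keep_pair = (primer_near_5prime(forward, "AGTCCGTATATGCCGAG")
             and primer_near_5prime(reverse, "GAGATGGACTACTCAC"))
```

Lowering `max_offset` makes the filter stricter about where the primer may start; the right value depends on how many leading bases precede the primers in the real reads.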

written 4 months ago by Devon Ryan (86k)

genomax (58k, United States) wrote, 4 months ago:

Denis: since you edited this post to bump it to the main page again, I am going to assume that you have not been able to find a solution yet.

One option, although slightly roundabout, is to use the filtering mode of bbduk.sh (see the BBDuk guide).
Step 1: Filter the R1 reads containing AGTCCGTATATGCCGAG using the literal=AGTCCGTATATGCCGAG outm=file_R1.fq.gz options.
Step 2: Filter the R2 reads containing GAGATGGACTACTCAC using the literal=GAGATGGACTACTCAC outm=file_R2.fq.gz options.
Step 3: Run repair.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=final_R1.fq.gz out2=final_R2.fq.gz repair to re-pair the two filtered files, so that the final R1/R2 files contain only the pairs in which both reads matched. (Notes: both tags are shorter than BBDuk's default k-mer length of 27, so in steps 1 and 2 you will probably also need to set k to a value no larger than the tag length. For step 3 you may need plenty of memory, depending on the size of the data.)
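
The net effect of the three steps (keep only the pairs in which R1 contains the forward tag and R2 contains the reverse tag) can also be sketched in plain Python. This is not how bbduk.sh or repair.sh work internally; it is a minimal illustration that assumes both FASTQ files list the reads in the same order, that every record spans exactly four lines, and that exact matching on the given strand is what you want. The function names are hypothetical:

```python
import gzip
from itertools import islice

FWD_TAG = "AGTCCGTATATGCCGAG"  # tag expected in the forward read (from the question)
REV_TAG = "GAGATGGACTACTCAC"   # tag expected in the reverse read (from the question)


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file, judged by the .gz extension."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def fastq_records(handle):
    """Yield FASTQ records as lists of four lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def filter_pairs(r1_in, r2_in, r1_out, r2_out, fwd_tag=FWD_TAG, rev_tag=REV_TAG):
    """Write only the pairs whose R1 contains fwd_tag and whose R2 contains rev_tag.

    Returns the number of pairs kept. Matching is exact (no mismatches, no
    reverse-complement search).
    """
    kept = 0
    with open_text(r1_in) as f1, open_text(r2_in) as f2, \
            open_text(r1_out, "wt") as o1, open_text(r2_out, "wt") as o2:
        # Reads are walked in lockstep, so record i of R1 is paired with record i of R2.
        for rec1, rec2 in zip(fastq_records(f1), fastq_records(f2)):
            if fwd_tag in rec1[1] and rev_tag in rec2[1]:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

A call such as `filter_pairs("reads_R1.fq.gz", "reads_R2.fq.gz", "final_R1.fq.gz", "final_R2.fq.gz")` (placeholder file names) streams through the data one pair at a time, so memory use stays small, at the cost of being much slower than the BBTools route on large files.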

modified 4 months ago • written 4 months ago by genomax (58k)

Hi genomax! Thank you very much for your help and for providing a feasible solution.

written 4 months ago by Denis (30)

gb (450) wrote, 4 months ago:

You could use cutadapt (http://cutadapt.readthedocs.io/en/stable/) or sabre (https://github.com/najoshi/sabre).

There are probably more options.

written 4 months ago by gb (450)

Hi gb,

Thanks for the reply. It seems sabre doesn't support Illumina dual indexing. Am I right? I will have to check the cutadapt documentation.

written 4 months ago by Denis (30)

This is the demultiplexing section: http://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing

I am not sure about dual indexes, but both sabre and cutadapt can be used with paired-end reads. What kind of data is it? Amplicon sequencing? In that case, I usually merge the reads first with FLASH and demultiplex afterwards. If the tools do not support dual indexes, you could perhaps run the process twice: first on the forward index, then on the reverse.

written 4 months ago by gb (450)

Ah! I see now that it is about PCR primers. I suspected as much, because the Illumina indexes have often already been trimmed off. The merging that I mentioned makes things easier, but whether it works depends on the length of the target, so keep that in mind. If your target is 600 bases, there will be no overlap, or not enough, to merge the reads, so in that case it is not a good idea.
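
As a rough check of whether merging is feasible, the expected overlap of a read pair is the combined read length minus the amplicon length. Below is a small Python sketch of that arithmetic. The 300-base read length and the 10-base minimum overlap are assumed values chosen for illustration, so substitute the numbers for your own run and merging tool:

```python
def expected_overlap(read_length, amplicon_length):
    """Number of amplicon bases covered by both reads of a pair (0 if none).

    Assumes both reads have the same length and start at opposite amplicon ends.
    """
    # The overlap cannot exceed the amplicon itself (reads longer than the amplicon)
    # and cannot be negative (amplicon longer than both reads combined).
    return max(0, min(amplicon_length, 2 * read_length - amplicon_length))


def can_merge(read_length, amplicon_length, min_overlap=10):
    """True if the pair overlaps by at least `min_overlap` bases (assumed threshold)."""
    return expected_overlap(read_length, amplicon_length) >= min_overlap


# Assumed 2x300 reads: a 600-base target leaves nothing to merge on,
# while a 450-base target leaves a comfortable overlap.
print(expected_overlap(300, 600), can_merge(300, 600))
print(expected_overlap(300, 450), can_merge(300, 450))
```

In practice the usable overlap is smaller than this estimate whenever low-quality 3' ends are trimmed before merging.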

modified 4 months ago • written 4 months ago by gb (450)