Question: Demultiplexing of the Illumina PE data
11 weeks ago by
Denis30
Russia, MSU
Denis30 wrote:

I'm looking for a convenient tool to demultiplex my Illumina PE data, specifically to extract read pairs that have one given sequence in the forward read and another given sequence in the reverse read. Could you advise me, please? For example, initially we have two FASTQ files with forward and reverse reads.

Forward read sequences:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNAGAGCGTATATGCCGAGNNNNNNNN
NNNNNNAGTCCGTATATGGGGAGNNNNNNNN

Reverse read sequences:

NNNNNNNNNGAGATGGACTACTCACNNNNNN
NNNNNNNNNGAGATGGATTACTCACNNNNNN
NNNNNNNNNGAGAAGGACTACTCACNNNNNN

So I'd like to extract only this pair for further analysis:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNNNNGAGATGGACTACTCACNNNNNN

This is because the forward read contains the tag AGTCCGTATATGCCGAG and the reverse read contains the tag GAGATGGACTACTCAC. For now I only need 100% matches.

sequencing next-gen • 388 views
ADD COMMENTlink modified 9 weeks ago by genomax55k • written 11 weeks ago by Denis30

Hi Denis,

It is always useful to provide examples of the input and the desired output to clarify exactly what you are trying to achieve. Are you looking to select a subset of reads containing a certain string? Have you looked at related posts on this forum?

Extract specific reads from FASTQ files based on subsequence

Count and location of strings in fastq file reads

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Sej Modha3.6k

Hi Sej,

I've updated my post to address your points. Thanks!

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Denis30

You can use the prinseq tool with -custom_params and the specific string that you are looking for.

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Sej Modha3.6k

Hello Denis,

Thanks for adding an example. However, your example doesn't look like your real input, as it is neither FASTA nor FASTQ. Furthermore, what does the task you are trying to solve have to do with demultiplexing?

What I read from your description is that you're trying to remove duplicate sequences. This can be done, for example, with seqkit:

$ zcat input.fa.gz | seqkit rmdup -s -o output.fa.gz

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

fin swimmer

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by finswimmer5.3k

Hi fin swimmer!

Many thanks for your reply and for editing the post. No, I'm working with Illumina amplicon data, so I'd like to extract the pairs that contain the PCR primers and discard all other read pairs.

ADD REPLYlink written 11 weeks ago by Denis30

Are these Illumina barcodes or internal barcodes/sequences?

ADD REPLYlink written 11 weeks ago by Devon Ryan84k

They are custom internal PCR primers.

ADD REPLYlink written 11 weeks ago by Denis30

Are the primers more or less always in the same place? I'm wondering if you can use something like umi_tools or a variant of our demultiplexing script for RELACS data to handle this.

ADD REPLYlink written 11 weeks ago by Devon Ryan84k

Yes, sure. The primers are at the 5' end of forward and reverse reads.

ADD REPLYlink written 10 weeks ago by Denis30

Then the options I mentioned should work (possibly with some tweaks) too.
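
Since the primers sit close to the 5' end, a position-restricted check may be all that is needed. Below is a minimal sketch of that idea using only awk; it is not umi_tools or the RELACS script. The helper name `names_with_tag_near_start` and the MAX_START cut-off are made up for illustration, and the sketch assumes uncompressed FASTQ with exactly 4 lines per record and exact matching on the given strand only.

```shell
# names_with_tag_near_start TAG MAX_START < reads.fq
# Hypothetical helper: prints the name (header without "@", up to the first
# whitespace) of every read in which TAG starts at a 1-based position
# no greater than MAX_START. Assumes plain 4-line FASTQ records on stdin.
names_with_tag_near_start() {
    awk -v tag="$1" -v max="$2" '
    NR % 4 == 1 { name = substr($1, 2) }      # header line: remember read name
    NR % 4 == 2 {                             # sequence line
        pos = index($0, tag)                  # 0 if absent, else start of first hit
        if (pos > 0 && pos <= max + 0) print name
    }'
}
```

Running it once on the forward file and once on the reverse file gives two name lists; the pairs to keep are the names present in both.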

ADD REPLYlink written 10 weeks ago by Devon Ryan84k
9 weeks ago by
genomax55k
United States
genomax55k wrote:

Denis: Since you edited this post to bump it to the main page again, I am going to assume that you have not been able to find a solution yet.

I can think of using the filtering option of bbduk.sh in a slightly roundabout way.
Step 1: Filter the R1 reads containing AGTCCGTATATGCCGAG using the literal=AGTCCGTATATGCCGAG outm=file_R1.fq.gz options.
Step 2: Filter the R2 reads containing GAGATGGACTACTCAC using the literal=GAGATGGACTACTCAC outm=file_R2.fq.gz options.
Step 3: Run repair.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=final_R1.fq.gz out2=final_R2.fq.gz repair to re-pair the two filtered files, so that the final files contain only the pairs in which both R1 and R2 matched. (Note: You may need plenty of memory, depending on the size of the data.)
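
For small test files, the same logic (R1 must match, R2 must match, keep only pairs where both mates pass) can be sketched in one pass without BBTools. This is an illustration only, not what bbduk.sh or repair.sh do internally. `filter_pairs` is a hypothetical helper; it assumes uncompressed FASTQ with exactly 4 lines per record, mates listed in the same order in both files, and exact matches on the given strand only.

```shell
# filter_pairs FWD REV R1.fq R2.fq OUT1.fq OUT2.fq
# Hypothetical helper: keeps a pair only if the R1 sequence contains FWD and
# the R2 sequence contains REV (exact substring match, no reverse complement).
filter_pairs() {
    : > "$5"; : > "$6"   # create empty outputs even if no pair matches
    awk -v fwd="$1" -v rev="$2" -v r2="$4" -v o1="$5" -v o2="$6" '
    {
        # Read one 4-line record from R1 (the file awk is iterating over) ...
        h1 = $0
        if ((getline s1) <= 0 || (getline p1) <= 0 || (getline q1) <= 0) exit 1
        # ... and the record at the same position in R2.
        if ((getline h2 < r2) <= 0 || (getline s2 < r2) <= 0 ||
            (getline p2 < r2) <= 0 || (getline q2 < r2) <= 0) exit 1
        # Keep the pair only when both tags are present.
        if (index(s1, fwd) > 0 && index(s2, rev) > 0) {
            print h1 "\n" s1 "\n" p1 "\n" q1 >> o1
            print h2 "\n" s2 "\n" p2 "\n" q2 >> o2
        }
    }' "$3"
}
```

Usage would be along the lines of `filter_pairs AGTCCGTATATGCCGAG GAGATGGACTACTCAC R1.fq R2.fq kept_R1.fq kept_R2.fq`. For large or gzipped data, a dedicated tool such as the ones mentioned in this thread is the more sensible choice.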

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by genomax55k

Hi genomax! Thanks a lot for your help and for providing a feasible solution.

ADD REPLYlink written 9 weeks ago by Denis30
11 weeks ago by
gb310
gb310 wrote:

You could use cutadapt or sabre http://cutadapt.readthedocs.io/en/stable/ https://github.com/najoshi/sabre

There are probably more options

ADD COMMENTlink written 11 weeks ago by gb310

Hi gb,

Thanks for the reply. It seems sabre doesn't support dual-index Illumina data. Am I right? I'll have to check the cutadapt documentation.

ADD REPLYlink written 11 weeks ago by Denis30

This is the demultiplexing part: http://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing

I am not sure about dual indexes, but sabre and cutadapt can both be used for paired-end reads. What kind of data is it? Amplicon sequencing? In that case I usually merge the reads first with FLASH and demultiplex afterwards. If the tools do not support dual indexes, you can perhaps run the process twice: first on the forward index and then on the reverse.

ADD REPLYlink written 11 weeks ago by gb310

Ah! I see now that it is about PCR primers. I already thought so, because the Illumina indexes have often already been trimmed off. The merging that I mentioned makes things easier, but it also depends on the length of the target, so keep that in mind. If your target is 600 bases, there will be no overlap, or not enough, to merge, so in that case it is not a good idea.
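
The overlap point is simple arithmetic: two mates of length L covering a target of length T overlap by roughly 2L - T bases. A minimal sketch follows; `expected_overlap` is a hypothetical helper, and the read lengths in the usage note are assumptions, since the thread does not state the actual run configuration.

```shell
# expected_overlap READ_LEN TARGET_LEN
# Hypothetical helper: prints the expected overlap in bases between two mates
# of length READ_LEN covering a target of TARGET_LEN, or 0 when the mates
# do not reach each other.
expected_overlap() {
    overlap=$(( 2 * $1 - $2 ))
    if [ "$overlap" -lt 0 ]; then
        overlap=0
    fi
    echo "$overlap"
}
```

For instance, `expected_overlap 300 600` prints 0 (assuming 2x300 reads on a 600-base target, nothing to merge on), while `expected_overlap 300 450` prints 150, which leaves a merger plenty to work with.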

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by gb310