Question: Demultiplexing of the Illumina PE data
Denis wrote:

I'm looking for a convenient tool to demultiplex my Illumina PE data, specifically to extract read pairs that have a certain sequence in the forward read and another certain sequence in the reverse read. Could you advise me, please? For example, initially we have two FASTQ files with forward and reverse reads.

Forward read sequences:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNAGAGCGTATATGCCGAGNNNNNNNN
NNNNNNAGTCCGTATATGGGGAGNNNNNNNN

Reverse read sequences:

NNNNNNNNNGAGATGGACTACTCACNNNNNN
NNNNNNNNNGAGATGGATTACTCACNNNNNN
NNNNNNNNNGAGAAGGACTACTCACNNNNNN

So I'd like to extract only the following pair for further analysis:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNNNNGAGATGGACTACTCACNNNNNN

This is because the forward read contains the AGTCCGTATATGCCGAG tag and the reverse read contains the GAGATGGACTACTCAC tag. For now I need only 100% matches.

Tags: sequencing, next-gen
modified 2.1 years ago by genomax • written 2.1 years ago by Denis

Hi Denis,

It is always useful to provide examples of input and desired output to clarify exactly what you are trying to achieve. Are you looking to select a subset of reads with a certain string? Have you looked at related posts on this forum?

Extract specific reads from FASTQ files based on subsequence

Count and location of strings in fastq file reads

modified 2.1 years ago • written 2.1 years ago by Sej Modha

Hi Sej,

I've updated my post to address your points. Thanks!

modified 2.1 years ago • written 2.1 years ago by Denis

You can use the prinseq tool with the -custom_params option and the specific string that you are looking for.

modified 2.1 years ago • written 2.1 years ago by Sej Modha

Hello Denis,

Thanks for adding an example. However, it doesn't look like your real input, as it is neither FASTA nor FASTQ. Furthermore, what does the task you are trying to solve have to do with demultiplexing?

What I understand from your description is that you're trying to remove duplicate sequences. This can be done, for example, with seqkit:

$ zcat input.fa.gz | seqkit rmdup -s -o output.fa.gz

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

fin swimmer

modified 2.1 years ago • written 2.1 years ago by finswimmer

Hi fin swimmer!

Many thanks for your reply and for editing the post. No, I'm not removing duplicates. I'm working with Illumina amplicon data, so I'd like to extract the pairs that contain the PCR primers and discard all the other read pairs.

written 2.1 years ago by Denis

Are these Illumina barcodes or internal barcodes/sequences?

written 2.1 years ago by Devon Ryan

They are custom internal PCR primers.

written 2.1 years ago by Denis

Are the primers more or less always in the same place? I'm wondering whether you can use something like umi_tools or a variant of our demultiplexing script for RELACS data to handle this.

written 2.1 years ago by Devon Ryan

Yes, sure. The primers are at the 5' end of forward and reverse reads.

written 2.1 years ago by Denis

Then the options I mentioned should work (possibly with some tweaks) too.
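If the primers always sit near the 5' end, the match can be restricted to the start of each read instead of searching the whole sequence. Below is a minimal sketch in plain Python (it is not umi_tools or the RELACS script). The function names and the `max_offset` parameter are hypothetical, and matching is exact:

```python
def primer_at_5prime(seq, primer, max_offset=10):
    """Return True if `primer` starts within the first `max_offset` + 1 bases of `seq`."""
    # str.find only searches seq[0:max_offset + len(primer)], so any hit
    # must begin at index 0..max_offset. No mismatches are tolerated.
    return seq.find(primer, 0, max_offset + len(primer)) != -1


def pair_has_primers(fwd_seq, rev_seq, fwd_primer, rev_primer, max_offset=10):
    """Keep a pair only if both reads carry their primer near the 5' end."""
    return (primer_at_5prime(fwd_seq, fwd_primer, max_offset)
            and primer_at_5prime(rev_seq, rev_primer, max_offset))
```

In the example reads from the question the tags start after 6 and 9 leading bases, so `max_offset` would need to be at least 9 for that pair to pass.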

written 2.1 years ago by Devon Ryan

genomax wrote:

Denis: Since you edited this post to bump it to the main page again, I am going to assume that you have not been able to find a solution yet.

One option is to use the filtering mode of bbduk.sh (see the BBDuk guide) in a slightly roundabout way:
Step 1: Run bbduk.sh on the R1 file with literal=AGTCCGTATATGCCGAG outm=file_R1.fq.gz to keep only the R1 reads containing AGTCCGTATATGCCGAG.
Step 2: Run bbduk.sh on the R2 file with literal=GAGATGGACTACTCAC outm=file_R2.fq.gz to keep only the R2 reads containing GAGATGGACTACTCAC.
Step 3: Run repair.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=final_R1.fq.gz out2=final_R2.fq.gz repair to re-pair the two filtered files, so that the final files contain only pairs in which both R1 and R2 matched. (Note: you may need plenty of memory, depending on the size of the data.)
Note: bbduk.sh defaults to k=27 and does not find sequences shorter than k, so k probably has to be lowered to the tag length (k=17 for the R1 tag, k=16 for the R2 tag).
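For smaller datasets, the three steps above can be collapsed into a single pass in plain Python. This is only a minimal sketch and not part of BBTools: the names `read_fastq` and `matching_pairs` are made up for illustration, and it assumes four-line FASTQ records with the mates listed in the same order in both files.

```python
from itertools import islice

FWD_TAG = "AGTCCGTATATGCCGAG"  # tag expected in the forward (R1) read
REV_TAG = "GAGATGGACTACTCAC"   # tag expected in the reverse (R2) read


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    lines = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(lines, 4))
        if len(record) < 4:  # end of input (or a truncated last record)
            return
        yield record


def matching_pairs(r1_lines, r2_lines, fwd_tag=FWD_TAG, rev_tag=REV_TAG):
    """Yield (r1_record, r2_record) for pairs where both tags match exactly.

    Assumes R1 and R2 list the mates in the same order.
    """
    for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        if fwd_tag in rec1[1] and rev_tag in rec2[1]:
            yield rec1, rec2
```

Open file handles can be passed directly as `r1_lines` and `r2_lines` (for gzipped files, `gzip.open(path, "rt")`), and each yielded record can be written back out with `"\n".join(record) + "\n"`.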

modified 2.1 years ago • written 2.1 years ago by genomax

Hi genomax! Your help and the feasible solution are much appreciated.

written 2.1 years ago by Denis

gb wrote:

You could use cutadapt (http://cutadapt.readthedocs.io/en/stable/) or sabre (https://github.com/najoshi/sabre).

There are probably more options.

written 2.1 years ago by gb

Hi gb,

Thanks for the reply. It seems sabre doesn't support Illumina dual indexing. Am I right? I'll have to check the cutadapt documentation.

written 2.1 years ago by Denis

This is the demultiplexing section: http://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing

I am not sure about dual indexes, but sabre and cutadapt can both be used for paired-end reads. What kind of data is it? Amplicon sequencing? In that case I usually merge the reads first with FLASH and demultiplex afterwards. If the tools do not support dual indexes, you could run the process twice: first on the forward index and then on the reverse.

written 2.1 years ago by gb

Ah! I see now that this is about PCR primers. I already thought so, because the Illumina indexes have often been trimmed off already. The merging that I mentioned makes things easier, but whether it works depends on the length of the target, so keep that in mind. If your target is 600 bases, there will be no overlap, or not enough, to merge the reads, so in that case it is not a good idea.
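The length caveat comes down to simple arithmetic. Below is a minimal sketch, assuming both mates have the same length and the amplicon is at least as long as one read. The function names and the `min_overlap=10` threshold are assumptions for illustration, so check the setting of whichever merging tool you use:

```python
def expected_overlap(read_length, amplicon_length):
    """Bases covered by both mates of a pair: 2 * read_length - amplicon_length.

    Assumes amplicon_length >= read_length. A result of zero or less
    means the two mates do not overlap at all.
    """
    return 2 * read_length - amplicon_length


def can_merge(read_length, amplicon_length, min_overlap=10):
    """True if the expected overlap reaches the (hypothetical) minimum."""
    return expected_overlap(read_length, amplicon_length) >= min_overlap
```

For instance, with 2 x 300 bp reads (an assumed read length, not stated in this thread) a 600-base target gives `expected_overlap(300, 600) == 0`, so there is nothing to merge, while a 450-base target would leave 150 overlapping bases.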

modified 2.1 years ago • written 2.1 years ago by gb