Question: Separate the reverse sequence from R1 and corresponding forward seq from R2
gravatar for Sajan Raju
21 months ago by
Sajan Raju0
Sajan Raju0 wrote:


Recently we have sequenced (V3-V4 region of16S) some samples with half of amplicon with adapters switched.

Using the F primer 5' CCTACGGGNGGCWGCAG 3' & Rev primer 5' GACTACHVGGGTATCTAATCC 3'.

So now we have both forward and reverse reads in both R1 and R2 files.

For example: *R1.fastq

@HWI-1KL166:431:HWKJKBCXY:1:1101:5020:2038 1:N:0:TCTAGACTCGTCGCTA
@HWI-1KL166:431:HWKJKBCXY:1:1101:6494:2219 1:N:0:TCTAGACTCGTCGCTA
@HWI-1KL166:431:HWKJKBCXY:1:1101:11712:2422 1:N:0:TCTAGACTCGTCGCTA
@HWI-1KL166:431:HWKJKBCXY:1:1101:14311:2424 1:N:0:TCTAGACTCGTCGCTA
@HWI-1KL166:431:HWKJKBCXY:1:1101:8278:3000 1:N:0:TCTAGACTCGTCGCTA

And same for *R2.fastq

 @HWI-1KL166:431:HWKJKBCXY:1:1101:5020:2038 2:N:0:TCTAGACTCGTCGCTA
@HWI-1KL166:431:HWKJKBCXY:1:1101:6494:2219 2:N:0:TCTAGACTCGTCGCTA
@HWI-1KL166:431:HWKJKBCXY:1:1101:11712:2422 2:N:0:TCTAGACTCGTCGCTA
@HWI-1KL166:431:HWKJKBCXY:1:1101:14311:2424 2:N:0:TCTAGACTCGTCGCTA
@HWI-1KL166:431:HWKJKBCXY:1:1101:8278:3000 2:N:0:TCTAGACTCGTCGCTA

Here 4th and 5th sequences are actually reverse and fwd reads in R1 and R2 resp.

Now, the challenging part is to separate the R1 and R2 reads. I have tried to extract the sequences matching the primers sequences (even with few basepair of primers).

grep -A 2 -B 1 'CCTA' *L001_R1_001.fastq | sed '/^--/d' > out_R1.fq
grep -A 2 -B 1 'GACT' *L001_R1_001.fastq | sed '/^--/d' > out_R2.fq

But the final result was unequal sequences in the files. The problem here is the mismatches inthe primers.

I have tried and seqtk tool where I extracted sequences matching CCTA from R1 file then saved seq id to extract the sequence with similar names. But it doesn't work for me.

cat out_R1.fq | awk '{if(NR%4==1) print ($0)}' > R1_id.txt

I tried to where ref file saved with primer bases in=*_L001_R1_001.fastq out=sep_r1.fastq ref=ref.txt ow=t

But it didn't give any output in the file, same for seqtk also. It would be appreciable if anyone can help on this.

So, in short, I want to seperate the reverse sequence from R1 and corresponding forward seq from R2.

Applogies if I made it confusing :(

ADD COMMENTlink modified 21 months ago • written 21 months ago by Sajan Raju0

Typically you have to align a read to reference genome to know whether it's reverse or forward. And I cannot understand why you want to separate them in FASTQ.

ADD REPLYlink written 21 months ago by chen1.9k

thank you for the comment. Because we have tried the switch tail (adapter switched) method and want to know how it works. is there any bias or anything.

ADD REPLYlink written 21 months ago by Sajan Raju0
gravatar for genomax
21 months ago by
United States
genomax70k wrote:

Here 4th and 5th sequences are actually R2/reverse reads.

That is not how things work with illumina. As long as the fastq header says 1:N:0:TCTAGACTCGTCGCTA, that is R1 in Illumina speak so you can't separate these sequences based on fastq headers.

EDIT: Use this method and let us know if it works to separate the two reads. Even though I have called the file R2.fq the headers will still say 1:N:0:TCTAGACTCGTCGCTA. You can use sed to change that if needed. in=orig.fq outm=R1.fq out=R2.fq literal=CCTACGGG restrictleft=8 k=8
ADD COMMENTlink modified 21 months ago • written 21 months ago by genomax70k

Thank you for the comment. I will check and let you know. Meanwhile, can you check the post again, I have updated the post with R2 file.

ADD REPLYlink written 21 months ago by Sajan Raju0

Above method should with with R2 file as well. Process the two files independently. You will need to fix the fastq headers before you merge the split files to create final R1/R.

ADD REPLYlink written 21 months ago by genomax70k

I understood that we can extract the seq matching the bases CCTACGGG will be separated to one out file. but I also want to extract the corresponding sequences matching with (same bases ) GACTACHVGGGTATCTAATCC. But it is a challenge when there is a mismatch in the any of the primers, seq will be ignored but the mate pair will be stored. Am i right?

ADD REPLYlink written 21 months ago by Sajan Raju0

You can use an error-tolerant matching algorithm.

ADD REPLYlink written 21 months ago by chen1.9k

Since CCTACGGG is common for all F primers, reads that begin with that sequence will all be separated from the rest. That is what you want in first place, correct? Do you also want to split those reads further into pools of individual barcodes?

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax70k

This is what I been trying to do. I have separated the sequences matching CCTACGGG bases. Then using their fastq IDs, I tried to extract the corresponding seq from other file. I have used the seqtk tool for that. but it wasn't printed anything to output file.

ADD REPLYlink modified 21 months ago • written 21 months ago by Sajan Raju0

So the method above worked?

Then using their fastq IDs, I tried to extract the corresponding seq from other file.

Purely based on the fastq header? In that case you can use the separated R1 file (with CCTACGGG ) and then use from BBMap to pull out corresponding reads from the R2 file. It should be possible to use but the repair method will be less painful. in1=CCTACGGG_file.fq in2=R2.fq out1=CCTACGGG_R1.fq out2=CCTACGGG_R2.fq outs=singletons.fq repair
ADD REPLYlink modified 21 months ago • written 21 months ago by genomax70k

Thanks a lot. It worked!! in1=CCTACGGG_file.fq in2=R2.fq out1=CCTACGGG_R1.fq out2=CCTACGGG_R2.fq outs=singletons.fq repair=t interleaved=false

I couldn't post it yesterday as I crossed my post limit per day :)

ADD REPLYlink written 21 months ago by Sajan Raju0

Great. Your posting limit should go up as you participate more on the site. You can accept my top-level answer (green check mark) to provide closure to this thread.

ADD REPLYlink written 21 months ago by genomax70k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1044 users visited in the last hour