Question: Separate the reverse sequence from R1 and corresponding forward seq from R2
0
gravatar for Sajan Raju
16 months ago by
Sajan Raju0
Finland
Sajan Raju0 wrote:

Hi,

Recently we have sequenced (V3-V4 region of16S) some samples with half of amplicon with adapters switched.

Using the F primer 5' CCTACGGGNGGCWGCAG 3' & Rev primer 5' GACTACHVGGGTATCTAATCC 3'.

So now we have both forward and reverse reads in both R1 and R2 files.

For example: *R1.fastq

@HWI-1KL166:431:HWKJKBCXY:1:1101:5020:2038 1:N:0:TCTAGACTCGTCGCTA
CCTACGGGGGGCAGCAGTGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTCTGAAGAAGGCCTTCGGGTTGTAAAGGACTTTTGTCAGGGAAGAAAAGGGCGGGGTTAATACCCCTGTCTGATGACGGTACCTGAAGAATAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCACGCAGGCGGTG
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIHIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIHIIIIIIIIIIIIIIIIHHIIIIIGHHHGIIIICHII
@HWI-1KL166:431:HWKJKBCXY:1:1101:6494:2219 1:N:0:TCTAGACTCGTCGCTA
CCTACGGGGGGCTGCAGTGAGGAATATTGGTCAATGGGCGGGAGCCTGAACCAGCCAAGTAGCGTGCAGGATGACGGCCCTATGGGTTGTAAACTGCTTTTATGCGGGGATAAAGTGAGGGACGTGTCCTTCATTGCAGGTACCGCATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCTGGAGATTA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIHIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIHIIIIIIIIIIHIIIEHEHIIIIIIIIIIHIHHG@
@HWI-1KL166:431:HWKJKBCXY:1:1101:11712:2422 1:N:0:TCTAGACTCGTCGCTA
CCTACGGGGGGCTGCAGTGGGGAATATTGCGCAATGGGGGCAACCCTGACGCAGCCATGCCGCGTGAATGAAGAAGGCCTTCGGGTTGTAAAGTTCTTTCGGTAGCGAGGAAGGCATTTAGTTTAATAAACTAAGTGATTGACGTTAACTACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTTCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGG
+
DDDDDIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIEHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIHIIHIIIIIIIIIIIIIIIIIIHIIIIIIIIIHIIIIHIIIIIIIIIIIIIIIIHIIIIIIIHHIIIIIIIIIIIIIGIIIIIHEIIIIIIIIEHIFHHHIIIIIIIGHIIIIH#
@HWI-1KL166:431:HWKJKBCXY:1:1101:14311:2424 1:N:0:TCTAGACTCGTCGCTA
GACTACTAGGGTATCTAATCCTGTTCGATACCCGCACCTTCGAGCTTCAGCGTCAGTTGCGCTCCCGTCAGCTGCCTTCGCAATCGGAGTTCTTCGTCATATCTATGCATTCCACCGCTACACCACGCATTCCGCCTACCTCATCTACACTCAAGCCCGCCAGTATCAATGGCAATTTAGGAGTTAAGCTCCTAGATTTCACCGCTGACTTAACAGGCCGCCTACGCACCCTTTAAACCCAATAAATCCG
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIHIIIIIHIDGHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIHHIIIIIIIIGIIIIIIIHHHIIIIIGIHIIIIIIIIIIIIIIIIIIHHIIIIIHIIIIIIIIIFHIGHHHIHIIHIIE 
@HWI-1KL166:431:HWKJKBCXY:1:1101:8278:3000 1:N:0:TCTAGACTCGTCGCTA
GACTACCGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATGAGCGTCAGTACATTCCCAAGGGGCTGCCTTCGCCTTCGGTATTCCTCCACATCTCTACGCATTTCACCGCTACACGTGGAATTCTACCCCTCCCTAAAGTACTCTAGCGACCCAGTATGAAATGCAATTCCCAGGTTAAGCCCGGGGCTTTCACACCTCACTTAAATCACCGCCTGCGCGCCCTTTACGCCCAGTTATTCCG

And same for *R2.fastq

 @HWI-1KL166:431:HWKJKBCXY:1:1101:5020:2038 2:N:0:TCTAGACTCGTCGCTA
GACTACCCGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATGAGCGTCAGTACATTCCCAAGGGGCTGCCTTCGCCTTCGGTATTCCTCCACATCTCTACGCATTTCACCGCTACACGTGGAATTCTACCCCTCCCTAAAGTACTCTAGCGACCCAGTATGAAATGCAATTCCCAGGTTAAGCCCGGGGCTTTCACACCTCACTTAAGTCACCGCCTGCGTGCCCTTTACGCCCAGTTATTCCGATTAA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIICHIIIIIIIIIIIIIHIIIIIIIIIIHIIHIIIIHHIIIIIIIIIIIIIIIIIIIHHHIGIHHII-8DHHIEDHIIHIIIIHIGFHHIIHH?GHHIICHHI@
@HWI-1KL166:431:HWKJKBCXY:1:1101:6494:2219 2:N:0:TCTAGACTCGTCGCTA
GACTACTCGGGTATCTAATCCTGTTCGATACCCGCACCTTCGAGCTTCAGCGTCAGTTGCGCTCCCGTCAGCTGCCTTCGCAATCGGAGTTCTTCGACATATCTAAGCATTTCACCGCTACACGACGAATTCCGCCAACGTTGTGCGTACTCAAGGAAACCAGTATGCGCTGCAATTCAGACGTTGAGCGTCTACATTTCACAACACACTTAATCTCCAGCCTACGCTCCCTTTAAACCCAATAAATCCGGATAA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIHIIIIIIIH1EHHIFHIIIIIIIIIIIIIIIIIIIIIIIIIIHIHIIIIIIIHIIIICHIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIHIIHHEHFHIIHHIIIIHHHEHHHIIGHIIIHEHHCHHIH
@HWI-1KL166:431:HWKJKBCXY:1:1101:11712:2422 2:N:0:TCTAGACTCGTCGCTA
GACTACCGGGGTATCTAATCCTGTTCGATACCCGCACCTTCGAGCTTCAGCGTCAGTTGCGCTCCCGTCAGCTGCCTTCGCAATCGGAGTTCTTCGTCATATCTAAGCATTTCACCGCTACACGACGAATTCCGCCAACGTTGTGCGTACTCAAGGAAACCAGTATGCGCTGCAAGTCAGACGTTGAGCGTCTACATTTCACAACACACTTAATCTCCGGCCTACGCTCCCTTTAAACCCAATAAATCCGGATAA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIHIIIHIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFGHIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIHIIHIIIIIIIIIIGIIIIIIGIIIGHHIIIIIGIIHHHIHFHIIIIIIIHIIIIIIGHHHHIIFHIIIGHIIIII@F@GHHIIHIIFHHHHHEEF
@HWI-1KL166:431:HWKJKBCXY:1:1101:14311:2424 2:N:0:TCTAGACTCGTCGCTA
CCTACGGGTGGCTGCAGTGAGGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTCGCGTGAAGGATGACTGTCTTATGGATTGTAAACTTCTTTTATACGGGAATAACAAGAGCCACGTGTGACTCCCTGCATGTACCGTATGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGTTAAGTCC
+
ADDDDIIIHIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIHIIIIIIIIIIIIIIIDGHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIHIIIHHHHIIIIIHIIIIIGIIHIGIIIIIIIIHGHHHIHIIIHIG?>CHHHIIDHIIIIGEFEHHGIHH@HCHHII?HHHFIHGHIIGHDCGH6@FH?6F#
@HWI-1KL166:431:HWKJKBCXY:1:1101:8278:3000 2:N:0:TCTAGACTCGTCGCTA
CCTACGGGTGGCTGCAGTGGGGAATATTGCGCAATGGGGGGAACCCTGACGCAGCCATGCCGCGTGAATGAAGAAGGCCTTCGGGTTGTAAAGTTCTTTCGGTATTGAGGAAGGAGTGTATGTTAATAGCATACATTATTGACGTTAAATACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCGCGCAGGCGGTGATTTA
+

Here 4th and 5th sequences are actually reverse and fwd reads in R1 and R2 resp.

Now, the challenging part is to separate the R1 and R2 reads. I have tried to extract the sequences matching the primers sequences (even with few basepair of primers).

grep -A 2 -B 1 'CCTA' *L001_R1_001.fastq | sed '/^--/d' > out_R1.fq
grep -A 2 -B 1 'GACT' *L001_R1_001.fastq | sed '/^--/d' > out_R2.fq

But the final result was unequal sequences in the files. The problem here is the mismatches inthe primers.

I have tried bbduk.sh and seqtk tool where I extracted sequences matching CCTA from R1 file then saved seq id to extract the sequence with similar names. But it doesn't work for me.

cat out_R1.fq | awk '{if(NR%4==1) print ($0)}' > R1_id.txt

I tried to bbduk.sh where ref file saved with primer bases

bbduk.sh in=*_L001_R1_001.fastq out=sep_r1.fastq ref=ref.txt ow=t

But it didn't give any output in the file, same for seqtk also. It would be appreciable if anyone can help on this.

So, in short, I want to seperate the reverse sequence from R1 and corresponding forward seq from R2.

Applogies if I made it confusing :(

ADD COMMENTlink modified 16 months ago • written 16 months ago by Sajan Raju0

Typically you have to align a read to reference genome to know whether it's reverse or forward. And I cannot understand why you want to separate them in FASTQ.

ADD REPLYlink written 16 months ago by chen1.8k

thank you for the comment. Because we have tried the switch tail (adapter switched) method and want to know how it works. is there any bias or anything.

ADD REPLYlink written 16 months ago by Sajan Raju0
2
gravatar for genomax
16 months ago by
genomax63k
United States
genomax63k wrote:

Here 4th and 5th sequences are actually R2/reverse reads.

That is not how things work with illumina. As long as the fastq header says 1:N:0:TCTAGACTCGTCGCTA, that is R1 in Illumina speak so you can't separate these sequences based on fastq headers.

EDIT: Use this method and let us know if it works to separate the two reads. Even though I have called the file R2.fq the headers will still say 1:N:0:TCTAGACTCGTCGCTA. You can use sed to change that if needed.

bbduk.sh in=orig.fq outm=R1.fq out=R2.fq literal=CCTACGGG restrictleft=8 k=8
ADD COMMENTlink modified 16 months ago • written 16 months ago by genomax63k

Thank you for the comment. I will check and let you know. Meanwhile, can you check the post again, I have updated the post with R2 file.

ADD REPLYlink written 16 months ago by Sajan Raju0

Above method should with with R2 file as well. Process the two files independently. You will need to fix the fastq headers before you merge the split files to create final R1/R.

ADD REPLYlink written 16 months ago by genomax63k

I understood that we can extract the seq matching the bases CCTACGGG will be separated to one out file. but I also want to extract the corresponding sequences matching with (same bases ) GACTACHVGGGTATCTAATCC. But it is a challenge when there is a mismatch in the any of the primers, seq will be ignored but the mate pair will be stored. Am i right?

ADD REPLYlink written 16 months ago by Sajan Raju0

You can use an error-tolerant matching algorithm.

ADD REPLYlink written 16 months ago by chen1.8k

Since CCTACGGG is common for all F primers, reads that begin with that sequence will all be separated from the rest. That is what you want in first place, correct? Do you also want to split those reads further into pools of individual barcodes?

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax63k

This is what I been trying to do. I have separated the sequences matching CCTACGGG bases. Then using their fastq IDs, I tried to extract the corresponding seq from other file. I have used the seqtk tool for that. but it wasn't printed anything to output file.

ADD REPLYlink modified 16 months ago • written 16 months ago by Sajan Raju0
1

So the bbduk.sh method above worked?

Then using their fastq IDs, I tried to extract the corresponding seq from other file.

Purely based on the fastq header? In that case you can use the separated R1 file (with CCTACGGG ) and then use repair.sh from BBMap to pull out corresponding reads from the R2 file. It should be possible to use filterbyname.sh but the repair method will be less painful.

repair.sh in1=CCTACGGG_file.fq in2=R2.fq out1=CCTACGGG_R1.fq out2=CCTACGGG_R2.fq outs=singletons.fq repair
ADD REPLYlink modified 16 months ago • written 16 months ago by genomax63k

Thanks a lot. It worked!!

repair.sh in1=CCTACGGG_file.fq in2=R2.fq out1=CCTACGGG_R1.fq out2=CCTACGGG_R2.fq outs=singletons.fq repair=t interleaved=false

I couldn't post it yesterday as I crossed my post limit per day :)

ADD REPLYlink written 16 months ago by Sajan Raju0
1

Great. Your posting limit should go up as you participate more on the site. You can accept my top-level answer (green check mark) to provide closure to this thread.

ADD REPLYlink written 16 months ago by genomax63k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1229 users visited in the last hour