Question: How to separate mixed-orientation raw Illumina sequences into forward and reverse FASTQ files?
will.wcb20 wrote:

Hey there,

I have received some paired-end Illumina MiSeq sequences in two files, R1 and R2. The problem is that each of these files contains a mix of forward and reverse reads. For instance:

R1

Sample1-seq1: barcode, forward primer, forward sequence

Sample1-seq2: reverse primer, reverse sequence

Sample2-seq3: barcode, forward primer, forward sequence

etc.

R2

Sample1-seq1: reverse primer, reverse sequence

Sample1-seq2: barcode, forward primer, forward sequence

Sample2-seq3: reverse primer, reverse sequence

etc.

How can I separate these into forward and reverse read files for use in QIIME2, for instance?

I will paste a few lines of each:

R1:

@D00420:195:HK5N5BCX2:2:1101:1200:2095 1:N:0:TGACCA
GGACTACGGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCTCCTCAGTGTCAGTTCCGGCCCAGAGCGCCGCCTTCGNNNNNNNNNNTCNNNNNNATANNNNNNNANNNNNNNNNNNNNNNNNNNNNTCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGACTTACTAAGCCACCTACGAGCTCTTTACGCCCAATAAATCC
+
GAGGGGGGGIIIIIIIIIIIGIIGIIIIGIIGGGGGIGIIIIIGIIIGGGGIIIGIIIIGGGGGGGGGGGGGGGGGIIIIG##########<<######<<<#######<#####################<<<##########################################################################777AGGGGAGAAGGGAGGGAAGGGGAGGGIIIG<.GA77AGAG
@D00420:195:HK5N5BCX2:2:1101:1327:2093 1:N:0:TGACCA
NCCTCCCTGTGTCAGCCGCCGCGGTAATACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGNNNNNNNNNNNTNNNNNNNAANN
+
#<GGGIGGIGGIIIIIIIIIIGGIIIIIIIIGIIGIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIII###########<#######<<##
@D00420:195:HK5N5BCX2:2:1101:1946:2119 1:N:0:TGACCA
TCCTCGTAGTGTCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTCATGCAAGACAGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTTGTGACTGCATGGCTGGAGTGCGGCAGAGGGGGATGGAATTCCGCGTGTANNNNNNNNNNNNNNAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGGCCTGCACTGACGCT
+
GAGAGGGGIAGGGGGGGAGGAGGGIGGGGIIGIGGGGIGGGGGGGGGGIGGGGGGGIIIIIIIIGGIGIGIIGIIGA<GAGGGAGGGIGGGGGGGGGGGGGGGGGGIIGG.GGGIIGGGGG.GGGIGAGGIGGAAGGGGGGGA.GGGGGIGIGGGGAGGGA<GIGAGAGGGIGIGGGGGGA##############.77AGGGGI.A.<<GGAGGGGGGGGIGG.<GGAGGIGGGI.77AGGGGAA7GGIG.
@D00420:195:HK5N5BCX2:2:1101:2658:2140 1:N:0:TGACCA
ACTACTAGGGTTTCTAATCCTGTTCGCTACCCACGCTTTCGCTCCTCAGCGTCAGGTAAGGCCCAGAGAGCCGCCTTCGCCACCGGTGTTCTTCCTGATATCTGCGCATTCCACCGCTACACCAGGAGTTCCGCTCTCCCCTGCCTACCTCTAGTCTGCCCGTATCGGAAGCAGGCTCGGAGTTAAGCTCCGAGTTTTCACTCCCGACGTGACGAACCGCCTACGAGCCCTTTACGCCCAATAATTCCGG
+
GGGGGGGIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIGGIIIGIIIIIGIIIGIIIIIIIIIGIIIIIIIIIGGIIIIIIIGIIIIGIIIIIIIIGGGGIIIIGGGIIGGIIIGGGGIGGIIIIGGIIIIGIIGGGIIGIGIIIIIIGIIIIIIIIIIGIGGGIIIIIIIAGGIGGGIIIIIIIIIIIIIIGIIIIIIIIIIGIIGGIIGIGIGIGGGGGGGGIIGGIIGGGIIIIIGIGGGGA
@D00420:195:HK5N5BCX2:2:1101:2763:2157 1:N:0:TGACCA
TCCGGCCGGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATCATTGGGCGTAAAGAGCGCGTAGGCGGCCCTGTAAGTCCGCTGTGAAAGTCAAGGGCTCAACCCTTGAATGCCGGTGGATACTGCAGGGCTAGAGTCCGGAAGAGGCGAGTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGCTCGCTGGGACGGTACTGACGCG
+
GGGGGIIGIIIIIIIIIIIIIIIIIGIIIIIIIIGIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIGGGGGIIGIIIIIIIIIIIIGIIIIIIIGIIIIIIIIGIIIIIIIIIIIIIIIIGIIIIIGIIIIIIIIGIIIIIIIIIIIIIIIGGIIIIIIIIIIIIGIIIGIIIIIIIGIIIIIIIIIGIIIIIIIIIIIIIIGIIGGGGGIIGGG.

R2:

@D00420:195:HK5N5BCX2:2:1101:1200:2095 2:N:0:TGACCA
TCCTCCCTGTGCCAGCCGCCGCGGTAACACGTAGGGGGCA
+
GGGGGGIIIIIIIGGGG<GGIGGIIIIIGIIIIIIGIIGI
@D00420:195:HK5N5BCX2:2:1101:1327:2093 2:N:0:TGACCA
GGACTACGGGGGTTTCTAATCCTGTTTGCTCCCCACGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
AGAGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIG#####################################################################################################################################################################################################################
@D00420:195:HK5N5BCX2:2:1101:1946:2119 2:N:0:TGACCA
GGACTACAGGGGTTTCTAATCCTGTTTGCGCCCCACGCTTGCGTGCATGAGCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
GAAAGA.<A.G<<..<....<<<<G.A.A..<.<A..<...<AA..AAGGG.<A##########################################################################################################################
@D00420:195:HK5N5BCX2:2:1101:2658:2140 2:N:0:TGACCA
CGTGTCAGCAGTCGCGGTAATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGNNNTCGTNNNNNGTTCNNNNCGTCNNNNGTGANNNCTCGGNGCTTNACTNNGAGCCTGCTTCCGATACGGGCAGACNAGAGGNAGGCAGGGGAGAGCGGAACTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAAGAACACCGGTGGCGAAGGCGGCTCTCTGGGCCTTACCTGACGCTGAGGAGG
+
GGAGGIIIIIGIIIIIIIIIIIIIGIIIIIIIIIIIIIGGGIIIIIIIGIIIIIIIIIIIII###<<GG#####<<GG####<<<G####<<GG###<<GGG#<<GG#<<G##<<AGGGGIIIIIIIIIIIIGGGIIG#<<GGG#<<GGGGGIIGGGGIGIGIIIIIIIIIIIGGIIIGGIIGGGGGGIGGGGGIIIIGGIGIGIIIIGGIIGGGGGGGGIA<GGGGGIIGGGAGAGGAAGGI.AGGIAG.
@D00420:195:HK5N5BCX2:2:1101:2763:2157 2:N:0:TGACCA
GGACTACACGGGTTTCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTACCGTCCCAGCGAGCTGCCTTCGCCATCGGTGTTCCTCCTGATATCTGCGCATTTCACCGCTACACCAGGAATTCCACTCGCCTCTTCCGGACTCTAGCCCTGCAGTATCCACCGGCATTCAAGGGTTGAGCCCTTGACTTTCACAGCGGACTTACAGGGCCGCCTACGCGCTCTTTACGCCCAATGATTCCG
+
AGGGGIIIIIIGGIIGIIIIIGGGIIIIIIIIIIIGIIIIIIIIIGGGIIGGIIIIIIIIGIIIGGGGIIIIIIIIIIGIIIGIIGGGIIGGGGGIGGGGGAGGGGGGIGIGIIIIGIGIIIIIIIGGIIGIGIGIGGGIIIIIGGGGGIIIIIIGGIIIGIIIIIIIGGIIGGGGIGIGGIIIGIIIIGGGIGIGGGGIGGIIIIGIGG<GGGGGGIIGGGAGAGGGGIGGGGIGGIIIIIAGGGGGGA.

Thank you very much for any help, even just to point me in the right direction.

next-gen sequence • 892 views
modified 19 months ago by mike-zx210 • written 19 months ago by will.wcb20
mike-zx210 wrote:

I am a little confused, because the example lines you posted look like the expected FASTQ output of a normal paired-end Illumina run: every sequence header in R1 carries 1:N:0 and every header in R2 carries 2:N:0, meaning forward reads are correctly placed in R1 and reverse reads in R2. However, if you want to make sure all reads are correctly placed, you can do the following:

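# Match the read number only in the header field, and overwrite (>) rather than append (>>)
zcat R?.fastq.gz | paste - - - - | grep -E '^@[^[:space:]]+ 1:[YN]:' | tr '\t' '\n' > correct_R1.fastq
zcat R?.fastq.gz | paste - - - - | grep -E '^@[^[:space:]]+ 2:[YN]:' | tr '\t' '\n' > correct_R2.fastq
gzip correct_R1.fastq correct_R2.fastq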

Just replace the zcat argument with the names of your actual files. The '?' wildcard stands in for the read number (1 or 2), so both files are processed in the same pass.
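To confirm the header observation before rewriting any files, one option is to tally the read-number field in each file. This is a minimal sketch and not part of the original answer: `count_read_flags` is a hypothetical helper name, and it assumes gzipped input, exactly four lines per record, and headers shaped like the ones posted in the question (`... 1:N:0:TGACCA`).

```shell
# Hypothetical helper: tally the read-number field of every FASTQ header.
# Assumes four lines per record and a header of the form
#   @<instrument fields> <read>:<filtered>:<control>:<index>
count_read_flags() {
    gzip -dc "$1" |
        awk 'NR % 4 == 1 { split($2, f, ":"); n[f[1]]++ }   # header lines only
             END { for (k in n) print k, n[k] }' |
        sort
}
```

Running `count_read_flags R1.fastq.gz` prints one line per read number found, followed by its count. A file that holds only read 1 records should therefore produce a single line starting with `1`.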

Hope this helps.

modified 19 months ago • written 19 months ago by mike-zx210

Thank you very much. I thought this to be the case as well, but I am inexperienced, and the documentation provided by the sequencing company instructed us that they would be mixed. I suppose that was out of date or incorrect. Really appreciate the help.

written 19 months ago by will.wcb20
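The read-number flag in the header records which sequencing read a record came from, not which primer the read starts with, so the sequencing company's claim about mixed orientation can only be checked against the sequences themselves. The sketch below is one hedged way to do that and is not from the thread: `count_motif_starts` is a hypothetical helper, and the motif passed to it is a placeholder for your real primer sequence. It assumes gzipped input, four lines per record, and a primer that sits at the very start of the read; a read that begins with a barcode before the primer would not be counted.

```shell
# Hypothetical helper: count reads whose sequence line starts with a motif.
# Prints "<reads starting with motif> <total reads>".
# Usage: count_motif_starts FILE.fastq.gz MOTIF
count_motif_starts() {
    gzip -dc "$1" |
        awk -v m="$2" '
            NR % 4 == 2 { total++; if (index($0, m) == 1) hits++ }  # sequence lines only
            END { print hits + 0, total + 0 }'
}
```

For example, `count_motif_starts R1.fastq.gz GGACTAC` uses the seven bases that open several of the reads posted above purely as an illustration. If the first number is well below the second in both R1 and R2, the files likely contain reads in both orientations.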