Question: Extract Reverse Reads Based on Header Information of Forward Reads
gravatar for dennis.guettler
23 months ago by
dennis.guettler0 wrote:

Hi Community.

Context: I used MiSeq for sequencing a metagenome with paired end setup. But due to overclustering, my reverse reads can't be matched by the adaptor sequence because the quality isn't good.

Approach: I demultiplexed the forward (R1) reads based on adaptor sequence - which was quite possible - and want to use the header information/Sequence identifier to find the reverse reads (R2) in the second file, because they have similar header names until a certain position. For this i extracted the header information out of R1 to the point, that they are similar in R2.

Concrete: An example for the extracted R1 Header (Let's call it "Forward.txt") is


The R2 file (Let's call it "Reverse.fasta") looks like this


Aim: I want to collect all the sequences + header information from Reverse.fasta based on the header information from forward.txt.

What was done so far: I searched a real long time, and found a python script which deals with that problem, but I can't adapt it perfectly, to solve my problem. ( I adapted it:

f2 = open('Forward.txt','r')
f1 = open('Reverse.fasta','r')
f3 = open('FindRV.txt','w')

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('_')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            skip = 0
            skip = 1
        if not skip:


Problem: I Have multiple demultiplexed Forward.txt files with absolute same structure but sometimes the output file has quite some sequences (like I hoped for) in it and sometimes there are zero. For the files with zero output, I manually searched for the sequence header in geneious11.1.4 and found sequences, so it's not a problem that there aren't any, it seems to be a problem of the code itself.

Hope: Iam no coder and tried and searched a long time for the right solution but now i'am desperate enough to write here ^^ hope anyone understands my issue and maybe has a solution.


ADD COMMENTlink modified 23 months ago by Dattatray Mongad350 • written 23 months ago by dennis.guettler0
gravatar for Dattatray Mongad
23 months ago by
National Centre for Cell Science, Pune
Dattatray Mongad350 wrote:

use fastq-pair, it gives the only reads which are in forward and reverse files. So, rather than making forward.txt(headers), make forward.fastq(containg reads for which you want reverse reads to be extracted) and run the program.

This will make four files in the directory:

  1. forward.fastq.paired.fq
  2. forward.fastq.single.fq
  3. reverse.fastq.paired.fq
  4. reverse.fastq.single.fq

What your seeking will be in 3rd file.

ADD COMMENTlink written 23 months ago by Dattatray Mongad350

Thanks for another option. the internet is so full of tools that one sometimes get lost :) I will try this

ADD REPLYlink written 23 months ago by dennis.guettler0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 954 users visited in the last hour