Question: Extract Reverse Reads Based on Header Information of Forward Reads
0
gravatar for dennis.guettler
13 months ago by
dennis.guettler0 wrote:

Hi Community.

Context: I used MiSeq for sequencing a metagenome with paired end setup. But due to overclustering, my reverse reads can't be matched by the adaptor sequence because the quality isn't good.

Approach: I demultiplexed the forward (R1) reads based on adaptor sequence - which was quite possible - and want to use the header information/Sequence identifier to find the reverse reads (R2) in the second file, because they have similar header names until a certain position. For this i extracted the header information out of R1 to the point, that they are similar in R2.

Concrete: An example for the extracted R1 Header (Let's call it "Forward.txt") is

M05640:9:000000000-B6W7Y:1:1101:15044:1347
M05640:9:000000000-B6W7Y:1:1101:15768:1360
M05640:9:000000000-B6W7Y:1:1101:14907:1368
M05640:9:000000000-B6W7Y:1:1101:14347:1398
...

The R2 file (Let's call it "Reverse.fasta") looks like this

>M05640:9:000000000-B6W7Y:1:1101:14170:1443_2:N:0:0
GGCTAGCATTTTCGGAGTTCCTCATCTGGACAG
>M05640:9:000000000-B6W7Y:1:1101:14362:1443_2:N:0:0
GTCAGCCCGTTCGCCCATCGCCGTCTCAAGAGTGATC
>M05640:9:000000000-B6W7Y:1:1101:14362:1443_2:N:0:0
GTCAGCCCGTTCGCCCATCGCCGTCTCAAGAGTGAT
>M05640:9:000000000-B6W7Y:1:1101:13983:1444_2:N:0:0
TTCCGGCGAGCGTGGACGGCTTGTGCATGGGGAACTT
....

Aim: I want to collect all the sequences + header information from Reverse.fasta based on the header information from forward.txt.

What was done so far: I searched a real long time, and found a python script which deals with that problem, but I can't adapt it perfectly, to solve my problem. (https://stackoverflow.com/questions/15352219/extract-sequences-from-a-fasta-file-based-on-entries-in-a-separate-file) I adapted it:

f2 = open('Forward.txt','r')
f1 = open('Reverse.fasta','r')
f3 = open('FindRV.txt','w')

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('_')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

Problem: I Have multiple demultiplexed Forward.txt files with absolute same structure but sometimes the output file has quite some sequences (like I hoped for) in it and sometimes there are zero. For the files with zero output, I manually searched for the sequence header in geneious11.1.4 and found sequences, so it's not a problem that there aren't any, it seems to be a problem of the code itself.

Hope: Iam no coder and tried and searched a long time for the right solution but now i'am desperate enough to write here ^^ hope anyone understands my issue and maybe has a solution.

Greetings

ADD COMMENTlink modified 13 months ago by Dattatray Mongad320 • written 13 months ago by dennis.guettler0
0
gravatar for Dattatray Mongad
13 months ago by
National Centre for Cell Science, Pune
Dattatray Mongad320 wrote:

use fastq-pair, it gives the only reads which are in forward and reverse files. So, rather than making forward.txt(headers), make forward.fastq(containg reads for which you want reverse reads to be extracted) and run the program.

This will make four files in the directory:

  1. forward.fastq.paired.fq
  2. forward.fastq.single.fq
  3. reverse.fastq.paired.fq
  4. reverse.fastq.single.fq

What your seeking will be in 3rd file.

ADD COMMENTlink written 13 months ago by Dattatray Mongad320

Thanks for another option. the internet is so full of tools that one sometimes get lost :) I will try this

ADD REPLYlink written 13 months ago by dennis.guettler0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1504 users visited in the last hour