Reverse complement for multiple fastq files
2
0
Entering edit mode
6.1 years ago
teleos27 • 0

Hello! I have a data set of various fastq files, all of these are forward readings:

ERS1355434.fq ERS1355435.fq ERS1355436.fq ERS1355437.fq ERS1355439.fq ERS1355440.fq ERS1355442.fq

I want to get the reverse complement for these fastq sequences.

I am using biopython to do this, so for one fastq I am doing the following:

from Bio import SeqIO
records=(rec.reverse_complement(id="rc_"+rec.id, description='reverse complement')
for rec in
SeqIO.parse('ERS1355454.fq','fastq'))
SeqIO.write(records,seq_record.id + "R2.fastq",'fastq')

And this works, but now I am struggling with figuring out a way to do the same on all the fastq files. I am thinking on either run a loop or write a function and loop it through all the files. I am at beginning stage and not sure how to approach it, if someone has any suggestions on how to do this it will help a lot!

Thanks

biopython • 4.6k views
ADD COMMENT
0
Entering edit mode

there is no question.

ADD REPLY
2
Entering edit mode
6.1 years ago
Joe 21k

Replace your file name with sys.argv[1], import the sys module, and then save your script to call it in a loop from the command line.

Edit: here's the full code you'll need. I discovered that upon reverse complementing, BioPython throws away the header information for some reason, so you need to hack that a bit:

import sys
from Bio import SeqIO

recs = SeqIO.parse(sys.argv[1], 'fastq')

for rec in recs:
    rc = rec.reverse_complement()
    rc.id = rec.id
    rc.name = rec.name + '_R2.fastq'
    rc.description = ''
    SeqIO.write(rc, rc.name, 'fastq')

Call this in a shell loop to run over all your sequences.:

for file in /path/to/*.fastq ; do
    python reverse_complement.py $file
done

(You can do this in a one-liner at the shell prompt if you wish too: for file in /path/to/*.fastq ; do python reverse_complement.py $file ; done)

NOTE

This will make a new file for every reverse complemented read (you can change rc.name in the last line if you don't want this). Also, this will mean the forward and reverse sequences will share the same fastq header name, which may not be ideal. In which case, edit rc.id = rec.id + '_some_string' or whatever you need.

ADD COMMENT
1
Entering edit mode

The annotation will not in general apply to the RC sequence, so in the .reverse_complement(...) arguments you can provide new values (as in the updated original question) or use True to preserve a value. See the method's documentation:

http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html#reverse_complement

ADD REPLY
0
Entering edit mode

Thanks so much for your response! I replaced my file name with sys.argv[1] and import sys, and saved the script. Then I created a bash script:

#!/bin/bash
for i in files;do
   ./reverse_comp.py 
done

I run the script and I get the following error:

Traceback (most recent call last):
  File "./reverse_comp.py", line 7, in <module>
    for rec in SeqIO.parse(sys.argv[1],'fastq'))
IndexError: list index out of range

Do you have any suggestions on what I'm doing wrong?

ADD REPLY
0
Entering edit mode

You need to pass the file to the script as the first argument:

python reverse_comp.py $i

You get index out of range because [1] evaluates to nothing in your loop.

ADD REPLY
0
Entering edit mode

Thanks for your response! I am so sorry for this, but I have a different error for something else. I really appreciate your thoughts on this. I now get this error:

Traceback (most recent call last):
 File "./reverse_comp.py", line 9, in <module>
SeqIO.write(records,seq_record.id + "R2.fastq",'fastq')
NameError: name 'seq_record' is not defined

So I defined my seq_record by doing this:

!/usr/bin/env python
from Bio import SeqIO
import sys
records=(rec.reverse_complement(id="rc_"+rec.id, description='reverse complement')
    for rec in SeqIO.parse(sys.argv[1],'fastq'))
identifiers=[seq_record.id for seq_record in SeqIO.parse(sys.argv[1],'fastq')]
SeqIO.write(records,identifiers + "R2.fastq",'fastq')

But I get this error:

IOError: [Errno 2] No such file or directory: 'files'

Thanks so much for your help!

ADD REPLY
0
Entering edit mode

That code is all over the place sorry! - so I've just written some from scratch in my edited answer above.

ADD REPLY
0
Entering edit mode

Yes I know :( Thank you so much!

ADD REPLY
1
Entering edit mode
6.1 years ago
GenoMax 141k

Since the main question is about reverse complementing the reads another option is reformat.sh from BBMap suite. Use a shell loop to do more than one file.

reformat.sh in=file.fq.gz out=file_rcmp.fq.gz rcomp=t

If you only want to RC read 2 then use this instead of rcomp

rcompmate=t             (rcm) Reverse-compliment read 2 only
ADD COMMENT
0
Entering edit mode

Thanks for your suggestion I'll take a look at this.

ADD REPLY

Login before adding your answer.

Traffic: 2438 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6