Question: Reverse complement for multiple fastq files
0
gravatar for teleos27
2.2 years ago by
teleos270
teleos270 wrote:

Hello! I have a data set of various fastq files, all of these are forward readings:

ERS1355434.fq ERS1355435.fq ERS1355436.fq ERS1355437.fq ERS1355439.fq ERS1355440.fq ERS1355442.fq

I want to get the reverse complement for these fastq sequences.

I am using biopython to do this, so for one fastq I am doing the following:

from Bio import SeqIO
records=(rec.reverse_complement(id="rc_"+rec.id, description='reverse complement')
for rec in
SeqIO.parse('ERS1355454.fq','fastq'))
SeqIO.write(records,seq_record.id + "R2.fastq",'fastq')

And this works, but now I am struggling with figuring out a way to do the same on all the fastq files. I am thinking on either run a loop or write a function and loop it through all the files. I am at beginning stage and not sure how to approach it, if someone has any suggestions on how to do this it will help a lot!

Thanks

biopython • 1.6k views
ADD COMMENTlink modified 2.2 years ago by genomax83k • written 2.2 years ago by teleos270

there is no question.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Pierre Lindenbaum128k
2
gravatar for Joe
2.2 years ago by
Joe16k
United Kingdom
Joe16k wrote:

Replace your file name with sys.argv[1], import the sys module, and then save your script to call it in a loop from the command line.

Edit: here's the full code you'll need. I discovered that upon reverse complementing, BioPython throws away the header information for some reason, so you need to hack that a bit:

import sys
from Bio import SeqIO

recs = SeqIO.parse(sys.argv[1], 'fastq')

for rec in recs:
    rc = rec.reverse_complement()
    rc.id = rec.id
    rc.name = rec.name + '_R2.fastq'
    rc.description = ''
    SeqIO.write(rc, rc.name, 'fastq')

Call this in a shell loop to run over all your sequences.:

for file in /path/to/*.fastq ; do
    python reverse_complement.py $file
done

(You can do this in a one-liner at the shell prompt if you wish too: for file in /path/to/*.fastq ; do python reverse_complement.py $file ; done)

NOTE

This will make a new file for every reverse complemented read (you can change rc.name in the last line if you don't want this). Also, this will mean the forward and reverse sequences will share the same fastq header name, which may not be ideal. In which case, edit rc.id = rec.id + '_some_string' or whatever you need.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Joe16k
1

The annotation will not in general apply to the RC sequence, so in the .reverse_complement(...) arguments you can provide new values (as in the updated original question) or use True to preserve a value. See the method's documentation:

http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html#reverse_complement

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Peter5.8k

Thanks so much for your response! I replaced my file name with sys.argv[1] and import sys, and saved the script. Then I created a bash script:

#!/bin/bash
for i in files;do
   ./reverse_comp.py 
done

I run the script and I get the following error:

Traceback (most recent call last):
  File "./reverse_comp.py", line 7, in <module>
    for rec in SeqIO.parse(sys.argv[1],'fastq'))
IndexError: list index out of range

Do you have any suggestions on what I'm doing wrong?

ADD REPLYlink modified 2.2 years ago by genomax83k • written 2.2 years ago by teleos270

You need to pass the file to the script as the first argument:

python reverse_comp.py $i

You get index out of range because [1] evaluates to nothing in your loop.

ADD REPLYlink written 2.2 years ago by Joe16k

Thanks for your response! I am so sorry for this, but I have a different error for something else. I really appreciate your thoughts on this. I now get this error:

Traceback (most recent call last):
 File "./reverse_comp.py", line 9, in <module>
SeqIO.write(records,seq_record.id + "R2.fastq",'fastq')
NameError: name 'seq_record' is not defined

So I defined my seq_record by doing this:

!/usr/bin/env python
from Bio import SeqIO
import sys
records=(rec.reverse_complement(id="rc_"+rec.id, description='reverse complement')
    for rec in SeqIO.parse(sys.argv[1],'fastq'))
identifiers=[seq_record.id for seq_record in SeqIO.parse(sys.argv[1],'fastq')]
SeqIO.write(records,identifiers + "R2.fastq",'fastq')

But I get this error:

IOError: [Errno 2] No such file or directory: 'files'

Thanks so much for your help!

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by teleos270

That code is all over the place sorry! - so I've just written some from scratch in my edited answer above.

ADD REPLYlink written 2.2 years ago by Joe16k

Yes I know :( Thank you so much!

ADD REPLYlink written 2.2 years ago by teleos270
1
gravatar for genomax
2.2 years ago by
genomax83k
United States
genomax83k wrote:

Since the main question is about reverse complementing the reads another option is reformat.sh from BBMap suite. Use a shell loop to do more than one file.

reformat.sh in=file.fq.gz out=file_rcmp.fq.gz rcomp=t

If you only want to RC read 2 then use this instead of rcomp

rcompmate=t             (rcm) Reverse-compliment read 2 only
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by genomax83k

Thanks for your suggestion I'll take a look at this.

ADD REPLYlink written 2.2 years ago by teleos270
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1133 users visited in the last hour