Question: Reverse complement for multiple fastq files
0
gravatar for teleos27
14 months ago by
teleos270
teleos270 wrote:

Hello! I have a data set of various fastq files, all of these are forward readings:

ERS1355434.fq ERS1355435.fq ERS1355436.fq ERS1355437.fq ERS1355439.fq ERS1355440.fq ERS1355442.fq

I want to get the reverse complement for these fastq sequences.

I am using biopython to do this, so for one fastq I am doing the following:

from Bio import SeqIO
records=(rec.reverse_complement(id="rc_"+rec.id, description='reverse complement')
for rec in
SeqIO.parse('ERS1355454.fq','fastq'))
SeqIO.write(records,seq_record.id + "R2.fastq",'fastq')

And this works, but now I am struggling with figuring out a way to do the same on all the fastq files. I am thinking on either run a loop or write a function and loop it through all the files. I am at beginning stage and not sure how to approach it, if someone has any suggestions on how to do this it will help a lot!

Thanks

biopython • 989 views
ADD COMMENTlink modified 14 months ago by genomax67k • written 14 months ago by teleos270

there is no question.

ADD REPLYlink modified 14 months ago • written 14 months ago by Pierre Lindenbaum120k
2
gravatar for jrj.healey
14 months ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

Replace your file name with sys.argv[1], import the sys module, and then save your script to call it in a loop from the command line.

Edit: here's the full code you'll need. I discovered that upon reverse complementing, BioPython throws away the header information for some reason, so you need to hack that a bit:

import sys
from Bio import SeqIO

recs = SeqIO.parse(sys.argv[1], 'fastq')

for rec in recs:
    rc = rec.reverse_complement()
    rc.id = rec.id
    rc.name = rec.name + '_R2.fastq'
    rc.description = ''
    SeqIO.write(rc, rc.name, 'fastq')

Call this in a shell loop to run over all your sequences.:

for file in /path/to/*.fastq ; do
    python reverse_complement.py $file
done

(You can do this in a one-liner at the shell prompt if you wish too: for file in /path/to/*.fastq ; do python reverse_complement.py $file ; done)

NOTE

This will make a new file for every reverse complemented read (you can change rc.name in the last line if you don't want this). Also, this will mean the forward and reverse sequences will share the same fastq header name, which may not be ideal. In which case, edit rc.id = rec.id + '_some_string' or whatever you need.

ADD COMMENTlink modified 14 months ago • written 14 months ago by jrj.healey12k
1

The annotation will not in general apply to the RC sequence, so in the .reverse_complement(...) arguments you can provide new values (as in the updated original question) or use True to preserve a value. See the method's documentation:

http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html#reverse_complement

ADD REPLYlink modified 14 months ago • written 14 months ago by Peter5.8k

Thanks so much for your response! I replaced my file name with sys.argv[1] and import sys, and saved the script. Then I created a bash script:

#!/bin/bash
for i in files;do
   ./reverse_comp.py 
done

I run the script and I get the following error:

Traceback (most recent call last):
  File "./reverse_comp.py", line 7, in <module>
    for rec in SeqIO.parse(sys.argv[1],'fastq'))
IndexError: list index out of range

Do you have any suggestions on what I'm doing wrong?

ADD REPLYlink modified 14 months ago by genomax67k • written 14 months ago by teleos270

You need to pass the file to the script as the first argument:

python reverse_comp.py $i

You get index out of range because [1] evaluates to nothing in your loop.

ADD REPLYlink written 14 months ago by jrj.healey12k

Thanks for your response! I am so sorry for this, but I have a different error for something else. I really appreciate your thoughts on this. I now get this error:

Traceback (most recent call last):
 File "./reverse_comp.py", line 9, in <module>
SeqIO.write(records,seq_record.id + "R2.fastq",'fastq')
NameError: name 'seq_record' is not defined

So I defined my seq_record by doing this:

!/usr/bin/env python
from Bio import SeqIO
import sys
records=(rec.reverse_complement(id="rc_"+rec.id, description='reverse complement')
    for rec in SeqIO.parse(sys.argv[1],'fastq'))
identifiers=[seq_record.id for seq_record in SeqIO.parse(sys.argv[1],'fastq')]
SeqIO.write(records,identifiers + "R2.fastq",'fastq')

But I get this error:

IOError: [Errno 2] No such file or directory: 'files'

Thanks so much for your help!

ADD REPLYlink modified 14 months ago • written 14 months ago by teleos270

That code is all over the place sorry! - so I've just written some from scratch in my edited answer above.

ADD REPLYlink written 14 months ago by jrj.healey12k

Yes I know :( Thank you so much!

ADD REPLYlink written 14 months ago by teleos270
1
gravatar for genomax
14 months ago by
genomax67k
United States
genomax67k wrote:

Since the main question is about reverse complementing the reads another option is reformat.sh from BBMap suite. Use a shell loop to do more than one file.

reformat.sh in=file.fq.gz out=file_rcmp.fq.gz rcomp=t

If you only want to RC read 2 then use this instead of rcomp

rcompmate=t             (rcm) Reverse-compliment read 2 only
ADD COMMENTlink modified 14 months ago • written 14 months ago by genomax67k

Thanks for your suggestion I'll take a look at this.

ADD REPLYlink written 14 months ago by teleos270
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 850 users visited in the last hour