Question

Reverse complement for multiple fastq files

0

Entering edit mode

7.3 years ago

teleos27 • 0

Hello! I have a data set of various fastq files, all of these are forward readings:

ERS1355434.fq ERS1355435.fq ERS1355436.fq ERS1355437.fq ERS1355439.fq ERS1355440.fq ERS1355442.fq

I want to get the reverse complement for these fastq sequences.

I am using biopython to do this, so for one fastq I am doing the following:

from Bio import SeqIO
records=(rec.reverse_complement(id="rc_"+rec.id, description='reverse complement')
for rec in
SeqIO.parse('ERS1355454.fq','fastq'))
SeqIO.write(records,seq_record.id + "R2.fastq",'fastq')

And this works, but now I am struggling with figuring out a way to do the same on all the fastq files. I am thinking on either run a loop or write a function and loop it through all the files. I am at beginning stage and not sure how to approach it, if someone has any suggestions on how to do this it will help a lot!

Thanks

biopython • 5.7k views

ADD COMMENT • link updated 7.3 years ago by GenoMax 152k • written 7.3 years ago by teleos27 • 0

0

Entering edit mode

~~there is no question.~~

ADD REPLY • link 7.3 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

7.3 years ago

GenoMax 152k

Since the main question is about reverse complementing the reads another option is reformat.sh from BBMap suite. Use a shell loop to do more than one file.

reformat.sh in=file.fq.gz out=file_rcmp.fq.gz rcomp=t

If you only want to RC read 2 then use this instead of rcomp

rcompmate=t             (rcm) Reverse-compliment read 2 only

ADD COMMENT • link 7.3 years ago by GenoMax 152k

0

Entering edit mode

Thanks for your suggestion I'll take a look at this.

ADD REPLY • link 7.3 years ago by teleos27 • 0

GenoMax · Accepted Answer · 2018-03-14

2

Entering edit mode

7.3 years ago

Joe 22k

Replace your file name with sys.argv[1], import the sys module, and then save your script to call it in a loop from the command line.

Edit: here's the full code you'll need. I discovered that upon reverse complementing, BioPython throws away the header information for some reason, so you need to hack that a bit:

import sys
from Bio import SeqIO

recs = SeqIO.parse(sys.argv[1], 'fastq')

for rec in recs:
    rc = rec.reverse_complement()
    rc.id = rec.id
    rc.name = rec.name + '_R2.fastq'
    rc.description = ''
    SeqIO.write(rc, rc.name, 'fastq')

Call this in a shell loop to run over all your sequences.:

for file in /path/to/*.fastq ; do
    python reverse_complement.py $file
done

(You can do this in a one-liner at the shell prompt if you wish too: for file in /path/to/*.fastq ; do python reverse_complement.py $file ; done)

NOTE

This will make a new file for every reverse complemented read (you can change rc.name in the last line if you don't want this). Also, this will mean the forward and reverse sequences will share the same fastq header name, which may not be ideal. In which case, edit rc.id = rec.id + '_some_string' or whatever you need.

ADD COMMENT • link 7.3 years ago by Joe 22k

1

Entering edit mode

The annotation will not in general apply to the RC sequence, so in the .reverse_complement(...) arguments you can provide new values (as in the updated original question) or use True to preserve a value. See the method's documentation:

http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html#reverse_complement

ADD REPLY • link 7.3 years ago by Peter 6.0k

0

Entering edit mode

Thanks so much for your response! I replaced my file name with sys.argv[1] and import sys, and saved the script. Then I created a bash script:

#!/bin/bash
for i in files;do
   ./reverse_comp.py 
done

I run the script and I get the following error:

Traceback (most recent call last):
  File "./reverse_comp.py", line 7, in <module>
    for rec in SeqIO.parse(sys.argv[1],'fastq'))
IndexError: list index out of range

Do you have any suggestions on what I'm doing wrong?

ADD REPLY • link updated 7.3 years ago by GenoMax 152k • written 7.3 years ago by teleos27 • 0

0

Entering edit mode

You need to pass the file to the script as the first argument:

python reverse_comp.py $i

You get index out of range because [1] evaluates to nothing in your loop.

ADD REPLY • link 7.3 years ago by Joe 22k

0

Entering edit mode

Thanks for your response! I am so sorry for this, but I have a different error for something else. I really appreciate your thoughts on this. I now get this error:

Traceback (most recent call last):
 File "./reverse_comp.py", line 9, in <module>
SeqIO.write(records,seq_record.id + "R2.fastq",'fastq')
NameError: name 'seq_record' is not defined

So I defined my seq_record by doing this:

!/usr/bin/env python
from Bio import SeqIO
import sys
records=(rec.reverse_complement(id="rc_"+rec.id, description='reverse complement')
    for rec in SeqIO.parse(sys.argv[1],'fastq'))
identifiers=[seq_record.id for seq_record in SeqIO.parse(sys.argv[1],'fastq')]
SeqIO.write(records,identifiers + "R2.fastq",'fastq')

But I get this error:

IOError: [Errno 2] No such file or directory: 'files'

Thanks so much for your help!

ADD REPLY • link 7.3 years ago by teleos27 • 0

0

Entering edit mode

That code is all over the place sorry! - so I've just written some from scratch in my edited answer above.