iterating through 2 fastq files and changing read_id in one of them
1
0
Entering edit mode
7.8 years ago

I have two fastq files A.fastq and B.fastq. I want to 1. choose reads from B.fastq if the read_id is present in A.fastq 2. If a read is chosen in B.fastq, I want to modify the read_id of that read by appending to it the (after an underscore) sequence of the correspondig read in A.fastq

I wrote the following which accomplishes task 1 but not task 2.

results = []

chosen = set()

for s in Bio.SeqIO.parse("A.fastq","fastq"):

     chosen.adds.id)

for s in Bio.SeqIO.parse("B.fastq","fastq"):

    if s.id in chosen:
        #####HOW TO MODIFY s.id here to s.id+"_"+sequence from A.fastq for that read_id

        results.append(s)

output_handle = open("B_filt.fastq","w")

SeqIO.write(results,output_handle,"fastq")

output_handle.close()

Please suggest a fix.

next-gen • 1.5k views
ADD COMMENT
2
Entering edit mode
7.8 years ago

This is under the premise that your data is not too large, because we are doing everything in memory.

You did not keep the sequences of the A.fastq anywhere, so you cannot access them. A possibility would be to generate a dictionary while looping over A.fastq, e.g.:

Adict = {}
for s in Bio.SeqIO.parse("A.fastq","fastq"):
     Adict[s.id] = s.seq
chosen = set(Adict.keys())
for s in Bio.SeqIO.parse("B.fastq","fastq"):
    if s.id in chosen:
         s.id = s.id + '_' + Adict[s.id]
         results.append(s)

Another piece of advice, don't use 'weak' variable names like 's', because those don't describe what they are about. Use something longer which is (in a few months time) far easier to read. In addition, don't use the same variable name twice. It probably will not interfere with how your code works, but might give you a headache while debugging and again, in a few months time, it will make things far harder to read and to remember what exactly your code does.

An important part of Python is readability, make sure your code tells the story.

What I also like a lot are dict comprehensions, the first 3 lines of my code above can be written like: Adict = {s.id : s.seq for s in Bio.SeqIO.parse("A.fastq","fastq")}

Which is rather compact :)

ADD COMMENT

Login before adding your answer.

Traffic: 2450 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6