Gc Content In Mate Pairs / Pair-End
2
0
Entering edit mode
9.6 years ago
Ric ▴ 360

Hello, In BioPython it is possible to calculate the GC content from a sequence in the following way:

>>> from Bio.SeqUtils import GC
>>> GC("ACTGN")
40.0

However, how to do it with mate pair reads? Should, I only consider the A read only and ignore the B read?

Thank you in advance.

biopython paired gc • 3.0k views
ADD COMMENT
1
Entering edit mode
9.6 years ago
John St. John ★ 1.2k

You could use the .next() method of python's iterator to go over two sequences at a time with something like this:

from Bio import SeqIO
handle1 = open("example_1.fq", "rU")
handle2 = open("example_2.fq", "rU")
itter1 = SeqIO.parse(handle1, "fastq")
itter2 = SeqIO.parse(handle2, "fastq")
for record1 in itter1 :
    record2 = itter2.next()
    #Calculate GC on record 1 and record 2
handle1.close()
handle2.close()

This will work as long as the reads are sorted in your two fastq files. This is pretty typical for what comes off of the illumina machine. Another option if your fastq file of mates is interleaved, you could do something similar but with a single handle:

from Bio import SeqIO
handle = open("example_interleaved.fq", "rU")
fq_itter = SeqIO.parse(handle, "fastq")
for record1 in fq_itter:
    record2 = fq_itter.next()
    #Calculate GC on record 1 and record 2
handle.close()

Since for bla in bla_itter just calls .next() over and over again on bla_itter you don't get issues with seeing the same record twice, if you call .next inside of the inner loop it moves the outer loop forward an extra position as well. That might not be technically correct, but that is how I have seen python behave.

ADD COMMENT
0
Entering edit mode

Thank you for the code.

ADD REPLY
0
Entering edit mode
9.6 years ago
Fidel ★ 2.0k

Depends on what you want. If you want to compute the GC content of each read, then take into account both reads. This is what quality control software like FastQC does.

If you want to compute the GC content of the fragment defined by the paired reads, then use that sequence instead.

ADD COMMENT
0
Entering edit mode

Thank you for your response.

READ a: @aaa ACTG + hhsl

READ b: @bbb TGCT + hhsl

Does FastQC concatenate READ a and b together in the following way:

READ a + b: @aaa+bbb ACTGTGCT + hhslhhsl

and it calculate the GC content?

What do you mean to compute the GC content of the fragment defined by the paired reads and for what could I used it?

ADD REPLY
0
Entering edit mode

Thank you for your response.

READ a: @aaa ACTG + hhsl

READ b: @bbb TGCT + hhsl

Does FastQC concatenate READ a and b together in the following way:

READ a + b: @aaa+bbb ACTGTGCT + hhslhhsl

and it calculate the GC content?

What do you mean to compute the GC content of the fragment defined by the paired reads and for what could I used it?

ADD REPLY
0
Entering edit mode

As far as I know FastQC takes each read independently. I see no reason for concatenating the two reads one after the other. If you have mate pairs, from paired-end sequencing assay then, they represent the two extremes of an original fragment. See http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn

Can you explain why do you want to compute GC content?

ADD REPLY

Login before adding your answer.

Traffic: 2185 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6