Question: Gc Content In Mate Pairs / Pair-End
0
gravatar for Ric
7.5 years ago by
Ric270
Australia
Ric270 wrote:

Hello, In BioPython it is possible to calculate the GC content from a sequence in the following way:

>>> from Bio.SeqUtils import GC
>>> GC("ACTGN")
40.0

However, how to do it with mate pair reads? Should, I only consider the A read only and ignore the B read?

Thank you in advance.

paired gc biopython • 2.4k views
ADD COMMENTlink modified 2.6 years ago by Biostar ♦♦ 20 • written 7.5 years ago by Ric270
1
gravatar for John St. John
7.5 years ago by
John St. John1.1k
San Francisco, CA, Cancer Therapeutics Innovation Group
John St. John1.1k wrote:

You could use the .next() method of python's iterator to go over two sequences at a time with something like this:

from Bio import SeqIO
handle1 = open("example_1.fq", "rU")
handle2 = open("example_2.fq", "rU")
itter1 = SeqIO.parse(handle1, "fastq")
itter2 = SeqIO.parse(handle2, "fastq")
for record1 in itter1 :
    record2 = itter2.next()
    #Calculate GC on record 1 and record 2
handle1.close()
handle2.close()

This will work as long as the reads are sorted in your two fastq files. This is pretty typical for what comes off of the illumina machine. Another option if your fastq file of mates is interleaved, you could do something similar but with a single handle:

from Bio import SeqIO
handle = open("example_interleaved.fq", "rU")
fq_itter = SeqIO.parse(handle, "fastq")
for record1 in fq_itter:
    record2 = fq_itter.next()
    #Calculate GC on record 1 and record 2
handle.close()

Since for bla in bla_itter just calls .next() over and over again on bla_itter you don't get issues with seeing the same record twice, if you call .next inside of the inner loop it moves the outer loop forward an extra position as well. That might not be technically correct, but that is how I have seen python behave.

ADD COMMENTlink written 7.5 years ago by John St. John1.1k

Thank you for the code.

ADD REPLYlink written 7.5 years ago by Ric270
0
gravatar for Fidel
7.5 years ago by
Fidel1.9k
Germany
Fidel1.9k wrote:

Depends on what you want. If you want to compute the GC content of each read, then take into account both reads. This is what quality control software like FastQC does.

If you want to compute the GC content of the fragment defined by the paired reads, then use that sequence instead.

ADD COMMENTlink written 7.5 years ago by Fidel1.9k

Thank you for your response.

READ a: @aaa ACTG + hhsl

READ b: @bbb TGCT + hhsl

Does FastQC concatenate READ a and b together in the following way:

READ a + b: @aaa+bbb ACTGTGCT + hhslhhsl

and it calculate the GC content?

What do you mean to compute the GC content of the fragment defined by the paired reads and for what could I used it?

ADD REPLYlink written 7.5 years ago by Ric270

Thank you for your response.

READ a: @aaa ACTG + hhsl

READ b: @bbb TGCT + hhsl

Does FastQC concatenate READ a and b together in the following way:

READ a + b: @aaa+bbb ACTGTGCT + hhslhhsl

and it calculate the GC content?

What do you mean to compute the GC content of the fragment defined by the paired reads and for what could I used it?

ADD REPLYlink written 7.5 years ago by Ric270

As far as I know FastQC takes each read independently. I see no reason for concatenating the two reads one after the other. If you have mate pairs, from paired-end sequencing assay then, they represent the two extremes of an original fragment. See http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn

Can you explain why do you want to compute GC content?

ADD REPLYlink written 7.5 years ago by Fidel1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1032 users visited in the last hour