Question

How can I parse GFF3 using BCBio?

0

Entering edit mode

4.2 years ago

radha.jg • 0

I have the following code:

import sys
from BCBio import GFF
file = sys.argv[1]
for record in GFF.parse(file):
      print([record.seqid,record.source, record.start, record.end, record.strand])

My question is how can extract the ids, the chromosome, the coordinates and the strand from a GFF3 files using python/biopython (I'm bounded to python because this is part of a bigger program writen in said language). I thought since record.seqid, record.source, record.start, record.end, record.strand is used to write new GFF, this could work, but clearly it doesn't.

I need this solved ASAP but also I need to learn, so please, anyone out there, please help.

Thanks in advance

Jenifer

biopython • 2.8k views

ADD COMMENT • link updated 4.1 years ago by Biostar 20 • written 4.2 years ago by radha.jg • 0

0

Entering edit mode

Try this package in python - https://pythonhosted.org/gffutils/

This package creates a "local" database from your gff file that you can interact with in relatively easier manner. For instance, I was able to extract start stop position of gene along with exons coding for that gene and their start and stop positions. Hope this helps!

ADD REPLY • link 4.2 years ago by manaswwm ▴ 490

0

Entering edit mode

Ok, so now what I need is an iterator to get all CDS with all start, end and and strand for a chromosome. Does this library provide one, or how do I get this.

Context: For each chromosome I calculated the GC_Skew and now I need to show in that plot, the zones that have CDS, distinguishing the zones that have CDS and the strand. The function:

def garphicale(GC_skew,window,overlap,anotation,salida):
    x_axis = list()
    for i in range(0,len(GC_skew)):
        x_axis.append(i)

    plt.scatter(x_axis,GC_skew)

#Anotation only can be None or a gffutils database
    if anotation != None:
        db = gffutils.FeatureDB(anotation, keep_order=True)

        #here has to get the start, end and strand of all CDS for the "salida" seqids
        #here the window and overlap play they roles because it has to plot in the gc-skew
        #on the corresponding coordinates and the color indicates the strand
        #here has to add in the GC_skew graph a diferent color for CDS

    plt.savefig(salida)

GC_Skew parameter is a list with the GC-skew calculation, window and overlap is the size of the window and the overlap, salida is the name of the chromosome (seqid) and anotation can be None or a created database like this: anotation = gffutils.create_db(fn, dbfn = anotation_file) So the thing is I don't have a clear idea how gffutils works and I don't know if I can get all CDS to add that data to the plot If you can give me a hand in this, I will apreciate it

ADD REPLY • link 4.2 years ago by radha.jg • 0

0

Entering edit mode

The tutorial (https://pythonhosted.org/gffutils/) is very straightforward, for example let us presume that the local database that you have created is by the name db and the gene whose CDS you would like to extract is assigned name gene, then you can accomplish what you want to by the following:

for item in db.children(gene, featuretype="CDS"):
      print(item.start) #item becomes an object with features like start, end, strand etc

See if this helps! To be honest I do not have an entire idea of what you are looking to do, if this does not fix your problem then maybe elaborate your next problem and we can see if we can help

ADD REPLY • link 4.2 years ago by manaswwm ▴ 490

0

Entering edit mode

Yes, I saw that, so now what I need to do is to, get a list of genes for one seqid (salida parameter), from the gff3 database. Then construct a list of tuples for each gene and then, add that data to the plot... Mmm I really need to meditate how to do this

EDIT: ok, thing is, I just have the chromosome (seqid) so in order for your solution to work, I need to get a list of the genes, but i don't know how to get it. The only thing I know is, for my script so work, is something like this:

def garphicale(GC_skew,window,overlap,anotation,salida):
    x_axis = list()
    for i in range(0,len(GC_skew)):
        x_axis.append(i)

    plt.scatter(x_axis,GC_skew)

#Anotation only can be None or a gffutils database
    if anotation != None:
        genes = #list of all CDS from the "salida" seqid
        for gene in genes:
            for item in anotation.children(gene, featuretype="CDS"):
                coord, strand = (item.start//(window-overlap))-1, item.strand
                if strand == "+":
                    plt.scatter(coord,GC_skew[coord],color = red)
                if strand == "-":
                    plt.scatter(coord,GC_skew[coord],color = black)
        #here has to get the start, end and strand of all CDS for the "salida" seqids
        #here the window and overlap play they roles because it has to plot in the gc-skew
        #on the corresponding coordinates and the color indicates the strand
        #here has to add in the GC_skew graph a diferent color for CDS

    plt.savefig(salida)