Question: How to parse a GFF3 file and get start and end coordinates in Python
rimjhim.roy.ch wrote:

I am new to Python. I want to create a new GFF3 file from an existing file by filtering out features that are smaller than 1 kb. 

Is there a way to parse a GFF3 file, which stores each feature's start and end coordinates along with a column of semicolon-separated attribute tags, and write the filtered features to a new file?

My GFF file looks like this:

seq_chr1    S-MART    match    158337    160567    .    -    .    Superfamily=LINE;Target=RIL-Map20 356 2619;ID=ms1_seq_chr1_RIL-Map20;Order=TE;Class=Unknown;Identity=93.9881;Name=ms1_seq_chr1_RIL-Map20

 

Ryan Dale wrote:

As your analysis gets more complex, you might find gffutils useful for accessing GFF attributes; it has a fairly robust GFF/GTF parser. Here's an example that keeps only features longer than 1 kb that belong to the LINE superfamily (which, as you suggest in a comment on another answer, is your goal).

(edit: disclaimer, I'm the primary author)
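
A minimal sketch of that approach (the input and output file names are placeholders, and the Superfamily=LINE attribute is taken from the sample line in the question):

import gffutils

# Build a small in-memory database from the GFF3 file
db = gffutils.create_db("my_file.gff3", dbfn=":memory:",
                        keep_order=True, merge_strategy="create_unique")

with open("filtered.gff3", "w") as out:
    for feature in db.all_features():
        # GFF3 coordinates are 1-based and inclusive
        length = feature.end - feature.start + 1
        superfamily = feature.attributes.get("Superfamily", [""])[0]
        if length >= 1000 and superfamily == "LINE":
            # str(feature) reconstructs the GFF3 line for output
            out.write(str(feature) + "\n")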

Daniel wrote:

This is a simple Perl script to filter by size that I had lying around and that could do this, but Python would be just as simple to learn for the long run. I recommend www.pythonforbiologists.com

#!/usr/bin/perl
use strict;
use warnings;

my $usage = "## USAGE:  filter_csv.pl infile.gff [min length]\n";

my $infile = $ARGV[0] or die $usage;
my $filter = $ARGV[1] or die $usage;

open INFILE, "<", $infile or die "$usage | check $infile\n";

my $n = 0;

while (my $line = <INFILE>){
    $n++;
    if ($n == 1){
        ## Pass the first (header) line through unchanged (drop this branch if there is no header)
        print $line;
    }else{
        my @l = split(/\t/, $line);
        ## Columns 4 and 5 (0-based indices 3 and 4) hold start and end;
        ## GFF3 coordinates are 1-based and inclusive, hence the + 1
        if ($l[4] - $l[3] + 1 >= $filter){
            print $line;
        }
    }
}
close INFILE;

You would run it like so:

./filter_csv.pl my_file.gff 1000

You could add a filter for the right term with a similar if clause inside the size check, as in the edit below.

EDIT: For example (but not tested):

 

while (my $line = <INFILE>){
    $n++;
    if ($n == 1){
        ## Pass the first (header) line through unchanged
        print $line;
    }else{
        my @l = split(/\t/, $line);
        ## $term holds the attribute text to keep, e.g. my $term = $ARGV[2];
        if ($l[4] - $l[3] + 1 >= $filter && $line =~ /$term/){
            print $line;
        }
    }
}
Daniel Standage wrote:

Since you asked for Python, this simple script should work.

import sys

for line in sys.stdin:
    line = line.rstrip()
    values = line.split("\t")
    # Columns 4 and 5 (indices 3 and 4) hold the start and end coordinates
    start, end = int(values[3]), int(values[4])
    # GFF3 coordinates are 1-based and inclusive
    length = end - start + 1
    if length >= 1000:
        print(line)
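
You would run it like so (the script name here is just a placeholder):

python filter_gff.py < my_file.gff > filtered.gff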


I have not tested it on a file containing comments or pragmas, but those should be easy to account for (for instance, pass through any line that starts with #).

In your question, you mention semicolon-separated attributes. If you're trying to further process the data (rather than just filtering out lines < 1 kb), you'll need to be more specific about what you need.
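
For example, here is a sketch that splits the ninth column into a dict of tags (assuming semicolon-separated key=value pairs, as in the sample line from your question):

import sys

for line in sys.stdin:
    line = line.rstrip()
    # Skip blank lines and comment/pragma lines
    if not line or line.startswith("#"):
        continue
    values = line.split("\t")
    start, end = int(values[3]), int(values[4])
    # Column 9 holds the semicolon-separated key=value attribute tags
    attributes = dict(tag.split("=", 1) for tag in values[8].split(";") if "=" in tag)
    print(attributes.get("Superfamily"), start, end)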

rimjhim.roy.ch replied:

Thanks for your reply. I implemented the method and it worked. Yes, I also need to keep everything with the attribute Superfamily=LINE and filter out everything else. I was hoping there was an implementation in Biopython, but I guess I can work with re and search.

Daniel Standage replied:

No need for regular expressions or the search function there: a simple change will suffice. Change if length >= 1000: to if length >= 1000 and "Superfamily=LINE" in values[8]: (matching the attribute exactly as it appears in your sample line).

ADD REPLYlink written 5.2 years ago by Daniel Standage3.9k