How to parse GFF3 file and get start and end coordinates in Python
3
1
Entering edit mode
9.0 years ago

I am new to Python. I want to create a new GFF3 file from an existing file by filtering out features that are smaller than 1 kb.

Is there a way to parse a GFF3 file containing feature information of semicolon separated tags as well as start and end of the feature, and create a new file altogether?

My gff file looks like below:

seq_chr1    S-MART    match    158337    160567    .    -    .    Superfamily=LINE;Target=RIL-Map20 356 2619;ID=ms1_seq_chr1_RIL-Map20;Order=TE;Class=Unknown;Identity=93.9881;Name=ms1_seq_chr1_RIL-Map20
python parse gff3 • 12k views
ADD COMMENT
3
Entering edit mode
9.0 years ago
Ryan Dale 5.0k

As your analysis gets more complex, you might find gffutils useful for accessing GFF attributes. It has a fairly robust GFF/GTF parser. Here's an example that only keeps >1kb features that are in the LINE superfamily (as you suggest is your goal in a comment on another answer).

(edit: disclaimer, I'm the primary author)

ADD COMMENT
1
Entering edit mode
9.0 years ago
Daniel ★ 4.0k

This is a simple perl script to filter by size which I had lying around which could do that, but python would be just as simple to learn for the long run. I recommend www.pythonforbiologists.com

#!/usr/bin/perl
use strict;
use warnings;

my $usage = "## USAGE:  filter_csv.pl infile.txt [min value]\n";

my $infile = $ARGV[0] or die $usage;
my $filter = $ARGV[1] or die $usage;

open INFILE, "<", "$infile" or die "$usage | check $infile\n";

my $n=0;

while (my $line = <INFILE>){
    $n++;
    if ($n == 1){
        ## Don't process on header line (Maybe you don't need this bit)
        print $line;
    }else{
        my @l = split(/\t/, $line);
        if ($l[4] - $l[3] >= $filter){
            print $line;
        }
    }
}

You would run it like so:

./filter_csv.pl my_file.gff 1000

You could add in a filter for the right term by using a similar if clause and adding it in the size loop.

EDIT: For example (but not tested):

while (my $line = <INFILE>){
    $n++;
    if ($n == 1){
        ## Don't process on header line (Maybe you don't need this bit)
        print $line;
    }else{
       my @l = split(/\t/, $line);
       if ($l[4] - $l[3] >= $filter) && ($line ~= /$term/`){
            print $line;
        }
    }
}
ADD COMMENT
1
Entering edit mode
9.0 years ago

Since you asked for Python, this simple script should work.

import sys
for line in sys.stdin:
    line = line.rstrip()
    values = line.split("\t")
    start, end = int(values[3]), int(values[4])
    length = end - start + 1
    if length >= 1000:
        print line


It was not tested on a file containing comments or pragmas, but that should be easy to account for.

In your question, you mention semicolon-separated attributes. If you're trying to further process the data (rather than just filtering out lines < 1 kb), you'll need to be more specific about what you need.

ADD COMMENT
0
Entering edit mode

Thanks for your reply. I implemented the method and it worked. Yes, I also need to keep everything with attribute SuperFamily=LINES and filter out everything else. I hoped there was an implementation with Biopython, but I guess I can work with re and search.

ADD REPLY
0
Entering edit mode

No need for regular expressions or the search function there: a simple change will suffice. Change if length >= 1000: to if length >= 1000 and "SuperFamily=LINES" in values[8]:.

ADD REPLY

Login before adding your answer.

Traffic: 2457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6