remove UTR annotations from gff
2
0
Entering edit mode
22 months ago
rob234king ▴ 600

I have UTR annotations which I need to remove for a set of genes, which I could do just by deleting but then it would not update the start and stop positions of the gene and mRNA features, and correct exon features. Is there a tool to remove UTR features from a gff file and make the necessary corrections to the remaining mRNA and gene, and exon features? I can't seem to find something but there must be??

gff • 759 views
2
Entering edit mode
22 months ago
AK ★ 2.0k

Hi rob234king,

I think you can try gff3_file_UTR_trimmer.pl from PASApipeline:

(edited: changed the example to the gff3 in PASA directory; still thinking the meaning of a shift of phase, not a copy-paste problem)

$cat test.gff3 gi|68711 TIGR gene 12923 14228 . + . ID=68711.t00017;Name=my_protein_name gi|68711 TIGR mRNA 12923 14228 . + . ID=model.68711.m00017;Parent=68711.t00017 gi|68711 TIGR five_prime_utr 12923 13029 . + . ID=utr5p_of_68711.m00017;Parent=68711.m00017 gi|68711 TIGR exon 12923 13060 . + . ID=68711.e00069;Parent=model.68711.m00017 gi|68711 TIGR CDS 13030 13060 . + 0 ID=13030_13060cds_of_68711.m00017;Parent=model.68711.m00017 gi|68711 TIGR exon 13411 13550 . + . ID=68711.e00070;Parent=model.68711.m00017 gi|68711 TIGR CDS 13411 13550 . + 1 ID=13411_13550cds_of_68711.m00017;Parent=model.68711.m00017 gi|68711 TIGR exon 13677 13802 . + . ID=68711.e00071;Parent=model.68711.m00017 gi|68711 TIGR CDS 13677 13802 . + 0 ID=13677_13802cds_of_68711.m00017;Parent=model.68711.m00017 gi|68711 TIGR exon 13876 14228 . + . ID=68711.e00072;Parent=model.68711.m00017 gi|68711 TIGR CDS 13876 14016 . + 0 ID=13876_14016cds_of_68711.m00017;Parent=model.68711.m00017 gi|68711 TIGR three_prime_utr 14017 14228 . + . ID=utr3p_of_68711.m00017;Parent=68711.m00017$ perl ~/src/PASApipeline-v2.3.3/misc_utilities/gff3_file_UTR_trimmer.pl test.gff3
gi|68711    TIGR    gene    13030   14016   .   +   .   ID=68711.t00017.1;Name=my_protein_name
gi|68711    TIGR    mRNA    13030   14016   .   +   .   ID=model.68711.m00017;Parent=68711.t00017.1;Name=my_protein_name
gi|68711    TIGR    exon    13030   13060   .   +   .   ID=model.68711.m00017.exon1;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13030   13060   .   +   0   ID=cds.model.68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    exon    13411   13550   .   +   .   ID=model.68711.m00017.exon2;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13411   13550   .   +   2   ID=cds.model.68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    exon    13677   13802   .   +   .   ID=model.68711.m00017.exon3;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13677   13802   .   +   0   ID=cds.model.68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    exon    13876   14016   .   +   .   ID=model.68711.m00017.exon4;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13876   14016   .   +   0   ID=cds.model.68711.m00017;Parent=model.68711.m00017

0
Entering edit mode

great thanks, I'll give it a go

0
Entering edit mode

It worked perfectly, once I had the right perl version running for it.

0
Entering edit mode

Weird that the phase of the second CDS have changed from 1 to 2 during the process. A bug or copy past problem?

1
Entering edit mode
22 months ago
Ace ▴ 80

It seems simplest to just use grep for this.

cat $gff | grep -v "five_prime_UTR" | grep -v "three_prime_UTR" >$new_gff

You could probably even just use

cat $gff | grep -v "_UTR" and testing is as easy as cat$gff | grep -v "_UTR" | cut -f 3 | grep -v "#" | sort -u

cat \$gff | grep "_UTR" | cut -f 3 | grep -v "#" | sort -u

wherein the first diagnostic should contain know UTR categories and the last should be only UTR categories

0
Entering edit mode

But as rob234king said:

I could do just by deleting but then it would not update the start and stop positions of the gene and mRNA features, and correct exon features.

0
Entering edit mode

Oh good catch, I must have glossed over that initially. I've written if statements into for loops to do this previously. Logic follows like this:

for gene in list

if strand=+

if UTR_start < gene start

new_Gene_start=UTR_Start

Repeat for end, set - strand conditions, etc etc

but it's a bit bulky and I've never tried to incorporate exons this way. Better to not reinvent the wheel if there's a functional perl script.