Question: Remove short predicted genes from a GFF file
fufuyou80 (United States) wrote 5 weeks ago:

Hi, how can I remove short predicted genes from a GFF file? Or how can I set a minimum value for CDS or protein length? Thanks, Fuyou

genome • 127 views
modified 5 weeks ago by Petr Ponomarenko1.2k • written 5 weeks ago by fufuyou80

Could you please post part of your file and explain why you want to filter your GFF file this way? If you used a program to predict the gene models, the gene length cutoff should be set in that program, because it affects its statistical model. If you want to keep only high-quality predicted gene models and assume that short genes are likely errors, then you should look at GO terms and check whether, in other species, those GO terms are enriched for short genes.

written 5 weeks ago by Petr Ponomarenko1.2k

It would be good if you could add some more information to your question. What would you filter on? Should the distance between start and end be at least some minimum value? Try to be as specific as possible!

written 5 weeks ago by WouterDeCoster14k

Thanks. I don't think it is the distance between start and end. I want to know how to set a minimum length for the protein sequence or the CDS. Fuyou

written 5 weeks ago by fufuyou80

A minimum what? And if you aren't sure, how should we know? Maybe it's best that you first figure out what you want before asking people to help you.

written 5 weeks ago by WouterDeCoster14k

Thanks. I am sorry that my question was not clear. What I mean is that I have a GFF file produced by gene prediction software, but I find that some of the genes are very short, and I want to remove them. For example, I would like every gene in this GFF file to have a protein sequence longer than 50 aa, or a CDS longer than 150 bp. I want to remove predicted genes whose protein sequences are shorter than 50 aa. For example:

ctg123 . mRNA            1300  9000  .  +  .  ID=mrna0001;Parent=operon001;Name=sonichedgehog
ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001
ctg123 . exon            3000  3902  .  +  .  Parent=mrna0001
ctg123 . exon            5000  5500  .  +  .  Parent=mrna0001
ctg123 . exon            7000  9000  .  +  .  Parent=mrna0001
ctg123 . mRNA           10000 10120  .  +  .  ID=mrna0002;Parent=operon001;Name=subsonicsquirrel
ctg123 . exon           10000 10120  .  +  .  Parent=mrna0002

I want to remove the second predicted gene, mrna0002.

modified 5 weeks ago by genomax224k • written 5 weeks ago by fufuyou80
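The CDS-length criterion described above (CDS longer than 150 bp, roughly a protein of 50 aa) could be sketched as a two-pass awk filter. This is only a sketch under several assumptions: the function name `filter_by_feature_sum` is hypothetical, the file carries `CDS` features whose `Parent` attribute names exactly one mRNA, `ID` and `Parent` values contain no spaces, and "long enough" means the summed feature length is at least the cutoff. The sample posted above has only exon lines, so the feature type is left as an optional third argument.

```shell
# filter_by_feature_sum MIN_LEN FILE [TYPE]   (hypothetical helper name)
# Keep an mRNA, and the features whose Parent is that mRNA, only if the summed
# length of its TYPE features (default: CDS) is at least MIN_LEN.
filter_by_feature_sum() {
    awk -v min="$1" -v ftype="${3:-CDS}" '
        # Return the value of attribute "key" in column 9, or "" if absent.
        function attr(key,    n, i, parts, kv) {
            n = split($9, parts, ";")
            for (i = 1; i <= n; i++) {
                split(parts[i], kv, "=")
                if (kv[1] == key) return kv[2]
            }
            return ""
        }
        # Comment lines are printed once, during the second pass.
        /^#/ { if (NR != FNR) print; next }
        # Pass 1: remember mRNA IDs and add up feature lengths per parent.
        # GFF coordinates are 1-based and inclusive, hence the + 1.
        NR == FNR {
            if ($3 == "mRNA") is_mrna[attr("ID")] = 1
            if ($3 == ftype)  total[attr("Parent")] += $5 - $4 + 1
            next
        }
        # Pass 2: print an mRNA only if its summed length reaches the cutoff.
        $3 == "mRNA" { if (total[attr("ID")] + 0 >= min) print; next }
        # Pass 2: drop children of rejected mRNAs, print everything else.
        {
            p = attr("Parent")
            if ((p in is_mrna) && total[p] + 0 < min) next
            print
        }
    ' "$2" "$2"
}
```

Possible usage: `filter_by_feature_sum 150 your_file > new_file`, or `filter_by_feature_sum 150 your_file exon > new_file` to sum exon lengths instead. Overlapping features of the same type, such as the first two exons in the sample, are counted twice by this sketch, and a transcript with no features of the chosen type is treated as length 0 and removed.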

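How about: awk '{if (($5 - $4)> 150) print $0}' your_file > new_file. Adjust 150 to the cutoff you need; lines where end minus start is at or below that value are dropped. Note that this tests every feature line, exons included, and that GFF coordinates are 1-based and inclusive, so the exact feature length is $5 - $4 + 1.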

modified 5 weeks ago • written 5 weeks ago by genomax224k

Thanks, but I think the filter should be applied only to the mRNA lines.

written 5 weeks ago by fufuyou80

If you want to remove only the short mRNA lines (every exon line is kept regardless of length): awk '{if (($5 - $4)> 150 || ($3 == "exon")) print $0}' your_file > new_file
If you want to always keep the mRNA lines and apply the length filter only to the other features: awk '{if (($5 - $4)> 150 || ($3 == "mRNA")) print $0}' your_file > new_file

modified 5 weeks ago • written 5 weeks ago by genomax224k

Thanks. What I mean is: if, for one gene (for example mrna0001), $5-$4 > 150 on the mRNA line, I want to keep both the mRNA line and its exon lines. If, for another gene (for example mrna0002), $5-$4 < 150 on the mRNA line, I want to remove both the mRNA line and its exon lines. The result I want is:

ctg123 . mRNA            1300  9000  .  +  .  ID=mrna0001;Parent=operon001;Name=sonichedgehog
ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001
ctg123 . exon            3000  3902  .  +  .  Parent=mrna0001
ctg123 . exon            5000  5500  .  +  .  Parent=mrna0001
ctg123 . exon            7000  9000  .  +  .  Parent=mrna0001

I think your code should be close to what I want. I really appreciate your help. Fuyou

modified 5 weeks ago by genomax224k • written 5 weeks ago by fufuyou80
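The behaviour requested here (drop a short mRNA together with its exon lines) needs the file to be read twice, because an exon line only names its parent and does not carry the mRNA coordinates. A minimal sketch, not a tested pipeline: the function name `filter_short_mrna` is hypothetical, each child is assumed to list a single `Parent`, `ID` and `Parent` values are assumed to contain no spaces, and an mRNA is kept when its span (end - start + 1) is at least the cutoff.

```shell
# filter_short_mrna MIN_LEN FILE   (hypothetical helper name)
# Drop every mRNA whose span (end - start + 1) is below MIN_LEN, together
# with all features whose Parent attribute names that mRNA.
filter_short_mrna() {
    awk -v min="$1" '
        # Return the value of attribute "key" in column 9, or "" if absent.
        function attr(key,    n, i, parts, kv) {
            n = split($9, parts, ";")
            for (i = 1; i <= n; i++) {
                split(parts[i], kv, "=")
                if (kv[1] == key) return kv[2]
            }
            return ""
        }
        # Comment lines are printed once, during the second pass.
        /^#/ { if (NR != FNR) print; next }
        # Pass 1: collect the IDs of mRNAs that are too short.
        NR == FNR {
            if ($3 == "mRNA" && ($5 - $4 + 1) < min) {
                id = attr("ID")
                if (id != "") short[id] = 1
            }
            next
        }
        # Pass 2: skip short mRNAs and their children, print the rest.
        {
            if ($3 == "mRNA" && (attr("ID") in short)) next
            p = attr("Parent")
            if (p != "" && (p in short)) next
            print
        }
    ' "$2" "$2"
}
```

Possible usage: `filter_short_mrna 150 your_file > new_file`. On the eight-line sample above this should leave only the mrna0001 line and its five exon lines. Parent features such as genes or operons are not removed, even if all of their mRNAs are dropped.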