Question: remove out short predict genes from gff file
0
gravatar for fufuyou
8 months ago by
fufuyou90
United States
fufuyou90 wrote:

Hi, How can I remove out short predict genes from gff file? Or how can I set a value for CDS or protein length? Thanks, Fuyou

genome • 308 views
ADD COMMENTlink modified 3 months ago by Biostar ♦♦ 20 • written 8 months ago by fufuyou90
2

Could you please provide part of your file and explain the reason for such filtration of your gff file. If you used a program to predict gene models then the gene length cutoff should be set in it because it affects your statistical model. If you are trying to have only high quality predicted gene models and you assumed that short genes are potential errors, then you have to look at GO terms and see if in other species these GO terms are enriched with short genes.

ADD REPLYlink written 8 months ago by Petr Ponomarenko2.4k

It would be good if you could add some more information to your question. Based on what would you filter? The distance between begin and end has to be a minimal value? Try to be as specific as possible!

ADD REPLYlink written 8 months ago by WouterDeCoster22k

Thanks. I think it is not the distance between begin and end. I think I want to know how to set a minimal value for protein sequence or CDS. fUYOU

ADD REPLYlink written 8 months ago by fufuyou90
2

A minimum what? And if you aren't sure, how should we know? Maybe it's best that you first figure out what want before asking people to help you.

ADD REPLYlink written 8 months ago by WouterDeCoster22k

Thanks. I am sorry about my quesition is not clear. My mean is that I have gotten a gff files based on some predict software. But I find some genes is so short. I want to remove out these short genes. For example, I hope all genes protein sequences is more than 50 aa using this gff files. Or all genes CDS is more than 150 bp. I want to remove out some predicted genes with lower than 50 aa protein sequences. Like as following:

ctg123 . mRNA            1300  9000  .  +  .  ID=mrna0001;Parent=operon001;Name=sonichedgehog
ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001
ctg123 . exon            3000  3902  .  +  .  Parent=mrna0001
ctg123 . exon            5000  5500  .  +  .  Parent=mrna0001
ctg123 . exon            7000  9000  .  +  .  Parent=mrna0001
ctg123 . mRNA           10000 10120  .  +  .  ID=mrna0002;Parent=operon001;Name=subsonicsquirrel
ctg123 . exon           10000 10120  .  +  .  Parent=mrna0002

I want to remove out the second predicted, mrna0002.

ADD REPLYlink modified 8 months ago by genomax34k • written 8 months ago by fufuyou90
1

How about: awk '{if (($5 - $4)> 150) print $0}' your_file > new_file Adjust 150 to a value that will exclude things smaller than that length.

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax34k

Thanks, But I think I should only do mRNA line.

ADD REPLYlink written 8 months ago by fufuyou90
1

If you want to only remove mRNA line then: awk '{if (($5 - $4)> 150 || ($3 == "exon")) print $0}' your_file > new_file.
If you want to only keep mRNA line then: awk '{if (($5 - $4)> 150 || ($3 == "mRNA")) print $0}' your_file > new_file

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax34k

Thanks, My mean is if one gene, for example mrna0001, $5-$4 > 150 in mRNA line, I want to keep mRNA and exon line. If one gene, for example mrna0002, $5-$4 < 150 in mRNA line, I want to remove both mRNA and exon. I want to get the result is

ctg123 . mRNA            1300  9000  .  +  .  ID=mrna0001;Parent=operon001;Name=sonichedgehog
ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001
ctg123 . exon            3000  3902  .  +  .  Parent=mrna0001
ctg123 . exon            5000  5500  .  +  .  Parent=mrna0001
ctg123 . exon            7000  9000  .  +  .  Parent=mrna0001

. I think your code shoul be close what I want. I am very appreciated your help. Fuyou

ADD REPLYlink modified 8 months ago by genomax34k • written 8 months ago by fufuyou90

Careful with the $5 -$4 thing, that's the length of mRNA in genomic coordinates ($end-$start) and this is not the same as the length of the transcript not to mention CDS. There are no annotation of CDS in your example GFF, nor UTRs. Without this information, the length of the CDS cannot be determined. In addition, there is something more that is odd:

ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001

These exons overlap, but they have the same parent transcript, but if one has a different start, they cannot both belong to the same mRNA, can they? Even if, it shows that you cannot just sum over the length of each exon for each transcript.

ADD REPLYlink modified 3 months ago • written 3 months ago by Michael Dondrup43k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 948 users visited in the last hour