remove out short predict genes from gff file
0
0
Entering edit mode
7.2 years ago
fufuyou ▴ 110

Hi, How can I remove out short predict genes from gff file? Or how can I set a value for CDS or protein length? Thanks, Fuyou

genome • 3.7k views
ADD COMMENT
2
Entering edit mode

Could you please provide part of your file and explain the reason for such filtration of your gff file. If you used a program to predict gene models then the gene length cutoff should be set in it because it affects your statistical model. If you are trying to have only high quality predicted gene models and you assumed that short genes are potential errors, then you have to look at GO terms and see if in other species these GO terms are enriched with short genes.

ADD REPLY
0
Entering edit mode

It would be good if you could add some more information to your question. Based on what would you filter? The distance between begin and end has to be a minimal value? Try to be as specific as possible!

ADD REPLY
0
Entering edit mode

Thanks. I think it is not the distance between begin and end. I think I want to know how to set a minimal value for protein sequence or CDS. fUYOU

ADD REPLY
2
Entering edit mode

A minimum what? And if you aren't sure, how should we know? Maybe it's best that you first figure out what want before asking people to help you.

ADD REPLY
0
Entering edit mode

Thanks. I am sorry about my quesition is not clear. My mean is that I have gotten a gff files based on some predict software. But I find some genes is so short. I want to remove out these short genes. For example, I hope all genes protein sequences is more than 50 aa using this gff files. Or all genes CDS is more than 150 bp. I want to remove out some predicted genes with lower than 50 aa protein sequences. Like as following:

ctg123 . mRNA            1300  9000  .  +  .  ID=mrna0001;Parent=operon001;Name=sonichedgehog
ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001
ctg123 . exon            3000  3902  .  +  .  Parent=mrna0001
ctg123 . exon            5000  5500  .  +  .  Parent=mrna0001
ctg123 . exon            7000  9000  .  +  .  Parent=mrna0001
ctg123 . mRNA           10000 10120  .  +  .  ID=mrna0002;Parent=operon001;Name=subsonicsquirrel
ctg123 . exon           10000 10120  .  +  .  Parent=mrna0002

I want to remove out the second predicted, mrna0002.

ADD REPLY
2
Entering edit mode

How about: awk '{if (($5 - $4)> 150) print $0}' your_file > new_file Adjust 150 to a value that will exclude things smaller than that length.

ADD REPLY
0
Entering edit mode

Thanks, But I think I should only do mRNA line.

ADD REPLY
1
Entering edit mode

If you want to only remove mRNA line then: awk '{if (($5 - $4)> 150 || ($3 == "exon")) print $0}' your_file > new_file.
If you want to only keep mRNA line then: awk '{if (($5 - $4)> 150 || ($3 == "mRNA")) print $0}' your_file > new_file

ADD REPLY
0
Entering edit mode

Thanks, My mean is if one gene, for example mrna0001, $5-$4 > 150 in mRNA line, I want to keep mRNA and exon line. If one gene, for example mrna0002, $5-$4 < 150 in mRNA line, I want to remove both mRNA and exon. I want to get the result is

ctg123 . mRNA            1300  9000  .  +  .  ID=mrna0001;Parent=operon001;Name=sonichedgehog
ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001
ctg123 . exon            3000  3902  .  +  .  Parent=mrna0001
ctg123 . exon            5000  5500  .  +  .  Parent=mrna0001
ctg123 . exon            7000  9000  .  +  .  Parent=mrna0001

. I think your code shoul be close what I want. I am very appreciated your help. Fuyou

ADD REPLY
0
Entering edit mode

Careful with the $5 -$4 thing, that's the length of mRNA in genomic coordinates ($end-$start) and this is not the same as the length of the transcript not to mention CDS. There are no annotation of CDS in your example GFF, nor UTRs. Without this information, the length of the CDS cannot be determined. In addition, there is something more that is odd:

ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001

These exons overlap, but they have the same parent transcript, but if one has a different start, they cannot both belong to the same mRNA, can they? Even if, it shows that you cannot just sum over the length of each exon for each transcript.

ADD REPLY

Login before adding your answer.

Traffic: 2722 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6