Filtering GFF annotations file
2.9 years ago
alslonik ▴ 190

Hi all.

I am trying to filter my MAKER annotation output in GFF format. I filtered the full set of annotations based on AED score and other parameters (e.g. similarity to known proteins of similar origin). I am using readGFF from the rtracklayer library in R as a parser and as a tool to manipulate the file.

Here is what it looks like:

DataFrame with 12 rows and 17 columns
       seqid   source     type     start       end     score      strand     phase                                                     ID
    <factor> <factor> <factor> <integer> <integer> <numeric> <character> <integer>                                            <character>
1       Chr1       NA   contig         1  57885339        NA           *        NA                                                   Chr1
2       Chr1    maker     gene    261435    262314        NA           +        NA               augustus_masked-Chr1-processed-gene-0.56
3       Chr1    maker     mRNA    261435    262314        NA           +        NA        augustus_masked-Chr1-processed-gene-0.56-mRNA-1
4       Chr1    maker     exon    261435    261545        NA           +        NA augustus_masked-Chr1-processed-gene-0.56-mRNA-1:exon:0
5       Chr1    maker     exon    261819    262079        NA           +        NA augustus_masked-Chr1-processed-gene-0.56-mRNA-1:exon:1
...      ...      ...      ...       ...       ...       ...         ...       ...                                                    ...
8       Chr1    maker      CDS    261819    262079        NA           +         0    augustus_masked-Chr1-processed-gene-0.56-mRNA-1:cds
9       Chr1    maker      CDS    262258    262314        NA           +         0    augustus_masked-Chr1-processed-gene-0.56-mRNA-1:cds
10      Chr1    maker     gene     88146     88709        NA           +        NA               augustus_masked-Chr1-processed-gene-0.47
11      Chr1    maker     mRNA     88146     88709        NA           +        NA        augustus_masked-Chr1-processed-gene-0.47-mRNA-1
12      Chr1    maker     exon     88146     88709        NA           +        NA augustus_masked-Chr1-processed-gene-0.47-mRNA-1:exon:3
                                               Name                                          Parent        _AED       _eAED                    _QI      Target
                                        <character>                                 <CharacterList> <character> <character>            <character> <character>
1                                              Chr1                                                          NA          NA                     NA          NA
2          augustus_masked-Chr1-processed-gene-0.56                                                          NA          NA                     NA          NA
3   augustus_masked-Chr1-processed-gene-0.56-mRNA-1        augustus_masked-Chr1-processed-gene-0.56        0.37       -0.28 0|0|0|0.66|1|1|3|0|142          NA
4                                                NA augustus_masked-Chr1-processed-gene-0.56-mRNA-1          NA          NA                     NA          NA
5                                                NA augustus_masked-Chr1-processed-gene-0.56-mRNA-1          NA          NA                     NA          NA
...                                             ...                                             ...         ...         ...                    ...         ...
8                                                NA augustus_masked-Chr1-processed-gene-0.56-mRNA-1          NA          NA                     NA          NA
9                                                NA augustus_masked-Chr1-processed-gene-0.56-mRNA-1          NA          NA                     NA          NA
10         augustus_masked-Chr1-processed-gene-0.47                                                          NA          NA                     NA          NA
11  augustus_masked-Chr1-processed-gene-0.47-mRNA-1        augustus_masked-Chr1-processed-gene-0.47        0.00       -0.00  0|-1|0|1|-1|1|1|0|187          NA
12                                               NA augustus_masked-Chr1-processed-gene-0.47-mRNA-1          NA          NA                     NA          NA
            Gap           score
    <character> <CharacterList>
1            NA                
2            NA                
3            NA                
4            NA                
5            NA                
...         ...             ...
8            NA                
9            NA                
10           NA                
11           NA                
12           NA

I want a file containing only a certain list of genes, including all the features (mRNA, exons, CDS, etc.) that have the gene or its mRNA as a parent. For example, in the GFF above I would like to keep only the rows related to augustus_masked-Chr1-processed-gene-0.56 and leave out the other gene with its exon and CDS. It should probably be a regex-based filter. Maybe there is a library in R that has an additional "filter" for things like this? Maybe there is something on the command line that I am missing? Ideally, the result would be the full GFF with all non-gene genomic features (repeats and more), from which only the genes that are NOT on my list, together with all their "children", have been filtered out.
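To make the intended result concrete, here is a minimal sketch of such a filter in plain Python (not rtracklayer). It assumes a GFF3 file with tab-separated columns and `ID=`/`Parent=` attributes in column 9, as MAKER writes them. The function names `parse_attributes` and `keep_listed_genes`, the shortened gene IDs, and the `repeat-1` feature are made up for illustration.

```python
def parse_attributes(field):
    """Turn 'ID=x;Parent=y,z' into {'ID': 'x', 'Parent': 'y,z'}."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def keep_listed_genes(gff_lines, wanted_gene_ids):
    """Drop every 'gene' feature not in wanted_gene_ids, plus its descendants.

    Comment lines and features that do not descend from a dropped gene
    (contigs, repeats, ...) are kept unchanged.
    """
    wanted = set(wanted_gene_ids)
    records = []  # (line, feature type, ID, list of parent IDs)
    for raw in gff_lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) < 9:
            records.append((line, None, None, []))
            continue
        attrs = parse_attributes(cols[8])
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        records.append((line, cols[2], attrs.get("ID"), parents))

    # Start with the genes that are not on the list.
    dropped = {fid for _, ftype, fid, _ in records
               if ftype == "gene" and fid is not None and fid not in wanted}

    # Propagate to children until nothing changes, so the result does not
    # depend on parents appearing before their children in the file.
    changed = True
    while changed:
        changed = False
        for _, _, fid, parents in records:
            if fid is not None and fid not in dropped \
                    and any(p in dropped for p in parents):
                dropped.add(fid)
                changed = True

    return [line for line, _, fid, parents in records
            if not (fid in dropped or any(p in dropped for p in parents))]


# Toy input modelled on the table above; IDs are shortened stand-ins and
# the repeat feature is hypothetical.
example = [
    "##gff-version 3",
    "Chr1\t.\tcontig\t1\t57885339\t.\t.\t.\tID=Chr1;Name=Chr1",
    "Chr1\trepeatmasker\tmatch\t100\t200\t.\t+\t.\tID=repeat-1",
    "Chr1\tmaker\tgene\t261435\t262314\t.\t+\t.\tID=gene-0.56",
    "Chr1\tmaker\tmRNA\t261435\t262314\t.\t+\t.\tID=gene-0.56-mRNA-1;Parent=gene-0.56",
    "Chr1\tmaker\texon\t261435\t261545\t.\t+\t.\tID=gene-0.56-mRNA-1:exon:0;Parent=gene-0.56-mRNA-1",
    "Chr1\tmaker\tCDS\t261819\t262079\t.\t+\t0\tID=gene-0.56-mRNA-1:cds;Parent=gene-0.56-mRNA-1",
    "Chr1\tmaker\tgene\t88146\t88709\t.\t+\t.\tID=gene-0.47",
    "Chr1\tmaker\tmRNA\t88146\t88709\t.\t+\t.\tID=gene-0.47-mRNA-1;Parent=gene-0.47",
    "Chr1\tmaker\texon\t88146\t88709\t.\t+\t.\tID=gene-0.47-mRNA-1:exon:3;Parent=gene-0.47-mRNA-1",
]

kept = keep_listed_genes(example, {"gene-0.56"})
for line in kept:
    print(line)
```

Because the filter follows the `Parent` attribute instead of matching text, it does not rely on child IDs containing the gene ID, and it leaves the non-gene features in place in a single pass over the original file.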

UPD. For now, the not very elegant way I have found is to run "grep -f" with the list of genes that I want, which leaves me with a GFF of genes only, and then to add this filtered GFF to the GFF of all the other genomic features. But maybe there is something more elegant?
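One caveat with the "grep -f" route: each line of the list is treated as a pattern that may match anywhere in a GFF line, so an ID that is a prefix of another ID selects both (grep's -F and -w options address part of this). The sketch below illustrates the difference in Python, assuming MAKER-style IDs where children extend the gene ID with a "-mRNA-..." suffix. The helper `build_matcher` and the shortened IDs are hypothetical.

```python
import re


def build_matcher(gene_ids):
    """Compile a regex matching any gene ID as a whole attribute prefix.

    The ID must start right after '=' or ',' and must not be followed by
    a letter, digit, underscore, or dot, so 'gene-0.5' does not match
    inside 'gene-0.56'. re.escape keeps '.' from acting as a wildcard.
    """
    alternatives = "|".join(re.escape(g) for g in sorted(gene_ids))
    return re.compile(r"(?<=[=,])(?:" + alternatives + r")(?![\w.])")


# An mRNA line belonging to gene-0.56 (shortened stand-in ID).
line = ("Chr1\tmaker\tmRNA\t261435\t262314\t.\t+\t.\t"
        "ID=gene-0.56-mRNA-1;Parent=gene-0.56")

# Plain substring matching, which is roughly what a pattern list does.
naive_hit = "gene-0.5" in line

# Boundary-aware matching on the same line.
strict_hit_wrong_gene = build_matcher({"gene-0.5"}).search(line) is not None
strict_hit_right_gene = build_matcher({"gene-0.56"}).search(line) is not None

print(naive_hit, strict_hit_wrong_gene, strict_hit_right_gene)
```

Whether this matters depends on the ID list: with IDs such as gene-0.5 and gene-0.56 on the same contig, an unanchored match would silently keep both genes.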


gff R Maker GFF • 2.3k views

I understand you have a preferred list. In that case, the grep -f solution is, IMHO, already pretty elegant.


Please read Brief Reminder On How To Ask A Good Question and then try to modify your question accordingly. Currently, it is unclear what you are asking.

