I am trying to filter my MAKER annotation output in GFF format. I have already filtered the full set of annotations by AED score and other parameters (e.g. similarity to known proteins of similar origin). I am using readGFF from the rtracklayer library in R to parse and manipulate the file.
Here is what it looks like:
DataFrame with 12 rows and 17 columns
       seqid   source     type     start       end     score      strand     phase ID
    <factor> <factor> <factor> <integer> <integer> <numeric> <character> <integer> <character>
1       Chr1       NA   contig         1  57885339        NA           *        NA Chr1
2       Chr1    maker     gene    261435    262314        NA           +        NA augustus_masked-Chr1-processed-gene-0.56
3       Chr1    maker     mRNA    261435    262314        NA           +        NA augustus_masked-Chr1-processed-gene-0.56-mRNA-1
4       Chr1    maker     exon    261435    261545        NA           +        NA augustus_masked-Chr1-processed-gene-0.56-mRNA-1:exon:0
5       Chr1    maker     exon    261819    262079        NA           +        NA augustus_masked-Chr1-processed-gene-0.56-mRNA-1:exon:1
...      ...      ...      ...       ...       ...       ...         ...       ... ...
8       Chr1    maker      CDS    261819    262079        NA           +         0 augustus_masked-Chr1-processed-gene-0.56-mRNA-1:cds
9       Chr1    maker      CDS    262258    262314        NA           +         0 augustus_masked-Chr1-processed-gene-0.56-mRNA-1:cds
10      Chr1    maker     gene     88146     88709        NA           +        NA augustus_masked-Chr1-processed-gene-0.47
11      Chr1    maker     mRNA     88146     88709        NA           +        NA augustus_masked-Chr1-processed-gene-0.47-mRNA-1
12      Chr1    maker     exon     88146     88709        NA           +        NA augustus_masked-Chr1-processed-gene-0.47-mRNA-1:exon:3

    Name                                            Parent                                          _AED        _eAED       _QI                    Target
    <character>                                     <CharacterList>                                 <character> <character> <character>            <character>
1   Chr1                                                                                            NA          NA          NA                     NA
2   augustus_masked-Chr1-processed-gene-0.56                                                        NA          NA          NA                     NA
3   augustus_masked-Chr1-processed-gene-0.56-mRNA-1 augustus_masked-Chr1-processed-gene-0.56        0.37        -0.28       0|0|0|0.66|1|1|3|0|142 NA
4   NA                                              augustus_masked-Chr1-processed-gene-0.56-mRNA-1 NA          NA          NA                     NA
5   NA                                              augustus_masked-Chr1-processed-gene-0.56-mRNA-1 NA          NA          NA                     NA
... ...                                             ...                                             ...         ...         ...                    ...
8   NA                                              augustus_masked-Chr1-processed-gene-0.56-mRNA-1 NA          NA          NA                     NA
9   NA                                              augustus_masked-Chr1-processed-gene-0.56-mRNA-1 NA          NA          NA                     NA
10  augustus_masked-Chr1-processed-gene-0.47                                                        NA          NA          NA                     NA
11  augustus_masked-Chr1-processed-gene-0.47-mRNA-1 augustus_masked-Chr1-processed-gene-0.47        0.00        -0.00       0|-1|0|1|-1|1|1|0|187  NA
12  NA                                              augustus_masked-Chr1-processed-gene-0.47-mRNA-1 NA          NA          NA                     NA

            Gap           score
    <character> <CharacterList>
1            NA
2            NA
3            NA
4            NA
5            NA
...         ...             ...
8            NA
9            NA
10           NA
11           NA
12           NA
I want a file that contains only a certain list of genes, including all the features (mRNA, exons, CDS etc.) that have each of those genes as a direct or indirect parent. For example, in the GFF above I would like to keep only the rows related to augustus_masked-Chr1-processed-gene-0.56 and drop the other gene together with its mRNA and exon. This probably calls for a regex-based filter. Is there an R library with an additional "filter" for this kind of thing, or a command-line tool that I am missing? Ideally, the result would be the full GFF with all non-gene genomic features (repeats and more) kept as they are, and with the genes that are NOT on my list filtered out together with all their "children".
UPD. For now, the not very elegant way I have found is to run "grep -f" with the list of genes that I want, which leaves me with a GFF of genes only, and then to add this filtered GFF to the GFF with all the other genomic features. But maybe there is something more elegant?
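To make the desired filter concrete, here is a minimal sketch in plain Python (standard library only) of one possible approach. It is not an rtracklayer or MAKER feature. It assumes a tab-separated GFF3 file whose ninth column holds `key=value` attributes with `ID` and `Parent`, and it treats any feature whose ancestry reaches a `gene` feature as part of that gene model. The function names, the demo rows and the repeat feature are hypothetical and only for illustration.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column ("ID=x;Parent=y") into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def filter_gff(lines, keep_genes):
    """Keep listed genes with all their descendants, plus every feature
    that does not belong to any gene (contigs, repeats, comments)."""
    keep_genes = set(keep_genes)
    records = []      # (raw line, feature ID or None, list of parent IDs)
    parents_of = {}   # feature ID -> set of parent IDs
    gene_ids = set()  # IDs of features whose type column is "gene"

    # First pass: record the ID/Parent links so ancestry can be resolved
    # regardless of the order of the rows.
    for line in lines:
        raw = line.rstrip("\n")
        cols = raw.split("\t")
        if raw.startswith("#") or len(cols) != 9:
            records.append((raw, None, []))  # comment or non-feature line
            continue
        attrs = parse_attributes(cols[8])
        fid = attrs.get("ID")
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        if fid is not None:
            parents_of.setdefault(fid, set()).update(parents)
            if cols[2] == "gene":
                gene_ids.add(fid)
        records.append((raw, fid, parents))

    def ancestor_genes(fid, parents):
        """Walk up the Parent links and collect every gene ID reached."""
        found, seen = set(), set()
        stack = list(parents) + ([fid] if fid is not None else [])
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in gene_ids:
                found.add(current)
            stack.extend(parents_of.get(current, ()))
        return found

    # Second pass: drop only the features that belong to unlisted genes.
    kept = []
    for raw, fid, parents in records:
        genes = ancestor_genes(fid, parents)
        if not genes or genes & keep_genes:
            kept.append(raw)
    return kept


def row(*cols):
    return "\t".join(cols)


g56 = "augustus_masked-Chr1-processed-gene-0.56"
g47 = "augustus_masked-Chr1-processed-gene-0.47"
demo = [
    "##gff-version 3",
    row("Chr1", ".", "contig", "1", "57885339", ".", ".", ".", "ID=Chr1;Name=Chr1"),
    row("Chr1", "maker", "gene", "261435", "262314", ".", "+", ".", "ID=" + g56),
    row("Chr1", "maker", "mRNA", "261435", "262314", ".", "+", ".",
        "ID=" + g56 + "-mRNA-1;Parent=" + g56),
    row("Chr1", "maker", "exon", "261435", "261545", ".", "+", ".",
        "ID=" + g56 + "-mRNA-1:exon:0;Parent=" + g56 + "-mRNA-1"),
    row("Chr1", "maker", "gene", "88146", "88709", ".", "+", ".", "ID=" + g47),
    row("Chr1", "maker", "mRNA", "88146", "88709", ".", "+", ".",
        "ID=" + g47 + "-mRNA-1;Parent=" + g47),
    row("Chr1", "maker", "exon", "88146", "88709", ".", "+", ".",
        "ID=" + g47 + "-mRNA-1:exon:3;Parent=" + g47 + "-mRNA-1"),
    # Hypothetical non-gene feature, to show that it survives the filter.
    row("Chr1", "repeatmasker", "match", "500", "900", ".", "+", ".", "ID=demo_repeat_1"),
]

result = filter_gff(demo, {g56})
print("\n".join(result))
```

Because the gene list is matched against exact IDs rather than substrings, an entry such as gene-0.5 would not accidentally pull in gene-0.56, which can happen with a plain "grep -f" unless the patterns are anchored. For a real file, the lines could be read with `open()` and the kept lines written back out in the same order.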