I have a dataframe showing the sample number, the chromosome and genomic position of SNVs:
sample chrom pos
1 1 2L 544805
2 1 2L 621152
3 1 2L 639908
4 1 2L 1017443
5 1 2L 1189982
6 1 2L 1678066
And a genome annotation file in .gtf format:
X FlyBase gene 19961297 19969323 . + . gene_id FBgn0031081; gene_symbol Nep3;
1 X FlyBase mRNA 19961689 19968479 15 + . gene_id FBgn0031081; gene_symbol Nep3; transcript_id FBtr0070000; transcript_symbol Nep3-RA;
2 X FlyBase 5UTR 19961689 19961845 15 + . gene_id FBgn0031081; gene_symbol Nep3; transcript_id FBtr0070000; transcript_symbol Nep3-RA;
3 X FlyBase exon 19961689 19961845 15 + . gene_id FBgn0031081; gene_symbol Nep3; transcript_id FBtr0070000; transcript_symbol Nep3-RA;
4 X FlyBase exon 19963955 19964071 15 + . gene_id FBgn0031081; gene_symbol Nep3; transcript_id FBtr0070000; transcript_symbol Nep3-RA;
5 X FlyBase exon 19964782 19964944 15 + . gene_id FBgn0031081; gene_symbol Nep3; transcript_id FBtr0070000; transcript_symbol Nep3-RA;
6 X FlyBase exon 19965006 19965126 15 + . gene_id FBgn0031081; gene_symbol Nep3; transcript_id FBtr0070000; transcript_symbol Nep3-RA;
I want to find the smallest feature that each SNV is in.
For example, if an SNV is contained in an 'exon' (and therefore also in a 'gene' and an 'mRNA'), I would want to report that SNV as being in an exon.
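To make the rule concrete, here is a minimal sketch of the logic in plain Python. It is not an R solution and not how any particular package does it. The feature coordinates are copied from the GTF excerpt above; the SNV positions on X, the function name smallest_feature and the alphabetical tie-break are assumptions made purely for illustration.

```python
from collections import Counter

# (chrom, feature type, start, end), copied from the GTF excerpt above.
features = [
    ("X", "gene", 19961297, 19969323),
    ("X", "mRNA", 19961689, 19968479),
    ("X", "5UTR", 19961689, 19961845),
    ("X", "exon", 19961689, 19961845),
    ("X", "exon", 19963955, 19964071),
]

# (sample, chrom, pos). The X positions are invented for illustration;
# the 2L position is the first row of the SNV table.
snvs = [
    (1, "X", 19961700),  # inside 5UTR/exon, mRNA and gene
    (1, "X", 19962000),  # inside mRNA and gene only
    (1, "X", 19961300),  # inside gene only
    (1, "2L", 544805),   # no feature on 2L in this excerpt
]

def smallest_feature(chrom, pos, features):
    """Return the type of the shortest feature containing pos, or None.

    Hypothetical helper. Features of equal width (here the 5UTR and the
    first exon) are tie-broken alphabetically by type, which is arbitrary.
    """
    hits = [
        (end - start + 1, ftype)
        for fchrom, ftype, start, end in features
        # GTF coordinates are 1-based and inclusive at both ends.
        if fchrom == chrom and start <= pos <= end
    ]
    return min(hits)[1] if hits else None

# Tally how many SNVs fall in each smallest feature type.
counts = Counter(smallest_feature(chrom, pos, features) for _, chrom, pos in snvs)
```

Under this rule an SNV whose smallest feature is 'mRNA' lies inside a transcript but outside every annotated exon and UTR, which is how I would expect intronic SNVs to show up, since introns have no lines of their own in the excerpt above.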
I would also like to be able to test for enrichment of SNVs in particular genomic features, and to collect some basic statistics about the dataset (e.g. how many SNVs hit exons, introns, etc.).
Is there a package in R that can be used for this sort of thing?
I've had a look at GenomicRanges and VariantAnnotation, but I can't see a clear way of achieving this.
Any help would be appreciated.