What I have: I have analyzed a set of WES data with VarScan 2. I now have a somatic variant call .csv with "chrom", "position", "ref", "var", etc. as column headers. E.g.:
chrom       position  ref  var  normal_reads1
CM000994.2  3681266   T    G    229
CM000994.2  6558171   A    C    11
. . .
What I need: I am trying to (1) get the reference nucleotide at a position as well as the flanking nucleotides (i.e. n–1 and n+1), and then (2) concatenate them into a trinucleotide string for each variant (e.g. "ATG", where ref="T", n–1="A", n+1="G").
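To make the target output concrete, here is a minimal base R sketch on a made-up sequence. The names `toy_ref`, `trinucleotide`, and `variants` are hypothetical and only for illustration; a real solution would pull the bases from the mm10 reference rather than from a toy string.

```r
# Toy reference sequence standing in for one chromosome (made up for illustration).
toy_ref <- "GGATGCCA"

# Hypothetical helper: return the n-1, n, n+1 bases around a 1-based position.
trinucleotide <- function(ref_seq, pos) {
  # The position needs a neighbour on both sides.
  stopifnot(pos > 1, pos < nchar(ref_seq))
  substr(ref_seq, pos - 1, pos + 1)
}

# Toy variant table with the same kind of columns as the VarScan 2 output.
variants <- data.frame(
  position = c(4, 6),
  ref      = c("T", "C"),
  stringsAsFactors = FALSE
)

# One trinucleotide string per variant.
variants$context <- vapply(
  variants$position,
  function(p) trinucleotide(toy_ref, p),
  character(1)
)

# Sanity check: the middle base of each context should equal the ref allele.
stopifnot(all(substr(variants$context, 2, 2) == variants$ref))
```

With this toy input, the first variant gets the context "ATG", which is the kind of string I want for every row of the real table.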
My problem: The VarScan 2 output has, as far as I can tell, GenBank accession IDs as values under the "chrom" column (e.g. CM000994.2). I am working with mouse build mm10 and have set that as my biomaRt dataset. I am trying to use biomaRt to get the reference nucleotide, the n–1 nucleotide, and the n+1 nucleotide. However, I can't figure out how to use a GenBank accession ID as a query input for biomaRt. Here is an example of what I am getting:
> getSequence(chromosome = CM000994.2, start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = 'cdna', type = chromosome_name)
Error in getBM(c(seqType, type), filters = c("chromosome_name", "start", : object 'CM000994.2' not found
How do I either (1) access nucleotide positions using a GenBank chromosome accession ID and a position, or (2) convert a GenBank chromosome accession ID to a biomaRt-usable chromosome label? Any help would be appreciated.
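For option (2), the kind of conversion I have in mind could be sketched as a plain lookup table in base R. The `acc_to_chrom` vector is an assumption on my part: I believe CM000994.2 is chromosome 1 of GRCm38/mm10, but that pairing (and the rest of the table) would need to be checked against the NCBI assembly report before use.

```r
# ASSUMPTION: accession -> chromosome name pairs; verify against the GRCm38
# assembly report. Only the accession from my example is filled in here.
acc_to_chrom <- c("CM000994.2" = "1")

# The two example rows from my VarScan 2 output.
calls <- data.frame(
  chrom    = c("CM000994.2", "CM000994.2"),
  position = c(3681266, 6558171),
  ref      = c("T", "A"),
  stringsAsFactors = FALSE
)

# Ensembl-style label (e.g. "1"), which is what biomaRt's chromosome_name filter uses.
calls$chrom_name <- unname(acc_to_chrom[calls$chrom])

# UCSC-style label (e.g. "chr1"), in case a UCSC-named genome package is used instead.
calls$ucsc_name <- paste0("chr", calls$chrom_name)

# Any accession missing from the lookup table shows up as NA and should be fixed first.
stopifnot(!anyNA(calls$chrom_name))
```

Note that the accession is passed around as a quoted string here; in my `getSequence` call above it was unquoted, which is why R reports the object as not found.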
EDIT: Should I just use Biostrings + BSgenome + GenomicRanges for this?