I have a fasta file containing tag pairs (63 identical nucleotides + 1 SNP) like so:
>TP2_query CAGCAAAAAAAAAAAAAAAAAAATGAAACTTCTGAGAGGGAAAAATTACTCAACCTTATATTCT >TP2_hit CAGCAAAAAAAAAAAAAAAAAAATGTAACTTCTGAGAGGGAAAAATTACTCAACCTTATATTCT
I would like to produce 1 sequence per tag pair that is centered on the SNP (can be either allele) and flanked by 50 bp (coming from the reference genome) on either side. I'm thinking of this as a two step process:
(1) to determine the SNP position in the tag pair and the corresponding coordinates in the reference genome
(2) adding flanking sequence from the reference using the coordinates from (1)
Any ideas about (1) would be greatly appreciated.
I found bedtools flank is able to do (2) in the easiest possible way (maybe samtools faidx too). The command needs a "genome" file for the species (Medicago in my case). If I understand correctly, this file is simply tab delimited maximum numeric values of coordinates for each chromosome? I can figure out how to do this with some cheap python or R code but wondering if there's a neat unix solution for this.