I have a confounding issue. I am extracting the upstream regions of genes from 100 different A. thaliana genomes. The problem is that only the TAIR10 reference genome has an annotation (GTF/GFF); the remaining genomes are consensus builds generated from VCF files and have no annotation of their own:
cat Arabidopsis_thaliana.TAIR10.55.dna.toplevel.fa | vcf-consensus some.vcf.gz > vcf.fa
I extracted the upstream regions of some target genes from the reference genome using RSAT:
rsat retrieve-seq -org Arabidopsis_thaliana.TAIR10.55 -feattype gene -type upstream -format fasta -label id,name -from -2000 -to -1 -noorf -i Genes.txt -o ups.fa
Next, using the upstream coordinates from the step above, I extracted the corresponding sequences from the other (consensus-build) genomes. However, when I compare the consensus-extracted upstream sequences with the reference upstream sequences and with the variant positions in the original VCFs, they do not match up. I suspect this is because indels in the VCF shift the coordinates of the consensus relative to the reference. I am looking for suggestions or methods to extract the regions that correspond to the reference upstream sequences (with the alternate alleles, including insertions, applied) from the VCF-derived genomes.
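To illustrate the shift I suspect, here is a minimal Python sketch of how a reference coordinate would move once the indels before it are applied. The function names (`build_shift_table`, `ref_to_consensus`, `lift_interval`) and the toy variants are hypothetical, not taken from my data. It assumes a single chromosome, that every VCF record is applied to the consensus, and that records do not overlap. Positions that fall inside a deleted span have no true equivalent and are not handled specially.

```python
from bisect import bisect_right
from itertools import accumulate


def build_shift_table(variants):
    """Summarise the indels of one chromosome.

    `variants` is an iterable of (pos, ref, alt) with 1-based VCF positions.
    Assumption: every variant is applied to the consensus and none overlap.
    Returns (ends, shifts): the last reference base covered by each indel,
    and the cumulative length change up to and including that indel.
    """
    indels = sorted(
        (pos + len(ref) - 1, len(alt) - len(ref))
        for pos, ref, alt in variants
        if len(ref) != len(alt)  # SNPs do not move coordinates
    )
    ends = [end for end, _ in indels]
    shifts = list(accumulate(delta for _, delta in indels))
    return ends, shifts


def ref_to_consensus(pos, table):
    """Map a 1-based reference position to its consensus position."""
    ends, shifts = table
    # Count the indels whose REF allele ends strictly before `pos`.
    n = bisect_right(ends, pos - 1)
    return pos + (shifts[n - 1] if n else 0)


def lift_interval(start, end, table):
    """Map a reference interval (e.g. an upstream region) to the consensus."""
    return ref_to_consensus(start, table), ref_to_consensus(end, table)


# Toy example: a 2 bp insertion at 100, a SNP at 150, a 2 bp deletion at 200.
variants = [(100, "A", "ATT"), (150, "C", "T"), (200, "ACG", "A")]
table = build_shift_table(variants)
print(lift_interval(101, 199, table))
```

In this toy case, a region that starts just after the insertion begins 2 bp later in the consensus, so slicing the consensus with the unshifted reference coordinates would return the wrong bases.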
Any and all help is highly appreciated. Thank you!