Finding sequences in unannotated genomes using reference coordinates
1
0
Entering edit mode
7 months ago
Prangan ▴ 20

Hey Stars!

I have a really confounding issue at hand. I am working on extracting upstream regions of genes from 100 different genomes of A. thaliana. The problem being, I have one reference genome for TAIR10 version (which has an annotated GTF/GFF) and the rest of the genomes I have are consensus-builds from VCF files (having no annotation data available):

cat Arabidopsis_thaliana.TAIR10.55.dna.toplevel.fa | vcf-consensus some.vcf.gz > vcf.fa

I have extracted the upstream regions of some target genes from the reference genome using RSAT

rsat retrieve-seq -org Arabidopsis_thaliana.TAIR10.55 -feattype gene -type upstream -format fasta -label id,name -from -2000 -to -1 -noorf -i Genes.txt -o ups.fa

Next, using the upstream coordinates from the above step, I extracted the sequences from the rest of the genomes (consensus-builds). But, now that I am comparing the consensus-extracted upstream sequences with the reference-upstream sequences and their respective positions in the original VCFs, they do not match up. I think this may be due to indels in the VCF. I am looking for any suggestions/methods to extract the reference upstream sequences (with alternate allele insertions) from the VCF genomes.

Any and all help is highly appreciated. Thank you!

consensus VCF • 655 views
ADD COMMENT
0
Entering edit mode

All done! Thanks.

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode
6 months ago
Prangan ▴ 20

I may have stumbled on a solution for this problem. What I wanted was the SNPs or mutations in the VCF to be incorporated in the reference. Since I already had my regions of interest (upstream sequences, determined by start and end coordinates), indels in the overall genome (except if present in these sequences) were of no interest to me. But creating a whole genome consensus was leading to indels being incorporated in the genome and the coordinates of my upstream sequences no longer corresponded to their respective sequences in the genome. To solve this, I specifically built consensus of my target regions (using coordinates) by the following method:

coordinates = txt file containing my coordinates in the form:
1:7753520-7754280
1:27548283-27549020
1:25493078-25493192

vcf = tabix indexed gzipped vcf file (.gz)

samtools faidx $reference_genome `cat $coordinates` | bcftools consensus $vcf >> results.fa

I hope this helps someone. If anybody has any better way to do this, please let me know. Cheers!

ADD COMMENT

Login before adding your answer.

Traffic: 1518 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6