Question

Finding sequences in unannotated genomes using reference coordinates

0

Entering edit mode

7 months ago

Prangan ▴ 20

Hey Stars!

I have a really confounding issue at hand. I am working on extracting upstream regions of genes from 100 different genomes of A. thaliana. The problem being, I have one reference genome for TAIR10 version (which has an annotated GTF/GFF) and the rest of the genomes I have are consensus-builds from VCF files (having no annotation data available):

cat Arabidopsis_thaliana.TAIR10.55.dna.toplevel.fa | vcf-consensus some.vcf.gz > vcf.fa

I have extracted the upstream regions of some target genes from the reference genome using RSAT

rsat retrieve-seq -org Arabidopsis_thaliana.TAIR10.55 -feattype gene -type upstream -format fasta -label id,name -from -2000 -to -1 -noorf -i Genes.txt -o ups.fa

Next, using the upstream coordinates from the above step, I extracted the sequences from the rest of the genomes (consensus-builds). But, now that I am comparing the consensus-extracted upstream sequences with the reference-upstream sequences and their respective positions in the original VCFs, they do not match up. I think this may be due to indels in the VCF. I am looking for any suggestions/methods to extract the reference upstream sequences (with alternate allele insertions) from the VCF genomes.

Any and all help is highly appreciated. Thank you!

consensus VCF • 655 views

ADD COMMENT • link 6 months ago by Prangan ▴ 20

1

Entering edit mode

Please review all your previous questions and comment or validate them (green tick on the left) : How reproducible is transcript quantification through salmon? Generate hashes for all sequences in a FASTA file FASTA to GTF/GFF using reference genome How to export stdout of bash script to a text file

ADD REPLY • link 7 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

All done! Thanks.

ADD REPLY • link 7 months ago by Prangan ▴ 20

1

Entering edit mode

related: Vcf Locations In Consensus Sequence

ADD REPLY • link 7 months ago by Pierre Lindenbaum 161k

score 0 · Answer 1 · 2023-10-04

I may have stumbled on a solution for this problem. What I wanted was the SNPs or mutations in the VCF to be incorporated in the reference. Since I already had my regions of interest (upstream sequences, determined by start and end coordinates), indels in the overall genome (except if present in these sequences) were of no interest to me. But creating a whole genome consensus was leading to indels being incorporated in the genome and the coordinates of my upstream sequences no longer corresponded to their respective sequences in the genome. To solve this, I specifically built consensus of my target regions (using coordinates) by the following method:

coordinates = txt file containing my coordinates in the form:
1:7753520-7754280
1:27548283-27549020
1:25493078-25493192

vcf = tabix indexed gzipped vcf file (.gz)

samtools faidx $reference_genome `cat $coordinates` | bcftools consensus $vcf >> results.fa

I hope this helps someone. If anybody has any better way to do this, please let me know. Cheers!