Question

Extracting regions around sites in VCF file and writing them as fasta sequences/files

0

8 months ago

Aiswarya ▴ 10

Hi,

I am trying to extract the 50 bp before and the 50 bp after each coordinate in my VCF files and write the resulting sequences to a file in FASTA format. I have both the BAM files and the reference genome.

Can someone help me with this?

Thanks in advance

VCF fasta • 1.4k views

ADD COMMENT • link updated 7 months ago by GenoMax 142k • written 8 months ago by Aiswarya ▴ 10

0

~~See the answers here: https://www.biostars.org/p/46331/~~

My mistake. Will leave this here since the comment below is nested.

ADD REPLY • link 8 months ago by GenoMax 142k

0

Isn't she saying that her coordinates are in a VCF file and the sequence is in a BAM file or the reference genome? If so, it can be done like this using bedtools.

#Create fasta from bam file.
samtools bam2fq my.bam | seqtk seq -A > my.fasta

cat my.fasta
>gene1
ATGCGCTCGCTGATCGATCATCGATCATCATCGATCGATCGA

#prepare bed file using site from your vcf file
cat my.bed 
gene1   10  15  gene1_subset

bedtools getfasta -fi my.fasta -bed my.bed -name
#index file my.fasta.fai not found, generating...
>gene1_subset::gene1:10-15
TGATC

ADD REPLY • link 8 months ago by bk11 ★ 2.4k
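
As a sanity check on the coordinates in the example above: BED intervals are 0-based and half-open, so `10 15` covers the 11th through 15th bases. A minimal sketch with plain awk (no bedtools involved; the function name `bed_substr` is made up for illustration) reproduces the same slice from the example sequence:

```shell
# bed_substr SEQ START END: print the bases covered by a 0-based, half-open
# BED interval [START, END). Hypothetical helper, for illustration only.
bed_substr() {
    awk -v seq="$1" -v s="$2" -v e="$3" 'BEGIN {
        # awk substr() is 1-based, so shift the BED start by one
        print substr(seq, s + 1, e - s)
    }'
}

bed_substr ATGCGCTCGCTGATCGATCATCGATCATCATCGATCGATCGA 10 15
```

This prints TGATC, the same five bases that getfasta reports for gene1:10-15.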

0

It is unclear whether OP wants the region from the FASTQ reads/alignments, or from the original reference or a new consensus.

ADD REPLY • link 8 months ago by GenoMax 142k

GenoMax · Answer 1 · 2023-08-29

1

8 months ago

Pierre Lindenbaum 161k

bcftools query -f '%CHROM\t%POS0\t%END\n' rotavirus_rf.vcf.gz |\
bedtools slop -b 50 -g rotavirus_rf.fa.fai  |\
bedtools getfasta -fi rotavirus_rf.fa -bed -

ADD COMMENT • link 8 months ago by Pierre Lindenbaum 161k
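
To make the coordinate arithmetic in this pipeline explicit, here is a rough awk-only sketch of what the `bcftools query | bedtools slop` steps compute. It assumes an uncompressed VCF whose records have no INFO/END tag (so the interval ends at POS plus the REF length minus one); the function name `vcf_flank_bed` and the file names are placeholders, and the real tools handle more cases than this does.

```shell
# vcf_flank_bed FAI VCF FLANK: print BED windows of FLANK bp on each side of
# every VCF record, clamped to [0, chromosome length].
vcf_flank_bed() {
    awk -v flank="$3" 'BEGIN { FS = OFS = "\t" }
        NR == FNR { len[$1] = $2; next }   # first file: .fai (name, length, ...)
        /^#/ { next }                      # skip VCF header lines
        {
            start = $2 - 1 - flank              # POS is 1-based; BED start is 0-based
            end = $2 - 1 + length($4) + flank   # end of the REF allele plus the flank
            if (start < 0) start = 0
            if (($1 in len) && end > len[$1]) end = len[$1]
            print $1, start, end
        }' "$1" "$2"
}

# Toy input (assumed data): one 100 bp contig and two SNPs near its ends.
tmp=$(mktemp -d)
printf 'chr1\t100\t6\t60\t61\n' > "$tmp/ref.fa.fai"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmp/in.vcf"
printf 'chr1\t10\t.\tA\tG\t.\t.\t.\nchr1\t90\t.\tC\tT\t.\t.\t.\n' >> "$tmp/in.vcf"
vcf_flank_bed "$tmp/ref.fa.fai" "$tmp/in.vcf" 50
rm -rf "$tmp"
```

With the toy input, the two windows come out as `chr1 0 60` and `chr1 39 100`: both are clipped at a contig boundary, which is the reason slop needs the chromosome lengths.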

0

@Pierre does bedtools slop take the genome.fa.fai file rather than genome.fa itself?

ADD REPLY • link 8 months ago by GenoMax 142k

1

The .fai file. It wants a tab-delimited file of chrom(tab)length so that it knows the maximum size of each chromosome.

ADD REPLY • link 8 months ago by Pierre Lindenbaum 161k
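
For anyone wondering what that chrom(tab)length information looks like: the first two columns of a `.fai` index are the sequence name and its length, and that is the part `bedtools slop -g` uses. As an illustration only (normally `samtools faidx` writes the index for you), this awk sketch derives the same two columns from a FASTA file. `fasta_sizes` is a made-up name, and the input is toy data.

```shell
# fasta_sizes FASTA: print "name<TAB>length" for every record, i.e. the first
# two columns of a .fai index. Handles sequences wrapped over several lines.
fasta_sizes() {
    awk 'BEGIN { OFS = "\t" }
        /^>/ {
            if (name != "") print name, n
            name = substr($1, 2)   # header up to the first whitespace, minus ">"
            n = 0
            next
        }
        { n += length($0) }        # add up the bases on each sequence line
        END { if (name != "") print name, n }' "$1"
}

# Toy FASTA (assumed data): gene1 is wrapped over two lines.
fa=$(mktemp)
printf '>gene1 demo\nATGCGCTCGC\nTGATC\n>gene2\nACGT\n' > "$fa"
fasta_sizes "$fa"
rm -f "$fa"
```

For the toy file this lists gene1 with length 15 and gene2 with length 4.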

0

That worked perfectly!

Thank you very much.

ADD REPLY • link 8 months ago by Aiswarya ▴ 10

0

Please accept the answer so the question is marked solved on the website. To do that, click on the green check mark on the left side of the answer.

ADD REPLY • link 8 months ago by Pierre Lindenbaum 161k

0

Hello,

I tried the command cited in this post, and it works for me.

bcftools query -f '%CHROM\t%POS0\t%END\n' variation_sorted.vcf |\
bedtools slop -b 150 -g Equus_caballus.EquCab3.0.dna.toplevel.fa.fai |\
bedtools getfasta -fi Equus_caballus.EquCab3.0.dna.toplevel.fa -bed upflank.bed -tab -fo test.txt -name

However, I also need the polymorphisms in the output. Is there any way to get the final sequence with each polymorphism written as an IUPAC code?

E.g: ATGGTGATGGGAGGGCACGTGGACCGACGGGTGAACAGCTCTGTGACCGTGGGGCCAACGCTCTCGGGTGAGGCCCTGCCAAGGGGGCGAAACRCTGCCCGSACWGTSCGGGCAGTGGTGGTGAGSCCYCAGGCTGAGGGCTCACCCAGC[C/T] GCAGTCAGGCCCTGGAGCTGCTAAGTAGCCTGGTGCCTGCTGAGCGTAGCCCACCTACYGGCCRGCTTCCTAGGCCCATGGCTGTTGTGCCAAGGAGTCCAGGTCTGGGTCGCTCAGTARGTGAAGCCCTGGGGCAGCTACCTGAGACAG

Help please

ADD REPLY • link updated 7 months ago by GenoMax 142k • written 7 months ago by estpallag ▴ 10

0

a pipe '|' is missing before the first '\'

ADD REPLY • link 7 months ago by Pierre Lindenbaum 161k

0

Hi. You are right. I made it work, but I still do not have what I need: my flanked variants with the polymorphism in IUPAC code. Ideas?

ADD REPLY • link 7 months ago by estpallag ▴ 10
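
One possible direction for the IUPAC part, offered as a sketch rather than a tested solution: `bcftools consensus` has an `--iupac-codes` option that may be worth trying on the extracted regions. For a single biallelic SNP the substitution itself is simple enough to show in awk. The helper below is hypothetical (`iupac_snp` is not part of any tool), handles only single-base uppercase REF/ALT alleles, and expects the SNP's 1-based offset within the flank sequence.

```shell
# iupac_snp SEQ OFFSET REF ALT: replace the base at 1-based OFFSET in SEQ with
# the IUPAC ambiguity code for REF/ALT (single-base alleles only).
iupac_snp() {
    awk -v seq="$1" -v pos="$2" -v ref="$3" -v alt="$4" 'BEGIN {
        # two-base IUPAC codes, keyed by the alleles in alphabetical order
        code["AC"] = "M"; code["AG"] = "R"; code["AT"] = "W"
        code["CG"] = "S"; code["CT"] = "Y"; code["GT"] = "K"
        key = (ref < alt) ? (ref alt) : (alt ref)
        # refuse to edit if the sequence disagrees with REF or the pair is unknown
        if (substr(seq, pos, 1) != ref || !(key in code)) {
            print "REF mismatch or unsupported alleles" > "/dev/stderr"
            exit 1
        }
        print substr(seq, 1, pos - 1) code[key] substr(seq, pos + 1)
    }'
}

# Toy example (assumed data): a T>C SNP at the 4th base of the sequence.
iupac_snp ACGTACGT 4 T C
```

The toy call prints ACGYACGT, with Y standing for C/T as in the [C/T] example above. Indels, multi-allelic sites, and several SNPs inside one flank would need more than this.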