I have VCF
file (containing diallelic variants) and reference genome in Fasta
of some non-model plant. I'd like to extract SNPs
flanking sequences. I've found that bedtools
and samtools faidx
could be to some extent useful, but apparently don't solve the issue. I need to get output (fasta
or tabular file) with following SNPs representation :
TCTCTGCCAATCACTAGAGGCCGCTTTCGCTTTTA[A/G]TTTGTGTGTGGTCAGAGTTCTTCCGGACTTT
Here a little python solution:
import pysam
# open vcf file
vcf = pysam.VariantFile("input.vcf")
# open fasta file
genome = pysam.FastaFile("genome.fa")
# define by how many bases the variant should be flanked
flank = 50
# iterate over each variant
for record in vcf:
# extract sequence
#
# The start position is calculated by subtract the number of bases
# given by 'flank' from the variant position. The position in the vcf file
# is 1-based. pysam's fetch() expected 0-base coordinate. That's why we
# need to subtract on more base.
#
# The end position is calculated by adding the number of bases
# given by 'flank' to the variant position. We also need to add the length
# of the REF value and subtract again 1 due to the 0-based/1-based thing.
#
# Now we have the complete sequence like this:
# [number of bases given by flank]+REF+[number of bases given by flank]
seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)
# print out tab seperated columns:
# CRHOM, POS, REF, ALT, flanking sequencing with variant given in the format '[REF/ALT]'
print(
record.chrom,
record.pos,
record.ref,
record.alts[0],
'{}[{}/{}]{}'.format(seq[:flank], record.ref, record.alts[0], seq[flank+len(record.ref):]),
sep="\t"
)
fin swimmer
EDIT:
I rewrote the code. The version before had an off-by-one error.
A one-liner using biostar251649 ( http://lindenb.github.io/jvarkit/Biostar251649.html ) and bioalcidaejdk ( http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html )
$ java -jar dist/biostar251649.jar -R src/test/resources/rotavirus_rf.fa src/test/resources/rotavirus_rf.vcf.gz -n 20 |\
java -jar dist/bioalcidaejdk.jar -F VCF -e 'stream().forEach(V->println(">"+V.getContig()+":"+V.getStart()+"\n"+V.getAttribute("SEQ5_20")+"["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]"+V.getAttribute("SEQ3_20")));'
first command adds two attribute to the VCF containing the flanking sequences.
the second command:
stream()
convert to a stream of variantforEach(V->println(">"
for each variant start writing the fasta headerV.getContig()+":"+V.getStart()+"\n"
write the fasta headerV.getAttribute("SEQ5_20")
write 5' seq["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]"
write all alleles separated with '/'V.getAttribute("SEQ3_20")))
write 3' seq
output:
>RF01:970
TGGTCGATTGCTCTATTGAA[A/C]AATTTCCATTGATGGCTAAA
>RF02:251
TGCTTGAAGTTTTGAAAACA[A/T]AAGAAGAGCACCAAAAAGAG
>RF02:578
TACTATTAAAAGATATGGCA[G/A]TTGAAAATAAAAATTCGAGA
>RF02:877
GACGTTAACTATATACTTAA[T/A]ATGGACAGAAATCTGCCATC
>RF02:1726
AATATGCAACATGTTCAGAC[T/G]TTGACAACAGAAAAATTACA
>RF02:1962
TGAAGATTTTTTAAAAAGAT[TACA/TA]TATTTTCGATGTAGCTAGAG
>RF03:1221
TGACTGCCATGGATTTAGAA[C/G]TACCGATATCTGCGAAATTA
>RF03:1242
TACCGATATCTGCGAAATTA[C/A]TACACCACCCAACTACAGAA
>RF03:1688
TCAAAACCAACTGGAAATAA[T/G]CATCTGTTCATTTTGAGCGG
>RF03:1708
TCATCTGTTCATTTTGAGCG[G/T]TACTAATAAATATTTTAAAT
>RF03:2150
GCCGATGATCCGAACTATTT[T/A]ATTGGAATTAAGTTCAAAAA
>RF03:2201
GAATATGATGTTAAAGTACC[G/C]CATCTTACATTTGGTGTATT
I slightly modified the script provided by finswimmer. Now it's compatible with python 2.X, writes output to the text file instead of STDOUT. Besides, ID of each SNP is present in output file as separate field. Probably, some additional optimisation and syntax polishing are required, but even this version works fine for me.
from __future__ import print_function
import pysam
f = open("SNP_CDS.txt", "w")
vcf = pysam.VariantFile("test.vcf")
genome = pysam.FastaFile("genome.fasta")
L=[]
flank=10
for record in vcf:
seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)
t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])
L.append(t)
for i in L:
seq,chr,pos,id,ref,alt=i
sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])
line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)
f.write(line+'\n')
f.close()
Hello Denis,
Probably, some additional optimisation and syntax polishing are required
if you like, here are my notes about this:
- Use python3 :) Whenever you write a new script with python take python3 instead of python2. No one knows how long python2 will be supported
- In bioinformatics it is common that program writes to stdout by default. Doing so the user can decide whether it should be piped to another program or the output is redirected to a file with
$ script.py > outputfile
- When working with reading/writing files it is a good style to work with the
with
statement. Doing so you don't need to take care of closing files. - Theres no need to first append your data to a list and iterate afterwards over that list to write the output to the file. Just write it directly to it.
fin swimmer
Improved version
from __future__ import print_function
import pysam
vcf = pysam.VariantFile("Sun_CDS.vcf")
genome = pysam.FastaFile("genome.fasta")
flank=62
with open("SNP_CDS.txt", "w") as f:
for record in vcf:
seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)
t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])
seq,chr,pos,id,ref,alt=t
sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])
line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)
f.write(line+'\n')
Hi Denis, I used following code to do exactly what you are doing using pysam. Problem is my code is not very fast for bigger genomes. So I switched to the aforementioned code posted by you. It works when you set a lower flank score. Suppose I need a flanking sequence of 500bp on 5' ......pos......3' and I have first variant at pos [25] it throws an error. Could you suggest me a way to deal with this problem? How can I modify your code that if the variant is a pos 25 and it does not finds a 500 bp flanking sequence at start, it must pick the total bases present at start.
My code is given below. It works for smaller genomes but takes longer for large genomes and It also works for variants having flanking sequence present at start less than the mentioned flank value.
fr=open("test.vcf","r")
a=fr.read()
aa=a.split("#CHROM")
aaa=aa[1].split("\n")
seq=''
header=''
sequence={}
new_dict={}
with open("genome.fasta") as fs:
for linee in fs:
if linee.startswith(">"):
header=linee.split("\n")[0]
seq=''
else:
seq=seq+linee.split("\n")[0]
sequence.update({header:seq})
for line in aaa[1:-1]:
index1 = line.split('\t')[0]
index= line.split()
head=">"+index[0]+" "+index[1]+" "+index[2]+" "+index[3]+" "+index[4]+" "+" "+index[5]+" "+index[6]+" "+index[7]+" "+index[8]+" "+index[9]+" "+index[10]+" "+index[11]+" "+index[12]+" "+" "+index[13]+" "+index[14]+" "+index[15]+" "+index[16]+" "+index[17]
for keys, values in sequence.items():
if index1 in keys:
site=int(index[1])-1
start=abs(site-500) #5' direction
end=site+500 #3' direction
seqq=values[start:end+1]
new_dict.update({head:seqq})
print(new_dict)
fw=open("sequences_per_variants.txt","w")
for base,basee in new_dict.items():
fw.write(str(base)+" "+str(basee)+"\n")
fw.close()