Extract SNPs flanking sequences based on VCF and genome Fasta files
4
2
Entering edit mode
4.0 years ago
Denis ▴ 230

I have VCF file (containing diallelic variants) and reference genome in Fasta of some non-model plant. I'd like to extract SNPs flanking sequences. I've found that bedtools and samtools faidx could be to some extent useful, but apparently don't solve the issue. I need to get output (fasta or tabular file) with following SNPs representation : TCTCTGCCAATCACTAGAGGCCGCTTTCGCTTTTA[A/G]TTTGTGTGTGGTCAGAGTTCTTCCGGACTTT

snp sequence genome • 4.7k views
5
Entering edit mode
4.0 years ago

Here a little python solution:

import pysam

# open vcf file
vcf = pysam.VariantFile("input.vcf")
# open fasta file
genome = pysam.FastaFile("genome.fa")
# define by how many bases the variant should be flanked
flank = 50

# iterate over each variant
for record in vcf:
# extract sequence
#
# The start position is calculated by subtract the number of bases
# given by 'flank' from the variant position. The position in the vcf file
# is 1-based. pysam's fetch() expected 0-base coordinate. That's why we
# need to subtract on more base.
#
# The end position is calculated by adding the number of bases
# given by 'flank' to the variant position. We also need to add the length
# of the REF value and subtract again 1 due to the 0-based/1-based thing.
#
# Now we have the complete sequence like this:
# [number of bases given by flank]+REF+[number of bases given by flank]
seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)

# print out tab seperated columns:
# CRHOM, POS, REF, ALT, flanking sequencing with variant given in the format '[REF/ALT]'
print(
record.chrom,
record.pos,
record.ref,
record.alts[0],
'{}[{}/{}]{}'.format(seq[:flank], record.ref, record.alts[0], seq[flank+len(record.ref):]),
sep="\t"
)


fin swimmer

EDIT:

I rewrote the code. The version before had an off-by-one error.

0
Entering edit mode

@finswimmer Hi! Many thanks. I need some time for the code understanding.

1
Entering edit mode

2
Entering edit mode
4.0 years ago

A one-liner using biostar251649 ( http://lindenb.github.io/jvarkit/Biostar251649.html ) and bioalcidaejdk ( http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html )

    $java -jar dist/biostar251649.jar -R src/test/resources/rotavirus_rf.fa src/test/resources/rotavirus_rf.vcf.gz -n 20 |\ java -jar dist/bioalcidaejdk.jar -F VCF -e 'stream().forEach(V->println(">"+V.getContig()+":"+V.getStart()+"\n"+V.getAttribute("SEQ5_20")+"["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]"+V.getAttribute("SEQ3_20")));'  first command adds two attribute to the VCF containing the flanking sequences. the second command: • stream() convert to a stream of variant • forEach(V->println(">" for each variant start writing the fasta header • V.getContig()+":"+V.getStart()+"\n" write the fasta header • V.getAttribute("SEQ5_20") write 5' seq • ["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]" write all alleles separated with '/' • V.getAttribute("SEQ3_20"))) write 3' seq output: >RF01:970 TGGTCGATTGCTCTATTGAA[A/C]AATTTCCATTGATGGCTAAA >RF02:251 TGCTTGAAGTTTTGAAAACA[A/T]AAGAAGAGCACCAAAAAGAG >RF02:578 TACTATTAAAAGATATGGCA[G/A]TTGAAAATAAAAATTCGAGA >RF02:877 GACGTTAACTATATACTTAA[T/A]ATGGACAGAAATCTGCCATC >RF02:1726 AATATGCAACATGTTCAGAC[T/G]TTGACAACAGAAAAATTACA >RF02:1962 TGAAGATTTTTTAAAAAGAT[TACA/TA]TATTTTCGATGTAGCTAGAG >RF03:1221 TGACTGCCATGGATTTAGAA[C/G]TACCGATATCTGCGAAATTA >RF03:1242 TACCGATATCTGCGAAATTA[C/A]TACACCACCCAACTACAGAA >RF03:1688 TCAAAACCAACTGGAAATAA[T/G]CATCTGTTCATTTTGAGCGG >RF03:1708 TCATCTGTTCATTTTGAGCG[G/T]TACTAATAAATATTTTAAAT >RF03:2150 GCCGATGATCCGAACTATTT[T/A]ATTGGAATTAAGTTCAAAAA >RF03:2201 GAATATGATGTTAAAGTACC[G/C]CATCTTACATTTGGTGTATT  ADD COMMENT 0 Entering edit mode 4.0 years ago Denis ▴ 230 I slightly modified the script provided by finswimmer. Now it's compatible with python 2.X, writes output to the text file instead of STDOUT. Besides, ID of each SNP is present in output file as separate field. Probably, some additional optimisation and syntax polishing are required, but even this version works fine for me. from __future__ import print_function import pysam f = open("SNP_CDS.txt", "w") vcf = pysam.VariantFile("test.vcf") genome = pysam.FastaFile("genome.fasta") L=[] flank=10 for record in vcf: seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank) t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0]) L.append(t) for i in L: seq,chr,pos,id,ref,alt=i sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):]) line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence) f.write(line+'\n') f.close()  ADD COMMENT 2 Entering edit mode Hello Denis, Probably, some additional optimisation and syntax polishing are required if you like, here are my notes about this: 1. Use python3 :) Whenever you write a new script with python take python3 instead of python2. No one knows how long python2 will be supported 2. In bioinformatics it is common that program writes to stdout by default. Doing so the user can decide whether it should be piped to another program or the output is redirected to a file with $ script.py > outputfile
3. When working with reading/writing files it is a good style to work with the with statement. Doing so you don't need to take care of closing files.
4. Theres no need to first append your data to a list and iterate afterwards over that list to write the output to the file. Just write it directly to it.

fin swimmer

0
Entering edit mode

Hello finswimmer! Agree! Many thanks for your comments! They are extremely useful. I've changed the code (added with statement). Please see the newer version below. I'm wondering how can i adjust the code regarding your the 4-th point? Could you do it please?

0
Entering edit mode
4.0 years ago
Denis ▴ 230

Improved version

from __future__ import print_function
import pysam

vcf = pysam.VariantFile("Sun_CDS.vcf")

genome = pysam.FastaFile("genome.fasta")

flank=62

with open("SNP_CDS.txt", "w") as f:

for record in vcf:

seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)

t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])

seq,chr,pos,id,ref,alt=t

sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])

line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)

f.write(line+'\n')

0
Entering edit mode

Hi Denis, I used following code to do exactly what you are doing using pysam. Problem is my code is not very fast for bigger genomes. So I switched to the aforementioned code posted by you. It works when you set a lower flank score. Suppose I need a flanking sequence of 500bp on 5' ......pos......3' and I have first variant at pos [25] it throws an error. Could you suggest me a way to deal with this problem? How can I modify your code that if the variant is a pos 25 and it does not finds a 500 bp flanking sequence at start, it must pick the total bases present at start.

My code is given below. It works for smaller genomes but takes longer for large genomes and It also works for variants having flanking sequence present at start less than the mentioned flank value.

fr=open("test.vcf","r")
aa=a.split("#CHROM")
aaa=aa[1].split("\n")
seq=''
sequence={}
new_dict={}
with open("genome.fasta") as fs:

for linee in fs:

if linee.startswith(">"):

seq=''

else:
seq=seq+linee.split("\n")[0]

for line in aaa[1:-1]:
index1 = line.split('\t')[0]
index= line.split()

head=">"+index[0]+" "+index[1]+" "+index[2]+" "+index[3]+" "+index[4]+" "+" "+index[5]+" "+index[6]+" "+index[7]+" "+index[8]+" "+index[9]+" "+index[10]+" "+index[11]+" "+index[12]+" "+" "+index[13]+" "+index[14]+" "+index[15]+" "+index[16]+" "+index[17]

for keys, values in sequence.items():
if index1 in keys:
site=int(index[1])-1

start=abs(site-500)  #5' direction
end=site+500   #3' direction
seqq=values[start:end+1]
print(new_dict)

fw=open("sequences_per_variants.txt","w")
for base,basee in new_dict.items():
fw.write(str(base)+" "+str(basee)+"\n")
fw.close()

0
Entering edit mode

Hi, It seems that you have to add additional block with if statement to check if record.pos-1-flank is greater than 0. Perhaps the same aproach and for record.pos-1+len(record.ref)+flank it should be smaler or at least equal than chromosome length. But i'm not sure as i din't run it.