Question: Extract SNPs flanking sequences based on VCF and genome Fasta files
2
gravatar for Denis
24 months ago by
Denis190
Denis190 wrote:

I have VCF file (containing diallelic variants) and reference genome in Fasta of some non-model plant. I'd like to extract SNPs flanking sequences. I've found that bedtools and samtools faidx could be to some extent useful, but apparently don't solve the issue. I need to get output (fasta or tabular file) with following SNPs representation : TCTCTGCCAATCACTAGAGGCCGCTTTCGCTTTTA[A/G]TTTGTGTGTGGTCAGAGTTCTTCCGGACTTT

snp sequence genome • 2.3k views
ADD COMMENTlink modified 23 months ago by Pierre Lindenbaum129k • written 24 months ago by Denis190
5
gravatar for finswimmer
24 months ago by
finswimmer13k
Germany
finswimmer13k wrote:

Here a little python solution:

import pysam


# open vcf file
vcf = pysam.VariantFile("input.vcf")
# open fasta file
genome = pysam.FastaFile("genome.fa")
# define by how many bases the variant should be flanked
flank = 50

# iterate over each variant
for record in vcf:
    # extract sequence
    #
    # The start position is calculated by subtract the number of bases
    # given by 'flank' from the variant position. The position in the vcf file
    # is 1-based. pysam's fetch() expected 0-base coordinate. That's why we
    # need to subtract on more base.
    #
    # The end position is calculated by adding the number of bases
    # given by 'flank' to the variant position. We also need to add the length
    # of the REF value and subtract again 1 due to the 0-based/1-based thing.
    #
    # Now we have the complete sequence like this:
    # [number of bases given by flank]+REF+[number of bases given by flank]
    seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)

    # print out tab seperated columns:
    # CRHOM, POS, REF, ALT, flanking sequencing with variant given in the format '[REF/ALT]'
    print(
        record.chrom,
        record.pos,
        record.ref,
        record.alts[0],
        '{}[{}/{}]{}'.format(seq[:flank], record.ref, record.alts[0], seq[flank+len(record.ref):]),
        sep="\t"
    )

fin swimmer

EDIT:

I rewrote the code. The version before had an off-by-one error.

ADD COMMENTlink modified 24 months ago • written 24 months ago by finswimmer13k

@finswimmer Hi! Many thanks. I need some time for the code understanding.

ADD REPLYlink written 24 months ago by Denis190

I added some comments to the code. If something isn't clear just ask.

ADD REPLYlink written 24 months ago by finswimmer13k
2
gravatar for Pierre Lindenbaum
23 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

A one-liner using biostar251649 ( http://lindenb.github.io/jvarkit/Biostar251649.html ) and bioalcidaejdk ( http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html )

    $ java -jar dist/biostar251649.jar -R src/test/resources/rotavirus_rf.fa src/test/resources/rotavirus_rf.vcf.gz -n 20 |\
    java -jar dist/bioalcidaejdk.jar -F VCF -e 'stream().forEach(V->println(">"+V.getContig()+":"+V.getStart()+"\n"+V.getAttribute("SEQ5_20")+"["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]"+V.getAttribute("SEQ3_20")));'

first command adds two attribute to the VCF containing the flanking sequences.

the second command:

  • stream() convert to a stream of variant

  • forEach(V->println(">" for each variant start writing the fasta header

  • V.getContig()+":"+V.getStart()+"\n" write the fasta header

  • V.getAttribute("SEQ5_20") write 5' seq

  • ["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]" write all alleles separated with '/'

  • V.getAttribute("SEQ3_20"))) write 3' seq

output:

>RF01:970
TGGTCGATTGCTCTATTGAA[A/C]AATTTCCATTGATGGCTAAA
>RF02:251
TGCTTGAAGTTTTGAAAACA[A/T]AAGAAGAGCACCAAAAAGAG
>RF02:578
TACTATTAAAAGATATGGCA[G/A]TTGAAAATAAAAATTCGAGA
>RF02:877
GACGTTAACTATATACTTAA[T/A]ATGGACAGAAATCTGCCATC
>RF02:1726
AATATGCAACATGTTCAGAC[T/G]TTGACAACAGAAAAATTACA
>RF02:1962
TGAAGATTTTTTAAAAAGAT[TACA/TA]TATTTTCGATGTAGCTAGAG
>RF03:1221
TGACTGCCATGGATTTAGAA[C/G]TACCGATATCTGCGAAATTA
>RF03:1242
TACCGATATCTGCGAAATTA[C/A]TACACCACCCAACTACAGAA
>RF03:1688
TCAAAACCAACTGGAAATAA[T/G]CATCTGTTCATTTTGAGCGG
>RF03:1708
TCATCTGTTCATTTTGAGCG[G/T]TACTAATAAATATTTTAAAT
>RF03:2150
GCCGATGATCCGAACTATTT[T/A]ATTGGAATTAAGTTCAAAAA
>RF03:2201
GAATATGATGTTAAAGTACC[G/C]CATCTTACATTTGGTGTATT
ADD COMMENTlink modified 23 months ago • written 23 months ago by Pierre Lindenbaum129k
0
gravatar for Denis
24 months ago by
Denis190
Denis190 wrote:

I slightly modified the script provided by finswimmer. Now it's compatible with python 2.X, writes output to the text file instead of STDOUT. Besides, ID of each SNP is present in output file as separate field. Probably, some additional optimisation and syntax polishing are required, but even this version works fine for me.

from __future__ import print_function
import pysam



f = open("SNP_CDS.txt", "w")

vcf = pysam.VariantFile("test.vcf")

genome = pysam.FastaFile("genome.fasta")

L=[]

flank=10



for record in vcf:

    seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)
    t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])
    L.append(t)



for i in L:

    seq,chr,pos,id,ref,alt=i

    sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])

    line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)

    f.write(line+'\n')



f.close()
ADD COMMENTlink modified 24 months ago • written 24 months ago by Denis190
2

Hello Denis,

Probably, some additional optimisation and syntax polishing are required

if you like, here are my notes about this:

  1. Use python3 :) Whenever you write a new script with python take python3 instead of python2. No one knows how long python2 will be supported
  2. In bioinformatics it is common that program writes to stdout by default. Doing so the user can decide whether it should be piped to another program or the output is redirected to a file with $ script.py > outputfile
  3. When working with reading/writing files it is a good style to work with the with statement. Doing so you don't need to take care of closing files.
  4. Theres no need to first append your data to a list and iterate afterwards over that list to write the output to the file. Just write it directly to it.

fin swimmer

ADD REPLYlink written 23 months ago by finswimmer13k

Hello finswimmer! Agree! Many thanks for your comments! They are extremely useful. I've changed the code (added with statement). Please see the newer version below. I'm wondering how can i adjust the code regarding your the 4-th point? Could you do it please?

ADD REPLYlink modified 23 months ago • written 23 months ago by Denis190
0
gravatar for Denis
23 months ago by
Denis190
Denis190 wrote:

Improved version

from __future__ import print_function
import pysam


    vcf = pysam.VariantFile("Sun_CDS.vcf")

    genome = pysam.FastaFile("genome.fasta")

    flank=62


    with open("SNP_CDS.txt", "w") as f:

        for record in vcf:

            seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)

            t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])

            seq,chr,pos,id,ref,alt=t

            sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])

            line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)

            f.write(line+'\n')
ADD COMMENTlink modified 23 months ago • written 23 months ago by Denis190

Hi Denis, I used following code to do exactly what you are doing using pysam. Problem is my code is not very fast for bigger genomes. So I switched to the aforementioned code posted by you. It works when you set a lower flank score. Suppose I need a flanking sequence of 500bp on 5' ......pos......3' and I have first variant at pos [25] it throws an error. Could you suggest me a way to deal with this problem? How can I modify your code that if the variant is a pos 25 and it does not finds a 500 bp flanking sequence at start, it must pick the total bases present at start.

My code is given below. It works for smaller genomes but takes longer for large genomes and It also works for variants having flanking sequence present at start less than the mentioned flank value.

fr=open("test.vcf","r")
a=fr.read()
aa=a.split("#CHROM")
aaa=aa[1].split("\n")
seq=''
header=''
sequence={}
new_dict={}
with open("genome.fasta") as fs:

for linee in fs:

       if linee.startswith(">"):

            header=linee.split("\n")[0]
            seq=''

       else:
             seq=seq+linee.split("\n")[0]
             sequence.update({header:seq})



for line in aaa[1:-1]:
index1 = line.split('\t')[0]
index= line.split()

head=">"+index[0]+" "+index[1]+" "+index[2]+" "+index[3]+" "+index[4]+" "+" "+index[5]+" "+index[6]+" "+index[7]+" "+index[8]+" "+index[9]+" "+index[10]+" "+index[11]+" "+index[12]+" "+" "+index[13]+" "+index[14]+" "+index[15]+" "+index[16]+" "+index[17]

for keys, values in sequence.items():
    if index1 in keys:
                   site=int(index[1])-1  

                   start=abs(site-500)  #5' direction
                   end=site+500   #3' direction
                   seqq=values[start:end+1]
                   new_dict.update({head:seqq})
  print(new_dict)

  fw=open("sequences_per_variants.txt","w")
   for base,basee in new_dict.items():
   fw.write(str(base)+" "+str(basee)+"\n")
   fw.close()
ADD REPLYlink modified 6 months ago • written 6 months ago by nadiabeg.comsats0

Hi, It seems that you have to add additional block with if statement to check if record.pos-1-flank is greater than 0. Perhaps the same aproach and for record.pos-1+len(record.ref)+flank it should be smaler or at least equal than chromosome length. But i'm not sure as i din't run it.

ADD REPLYlink written 6 months ago by Denis190
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1080 users visited in the last hour