Question: Extract SNPs flanking sequences based on VCF and genome Fasta files
1
gravatar for Denis
10 months ago by
Denis100
Denis100 wrote:

I have VCF file (containing diallelic variants) and reference genome in Fasta of some non-model plant. I'd like to extract SNPs flanking sequences. I've found that bedtools and samtools faidx could be to some extent useful, but apparently don't solve the issue. I need to get output (fasta or tabular file) with following SNPs representation : TCTCTGCCAATCACTAGAGGCCGCTTTCGCTTTTA[A/G]TTTGTGTGTGGTCAGAGTTCTTCCGGACTTT

snp sequence genome • 1.0k views
ADD COMMENTlink modified 10 months ago by Pierre Lindenbaum121k • written 10 months ago by Denis100
4
gravatar for finswimmer
10 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Here a little python solution:

import pysam


# open vcf file
vcf = pysam.VariantFile("input.vcf")
# open fasta file
genome = pysam.FastaFile("genome.fa")
# define by how many bases the variant should be flanked
flank = 50

# iterate over each variant
for record in vcf:
    # extract sequence
    #
    # The start position is calculated by subtract the number of bases
    # given by 'flank' from the variant position. The position in the vcf file
    # is 1-based. pysam's fetch() expected 0-base coordinate. That's why we
    # need to subtract on more base.
    #
    # The end position is calculated by adding the number of bases
    # given by 'flank' to the variant position. We also need to add the length
    # of the REF value and subtract again 1 due to the 0-based/1-based thing.
    #
    # Now we have the complete sequence like this:
    # [number of bases given by flank]+REF+[number of bases given by flank]
    seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)

    # print out tab seperated columns:
    # CRHOM, POS, REF, ALT, flanking sequencing with variant given in the format '[REF/ALT]'
    print(
        record.chrom,
        record.pos,
        record.ref,
        record.alts[0],
        '{}[{}/{}]{}'.format(seq[:flank], record.ref, record.alts[0], seq[flank+len(record.ref):]),
        sep="\t"
    )

fin swimmer

EDIT:

I rewrote the code. The version before had an off-by-one error.

ADD COMMENTlink modified 10 months ago • written 10 months ago by finswimmer11k

@finswimmer Hi! Many thanks. I need some time for the code understanding.

ADD REPLYlink written 10 months ago by Denis100

I added some comments to the code. If something isn't clear just ask.

ADD REPLYlink written 10 months ago by finswimmer11k
2
gravatar for Pierre Lindenbaum
10 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

A one-liner using biostar251649 ( http://lindenb.github.io/jvarkit/Biostar251649.html ) and bioalcidaejdk ( http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html )

    $ java -jar dist/biostar251649.jar -R src/test/resources/rotavirus_rf.fa src/test/resources/rotavirus_rf.vcf.gz -n 20 |\
    java -jar dist/bioalcidaejdk.jar -F VCF -e 'stream().forEach(V->println(">"+V.getContig()+":"+V.getStart()+"\n"+V.getAttribute("SEQ5_20")+"["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]"+V.getAttribute("SEQ3_20")));'

first command adds two attribute to the VCF containing the flanking sequences.

the second command:

  • stream() convert to a stream of variant

  • forEach(V->println(">" for each variant start writing the fasta header

  • V.getContig()+":"+V.getStart()+"\n" write the fasta header

  • V.getAttribute("SEQ5_20") write 5' seq

  • ["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]" write all alleles separated with '/'

  • V.getAttribute("SEQ3_20"))) write 3' seq

output:

>RF01:970
TGGTCGATTGCTCTATTGAA[A/C]AATTTCCATTGATGGCTAAA
>RF02:251
TGCTTGAAGTTTTGAAAACA[A/T]AAGAAGAGCACCAAAAAGAG
>RF02:578
TACTATTAAAAGATATGGCA[G/A]TTGAAAATAAAAATTCGAGA
>RF02:877
GACGTTAACTATATACTTAA[T/A]ATGGACAGAAATCTGCCATC
>RF02:1726
AATATGCAACATGTTCAGAC[T/G]TTGACAACAGAAAAATTACA
>RF02:1962
TGAAGATTTTTTAAAAAGAT[TACA/TA]TATTTTCGATGTAGCTAGAG
>RF03:1221
TGACTGCCATGGATTTAGAA[C/G]TACCGATATCTGCGAAATTA
>RF03:1242
TACCGATATCTGCGAAATTA[C/A]TACACCACCCAACTACAGAA
>RF03:1688
TCAAAACCAACTGGAAATAA[T/G]CATCTGTTCATTTTGAGCGG
>RF03:1708
TCATCTGTTCATTTTGAGCG[G/T]TACTAATAAATATTTTAAAT
>RF03:2150
GCCGATGATCCGAACTATTT[T/A]ATTGGAATTAAGTTCAAAAA
>RF03:2201
GAATATGATGTTAAAGTACC[G/C]CATCTTACATTTGGTGTATT
ADD COMMENTlink modified 10 months ago • written 10 months ago by Pierre Lindenbaum121k

V.getAttribute("SEQ3_20"))) write 5' seq

Assuming a typo here: 5' -> 3'

ADD REPLYlink written 10 months ago by WouterDeCoster40k

fixed . Thanks

ADD REPLYlink written 10 months ago by Pierre Lindenbaum121k
0
gravatar for Denis
10 months ago by
Denis100
Denis100 wrote:

I slightly modified the script provided by finswimmer. Now it's compatible with python 2.X, writes output to the text file instead of STDOUT. Besides, ID of each SNP is present in output file as separate field. Probably, some additional optimisation and syntax polishing are required, but even this version works fine for me.

from __future__ import print_function
import pysam



f = open("SNP_CDS.txt", "w")

vcf = pysam.VariantFile("test.vcf")

genome = pysam.FastaFile("genome.fasta")

L=[]

flank=10



for record in vcf:

    seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)
    t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])
    L.append(t)



for i in L:

    seq,chr,pos,id,ref,alt=i

    sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])

    line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)

    f.write(line+'\n')



f.close()
ADD COMMENTlink modified 10 months ago • written 10 months ago by Denis100
2

Hello Denis,

Probably, some additional optimisation and syntax polishing are required

if you like, here are my notes about this:

  1. Use python3 :) Whenever you write a new script with python take python3 instead of python2. No one knows how long python2 will be supported
  2. In bioinformatics it is common that program writes to stdout by default. Doing so the user can decide whether it should be piped to another program or the output is redirected to a file with $ script.py > outputfile
  3. When working with reading/writing files it is a good style to work with the with statement. Doing so you don't need to take care of closing files.
  4. Theres no need to first append your data to a list and iterate afterwards over that list to write the output to the file. Just write it directly to it.

fin swimmer

ADD REPLYlink written 10 months ago by finswimmer11k

Hello finswimmer! Agree! Many thanks for your comments! They are extremely useful. I've changed the code (added with statement). Please see the newer version below. I'm wondering how can i adjust the code regarding your the 4-th point? Could you do it please?

ADD REPLYlink modified 10 months ago • written 10 months ago by Denis100
0
gravatar for Denis
10 months ago by
Denis100
Denis100 wrote:

Improved version

from __future__ import print_function
import pysam


    vcf = pysam.VariantFile("Sun_CDS.vcf")

    genome = pysam.FastaFile("genome.fasta")

    flank=62


    with open("SNP_CDS.txt", "w") as f:

        for record in vcf:

            seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)

            t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])

            seq,chr,pos,id,ref,alt=t

            sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])

            line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)

            f.write(line+'\n')
ADD COMMENTlink modified 10 months ago • written 10 months ago by Denis100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 928 users visited in the last hour