Question: New Fasta Sequence From Reference Fasta And Variant Calls File?
16
gravatar for Gaffa
8.2 years ago by
Gaffa490
Gaffa490 wrote:

Is there some utility that will take as input a reference sequence in fasta format and a file of variant calls, with both SNPs and short indels (i.e. VCF or similar), and output a new sequence, identical to the input reference except with the new variants introduced?

It wouldn't be a super tricky thing to script, but since it seems like something that would be relatively common to want to do, I'm wondering if there isn't already some available tools for doing this. Yet I haven't found any.

For simplicity let's assume a single haploid chromosome (trickier for diploid individuals with heterozygous calls).

Example:

reference sequence:
AAATTTAGAA

variant calls:
POS  REF ALT
2    A    C
5    TT    T
8    G    GCC
10    A    T

new sequence:
ACATTAGCCAT
vcf fasta consensus variant • 26k views
ADD COMMENTlink modified 2.5 years ago by FatihSarigol120 • written 8.2 years ago by Gaffa490

Hi gaffa. Would you mind putting a few example sequences and variants and resulting sequences? For a given input sequence, do you want only one output sequence or many containing all possible combinations of the variants for that sequence? Cheers

ADD REPLYlink written 8.2 years ago by Eric Normandeau10k

If I want all the possible combinations, is there a tool for doing that?
 

ADD REPLYlink written 4.8 years ago by eva10

@Eric: I have added an example. For a given input sequence I want only a single output sequence. I.e. I have sequenced some individual, called variants in relation to a reference genome, and now I want to construct the full sequence of the individual (for simplicity assume a single haploid chromosome).

ADD REPLYlink written 8.2 years ago by Gaffa490
13
gravatar for Vasisht
8.1 years ago by
Vasisht150
Vasisht150 wrote:

GATK has an utility to create an alternate reference fasta file

GATK Fasta Alternate Reference

ADD COMMENTlink modified 24 months ago by WouterDeCoster39k • written 8.1 years ago by Vasisht150
7

Link provided above is broken now; instead use http://www.broadinstitute.org/gsa/gatkdocs/release/org_broadinstitute_sting_gatk_walkers_fasta_FastaAlternateReferenceWalker.html

ADD REPLYlink written 7.7 years ago by David Quigley11k

New link: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_fasta_FastaAlternateReference.html

ADD REPLYlink written 6.7 years ago by Moss20

And again a link fix: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_fasta_FastaAlternateReferenceMaker.html

ADD REPLYlink written 6.2 years ago by Konrad690

And again a new one

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php

ADD REPLYlink written 4.6 years ago by Amy10
3

And again:

https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php

ADD REPLYlink written 3.3 years ago by lexnederbragt1.2k
1

And now I fixed the link in the original answer :)

ADD REPLYlink written 24 months ago by WouterDeCoster39k

Thanks, that's exactly what I was looking for!

ADD REPLYlink written 8.0 years ago by Gaffa490

just saying one should be careful about this tool as it may shift frames in positions where there should be N's

ADD REPLYlink written 23 months ago by FatihSarigol120
6
gravatar for Jorge Amigo
6.0 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

although this is an old post, I can think of many users landing here seeking for advice, so allow me to add that vcf-tools has this precisely capability embeded in its vcf-consensus

ADD COMMENTlink written 6.0 years ago by Jorge Amigo11k

I am trying to the same but getting an error. Can you help me out stepwise ? I am not good at this stuff. I have a reference fasta and the vcf file. Now need to generate the sequence using it

ADD REPLYlink written 3.1 years ago by mohammedabeddin110

there are now many other tools to generate consensus from vcf file, but if you want to do it with vcf-consensus, this is the way to do it (from the vcftools manual):

cat ref.fa | vcf-consensus file.vcf.gz > out.fa
ADD REPLYlink written 3.1 years ago by Jorge Amigo11k

This error: cat: 123.fa: Is a directory �9=���w��9��w������9������A�9��A�3�K�9�3�K�����9����$c \ Broken VCF header, no column names? at /usr/share/perl5/Vcf.pm line 177, <__ANONIO__> line 1. Vcf::throw('Vcf4_1=HASH(0x12b2a48)', 'Broken VCF header, no column names?') called at /usr/share/perl5/Vcf.pm line 869 VcfReader::_read_column_names('Vcf4_1=HASH(0x12b2a48)') called at /usr/share/perl5/Vcf.pm line 604 VcfReader::parse_header('Vcf4_1=HASH(0x12b2a48)') called at /usr/bin/vcf-consensus line 54 main::do_consensus('HASH(0xdd5cb8)') called at /usr/bin/vcf-consensus line 9

ADD REPLYlink written 3.1 years ago by mohammedabeddin110

it seems like you may have extracted the reference into a folder, and that the vcf file you're using is wrongly formatted too. but since you've already been through this answer, I would stick to it since the samtools + bcftools option is much faster and efficient than cat + vcf-tools, plus it allows you to deal with compressed files directly.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Jorge Amigo11k
5
gravatar for Francisco Roque
8.2 years ago by
Bergen, Norway
Francisco Roque110 wrote:

This will not apply directly to the vcf file, but using extra annotation from bcftools (assuming you are using it for snp and indel calling):

samtools mpileup -uf reference.fasta alignment.bam | \
bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fastq

After that just convert the fastq into a consensus fasta using seqtk, converting all bases with quality lower than 20 to lowercase

seqtk fq2fa consensus.fastq 20 > consensus.fasta

Note: This answer was adapted from http://biostar.stackexchange.com/questions/1389/how-to-generate-a-consensus-fasta-sequence-from-sam-tools-pileup

ADD COMMENTlink written 8.2 years ago by Francisco Roque110
1
gravatar for Pierre Lindenbaum
8.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

Just for fun, using XSLT. The following stylesheet inserts/deletes a sequence from a TSeq_sequence record.


<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" href="http://www.w3.org/1999/XSL/Transform" rel="nofollow">http://www.w3.org/1999/XSL/Transform" "="" rel="nofollow">http://www.w3.org/1999/XSL/Transform'
        version='1.0'
        >

<xsl:output method="xml" indent="yes"/>
<xsl:param name="index" select="0"/>
<xsl:param name="length" select="0"/>
<xsl:param name="insert"></xsl:param>
<xsl:variable name="smallcase" select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

<xsl:template match="/">
<xsl:apply-templates select="*"/>
</xsl:template>

<xsl:template match="*"> 
<xsl:copy>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:copy> 
</xsl:template>

<xsl:template match="TSeq_sequence">
<TSeq_sequence>
    <xsl:choose>
        <xsl:when test="$index &gt; 0">
            <xsl:variable name="seq" select="translate(normalize-space(text()), $uppercase, $smallcase)"/>
            <xsl:value-of select="substring($seq,1,$index - 1)"/>
            <xsl:value-of select="translate($insert, $smallcase, $uppercase)"/>
            <xsl:value-of select="substring($seq,$index + $length)"/>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="."/>
        </xsl:otherwise>
    </xsl:choose>
</TSeq_sequence>
</xsl:template>

</xsl:stylesheet>

Example:

xsltproc --param index '2'  \
    --param length 1 \
    --stringparam insert "HelloWorld" \
    --novalid jeter.xsl  \
    "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=xml&id=112950042"


<TSeqSet>
<TSeq>
(...)
<TSeq_sequence>
  cHELLOWORLDtgctttcagttgctaccgggtcatgccgagcactctgaaaggactgcccaggataacggggaggaaggtggggatgacgtcaagtcancatggcctttatgcctggggccacacacttgctaccctggcccgtaccaagcgctgc</TSeq_sequence>
</TSeq>

</TSeqSet>
ADD COMMENTlink written 8.2 years ago by Pierre Lindenbaum120k
0
gravatar for castelli
3.0 years ago by
castelli0
castelli0 wrote:

You may use vcfx. It creates a fasta file (two sequences per sample) using a reference sequence and replacing each variable site on the right location. It supports indels. Take a look here: www.castelli-lab.net/apps/vcfx

ADD COMMENTlink written 3.0 years ago by castelli0
1

Again: C: VCF to FASTA

ADD REPLYlink written 3.0 years ago by Matt Shirley9.0k
0
gravatar for FatihSarigol
2.5 years ago by
FatihSarigol120
Durham
FatihSarigol120 wrote:

I found a way eventually and wanted to come back to this post to share, and noticed that Francisco Roque's answer was almost the same, but in case that doesn't work, I have just a little bit different version:

samtools mpileup -d8000 -q 20 -Q 10 -uf  REFERENCE.fasta Your_File.bam | bcftools call -c | vcfutils.pl vcf2fq  > OUTPUT.fastq

-d, --max-depth == -q, -min-MQ Minimum mapping quality for an alignment to be used == -Q, --min-BQ Minimum base quality for a base to be considered == (You can use different values of course, see http://www.htslib.org/doc/samtools.html)

Which generated for me a weird format that looks like fastq but isn't, so you can't convert it using a converter, but you can use the following sed command, which I wrote specific for this output:

sed -i -e '/^+/,/^\@/{/^+/!{/^\@/!d}}; /^+/ d; s/@/>/g' OUTPUT.fastq

In the end, make sure to compare your new fasta files to your reference to be sure that everything is fine.

EDIT BE CAREFUL WITH THE SED COMMAND IT MAY DELETE SOME OF YOUR READS IN DIFFERENT CASES OF QUALITY SCORING THAN I HAD

ADD COMMENTlink modified 15 months ago • written 2.5 years ago by FatihSarigol120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1799 users visited in the last hour