Question

Standard Genome Plus Vcf To Variant Genome

2

Entering edit mode

12.7 years ago

Darked89 4.6k

I have to do some work with an organism doing transplicing and RNA-Seq (50bp, stranded but unpaired). It is fairly easy to map most of the RNA-Seq reads (no introns), but I want to get more out of this data, in this case transcription starts.

I managed to extract reads containing splice leader (SL) or its part, and mapped these to standard genome for this organism. Problem is that for some genes I can not find even a semi-plausible mappings upstream of their ORFs. I have tried 5 different mappers for that purpose (last, bowtie, GEM, bfast and bwa => results differ) but lack of some mappings is not due to algorithm used.

Since there are some differences between strains, I was thinking about constructing genome sequence closer to my strain, by using mappings from the whole RNA-Seq, looking for SNPs/indels, then picking the most frequent variant in my data set for constructing new genome sequence. That way I hope to cope with genome differences which may affect mappings of SL-containing reds. BTW, I remove the SL sequence, so the reads I am trying to map are 50bp minus 8 up to 39bp shorter.

Question: are there any tools for: reference.genome.fas + VCF => my_strain.genome.fas conversion?

EDIT GATK has a tool (FastaAlternateReferenceMaker) described here but it required input is in ".rod" format. Thanks to swbarnes2 for pointing me to GATK.

EDIT 2 Just to clarify how I got VCF file:

samtools mpileup -ugf my_reference_genome.fa s_1.bam s_2.bam s_3.bam s_4.bam | bcftools view -bvcg - > s1-4.var.raw.bcf 
bcftools view s1-4.var.raw.bcf   | vcfutils.pl varFilter -D 100 > s1-4.var.flt.vcf

EDIT 3

The answer is:

java -jar ~/soft/GenomeAnalysisTK-1.0-6121-g8a78414/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R my_reference_genome.fa
-o my_NEW_genome.fa -B:testsnp,VCF s1-4.var.flt.vcf

EDIT 4 GATK changed syntax:

java -jar ~/soft/GenomeAnalysisTK_1.4/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R  my_reference_genome.fa -o  my_NEW_genome.fa --variant  s1-4.var.flt.vcf

genome vcf • 5.9k views

ADD COMMENT • link updated 12.3 years ago by Swbarnes2 ★ 1.6k • written 12.7 years ago by Darked89 4.6k

2

Entering edit mode

Does it handle indels? I guess not?

ADD REPLY • link 12.7 years ago by lh3 33k

1

Entering edit mode

See also: New Fasta Sequence From Reference Fasta And Variant Calls File? (GATK's FastaAlternateReferenceMaker as accepted answer)

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 12.7 years ago by Gaffa ▴ 500

0

Entering edit mode

@gaffa: many thanks. I fixed the link in the answer to your question.

ADD REPLY • link 12.7 years ago by Darked89 4.6k

0

Entering edit mode

@lh3: I just did the test selecting from s1-4.var.flt.vcf only entries with quality 999 and "INDEL". The resulting fasta file shows differences with diff. So I am quite positive it must incorporate indels for creating alternative reference.

ADD REPLY • link 12.7 years ago by Darked89 4.6k

score 2 · Answer 1 · 2011-08-11

2

Entering edit mode

12.7 years ago

Swbarnes2 ★ 1.6k

Yes, but they aren't considered great. samtools/bcftools vcf2fq will do that, but it will simply ignore indels. GATK might also have one, but I'm not as familiar with what software, so I'm not sure how well it works.

ADD COMMENT • link 12.7 years ago by Swbarnes2 ★ 1.6k