Question: Standard Genome Plus Vcf To Variant Genome
2
gravatar for Darked89
7.7 years ago by
Darked894.2k
Barcelona, Spain
Darked894.2k wrote:

I have to do some work with an organism doing transplicing and RNA-Seq (50bp, stranded but unpaired). It is fairly easy to map most of the RNA-Seq reads (no introns), but I want to get more out of this data, in this case transcription starts.

I managed to extract reads containing splice leader (SL) or its part, and mapped these to standard genome for this organism. Problem is that for some genes I can not find even a semi-plausible mappings upstream of their ORFs. I have tried 5 different mappers for that purpose (last, bowtie, GEM, bfast and bwa => results differ) but lack of some mappings is not due to algorithm used.

Since there are some differences between strains, I was thinking about constructing genome sequence closer to my strain, by using mappings from the whole RNA-Seq, looking for SNPs/indels, then picking the most frequent variant in my data set for constructing new genome sequence. That way I hope to cope with genome differences which may affect mappings of SL-containing reds. BTW, I remove the SL sequence, so the reads I am trying to map are 50bp minus 8 up to 39bp shorter.

Question: are there any tools for: reference.genome.fas + VCF => my_strain.genome.fas conversion?

EDIT GATK has a tool (FastaAlternateReferenceMaker) described here but it required input is in ".rod" format. Thanks to swbarnes2 for pointing me to GATK.

EDIT 2 Just to clarify how I got VCF file:

samtools mpileup -ugf my_reference_genome.fa s_1.bam s_2.bam s_3.bam s_4.bam | bcftools view -bvcg - > s1-4.var.raw.bcf 
bcftools view s1-4.var.raw.bcf   | vcfutils.pl varFilter -D 100 > s1-4.var.flt.vcf

EDIT 3

The answer is:

java -jar ~/soft/GenomeAnalysisTK-1.0-6121-g8a78414/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R my_reference_genome.fa
-o my_NEW_genome.fa -B:testsnp,VCF s1-4.var.flt.vcf

EDIT 4 GATK changed syntax:

java -jar ~/soft/GenomeAnalysisTK_1.4/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R  my_reference_genome.fa -o  my_NEW_genome.fa --variant  s1-4.var.flt.vcf
genome vcf • 4.3k views
ADD COMMENTlink modified 7.2 years ago by Swbarnes21.4k • written 7.7 years ago by Darked894.2k
2

Does it handle indels? I guess not?

ADD REPLYlink written 7.7 years ago by lh331k
1

See also: http://biostar.stackexchange.com/questions/6576/new-fasta-sequence-from-reference-fasta-and-variant-calls-file (GATK's FastaAlternateReferenceMaker as accepted answer)

ADD REPLYlink written 7.7 years ago by Gaffa490

@gaffa: many thanks. I fixed the link in the answer to your question.

ADD REPLYlink written 7.7 years ago by Darked894.2k

@lh3: I just did the test selecting from s1-4.var.flt.vcf only entries with quality 999 and "INDEL". The resulting fasta file shows differences with diff. So I am quite positive it must incorporate indels for creating alternative reference.

ADD REPLYlink written 7.7 years ago by Darked894.2k
2
gravatar for Swbarnes2
7.7 years ago by
Swbarnes21.4k
Swbarnes21.4k wrote:

Yes, but they aren't considered great. samtools/bcftools vcf2fq will do that, but it will simply ignore indels. GATK might also have one, but I'm not as familiar with what software, so I'm not sure how well it works.

ADD COMMENTlink written 7.7 years ago by Swbarnes21.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 933 users visited in the last hour