I have to do some work with an organism doing transplicing and RNA-Seq (50bp, stranded but unpaired). It is fairly easy to map most of the RNA-Seq reads (no introns), but I want to get more out of this data, in this case transcription starts.
I managed to extract reads containing splice leader (SL) or its part, and mapped these to standard genome for this organism. Problem is that for some genes I can not find even a semi-plausible mappings upstream of their ORFs. I have tried 5 different mappers for that purpose (last, bowtie, GEM, bfast and bwa => results differ) but lack of some mappings is not due to algorithm used.
Since there are some differences between strains, I was thinking about constructing genome sequence closer to my strain, by using mappings from the whole RNA-Seq, looking for SNPs/indels, then picking the most frequent variant in my data set for constructing new genome sequence. That way I hope to cope with genome differences which may affect mappings of SL-containing reds. BTW, I remove the SL sequence, so the reads I am trying to map are 50bp minus 8 up to 39bp shorter.
Question: are there any tools for: reference.genome.fas + VCF => my_strain.genome.fas conversion?
EDIT GATK has a tool (FastaAlternateReferenceMaker) described here but it required input is in ".rod" format. Thanks to swbarnes2 for pointing me to GATK.
EDIT 2 Just to clarify how I got VCF file:
samtools mpileup -ugf my_reference_genome.fa s_1.bam s_2.bam s_3.bam s_4.bam | bcftools view -bvcg - > s1-4.var.raw.bcf bcftools view s1-4.var.raw.bcf | vcfutils.pl varFilter -D 100 > s1-4.var.flt.vcf
The answer is:
java -jar ~/soft/GenomeAnalysisTK-1.0-6121-g8a78414/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R my_reference_genome.fa -o my_NEW_genome.fa -B:testsnp,VCF s1-4.var.flt.vcf
EDIT 4 GATK changed syntax:
java -jar ~/soft/GenomeAnalysisTK_1.4/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R my_reference_genome.fa -o my_NEW_genome.fa --variant s1-4.var.flt.vcf
Does it handle indels? I guess not?
See also: New Fasta Sequence From Reference Fasta And Variant Calls File? (GATK's FastaAlternateReferenceMaker as accepted answer)
@gaffa: many thanks. I fixed the link in the answer to your question.
@lh3: I just did the test selecting from s1-4.var.flt.vcf only entries with quality 999 and "INDEL". The resulting fasta file shows differences with diff. So I am quite positive it must incorporate indels for creating alternative reference.