I have RNA-seq and exome data from various tumor samples. This has been my workflow so far:
- Align to reference using STAR
- Call SNPs + indels using Samtools -> VarScan2 (7,000-50,000 variants per sample)
- Generate a tumor reference from the .vcf using GATK FastaAlternateReferenceMaker
- Use Bedtools to extend the SNP + indel intervals and pull the sequences in fasta format
- Translate fastas to protein
I run into two challenges:
1) The GATK tool alters the chromosome labels in the fasta it writes, so when I later pull sequences using the coordinates from the .vcf file, some percentage of the variants are not retrieved. I have been trying to fix this manually.
2) I suspect I may be altering the protein sequence by adding nucleotides to each side of the interval without accounting for where the reading frame starts.
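To illustrate what I mean by fixing the labels for challenge 1, here is a minimal Python sketch. It assumes the headers in the GATK output look like `>1 chr1:1-248956422` (a running index followed by the original contig name and range). That format is an assumption on my part and may differ between GATK versions. The function name `restore_contig_names` is just made up for this example.

```python
import re

# Assumed header shape: ">1 chr1:1-248956422"
# (running index, then original contig name and its range).
HEADER_RE = re.compile(r"^>(\S+)\s+(\S+?):(\d+)-(\d+)\s*$")

def restore_contig_names(lines):
    """Yield FASTA lines, rewriting matching headers to the original contig name."""
    for line in lines:
        m = HEADER_RE.match(line)
        if m:
            # Keep only the original contig name so it matches the .vcf / BED labels.
            yield ">" + m.group(2) + "\n"
        else:
            # Sequence lines and headers in any other format pass through unchanged.
            yield line

example = [">1 chr1:1-12\n", "ACGTACGTACGT\n", ">2 chrX:1-8\n", "GGGGCCCC\n"]
fixed = list(restore_contig_names(example))
```

On a real file I would pass an open file handle instead of the `example` list and write the yielded lines to a new fasta, then re-index it before running Bedtools.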
Could anyone suggest a possibly more efficient way to retrieve nucleotide fastas for translation using information on position of SNPs + Indels? I am quite new to bioinformatics.
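To make the reading-frame concern concrete, here is a small Python sketch of the kind of retrieval I think I need. It assumes the variant position is already expressed relative to the start of a plus-strand coding sequence (0-based), which my current genomic intervals are not, so mapping from genomic to CDS coordinates would still need an annotation. The toy sequence and the helper names `translate` and `codon_aligned_window` are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate from the first base; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def codon_aligned_window(cds, var_pos, flank_codons):
    """Return a 0-based half-open (start, end) slice around var_pos,
    snapped to codon boundaries of the CDS."""
    codon_start = (var_pos // 3) * 3
    start = max(0, codon_start - 3 * flank_codons)
    end = min(len(cds) - len(cds) % 3, codon_start + 3 * (flank_codons + 1))
    return start, end

cds = "ATGGCCAAGTTTGGATAA"             # toy CDS: M A K F G *
var_pos = 7                             # SNP A>C inside the third codon
mutated = cds[:var_pos] + "C" + cds[var_pos + 1:]

# Naive fixed padding of 5 nt on each side ignores the frame.
naive = mutated[var_pos - 5:var_pos + 6]

# Codon-aligned padding of one codon on each side keeps the frame.
start, end = codon_aligned_window(mutated, var_pos, 1)
in_frame = mutated[start:end]

print(translate(naive))      # GHV (wrong frame)
print(translate(in_frame))   # ATF (matches the full-length translation)
```

In this toy case the naive window gives a peptide that does not occur in the full translation `MATFG*`, which is what I suspect is happening with my Bedtools-extended intervals. Indels that shift the frame would need additional handling that this sketch does not cover.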
I would also like to use novel transcript data to predict novel proteins soon.
Thank you very much.