Input on Protein Prediction from NGS data?
6 weeks ago
Ared445 ▴ 30

I have RNA-seq and exome data from various tumor samples. My workflow so far:

  1. Align reads to the reference using STAR
  2. Call SNPs + indels using Samtools -> VarScan2 (7,000-50,000 variants per sample)
  3. Generate a tumor reference from the .vcf using GATK FastaAlternateReferenceMaker
  4. Use Bedtools to extend the SNP + indel intervals and pull the sequences in FASTA format
  5. Translate the FASTA sequences to protein

I run into two challenges. 1) The GATK tool alters the chromosome labels in its output, so when I later pull sequences using the .vcf file, some percentage of variants is not retrieved. I have been trying to fix this manually. 2) I suspect I may be altering the protein sequence by adding nucleotides to each side of the interval without accounting for where the reading frame starts.
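To illustrate the first challenge, here is a minimal Python sketch of the kind of fix I have been doing by hand. It assumes (I have not confirmed this for every GATK version) that the altered FASTA keeps the contigs in the same order as the original reference and only replaces the header names with sequential labels. The function name `restore_contig_names` and the example contig names are hypothetical.

```python
def restore_contig_names(fasta_lines, original_names):
    """Replace each FASTA header with the original contig name, by record order.

    Assumption: the i-th record in the altered FASTA corresponds to the
    i-th contig of the original reference (e.g. the order in its .fai index).
    """
    names = iter(original_names)
    restored = []
    for line in fasta_lines:
        if line.startswith(">"):
            try:
                restored.append(">" + next(names))
            except StopIteration:
                raise ValueError("more FASTA records than contig names") from None
        else:
            # Sequence lines are passed through unchanged.
            restored.append(line)
    return restored


# Toy example: headers were renamed to 1, 2 by the tool.
altered = [">1", "ACGT", ">2", "GGCC"]
fixed = restore_contig_names(altered, ["chr1", "chr2"])
print(fixed)  # [">chr1", "ACGT", ">chr2", "GGCC"]
```

With the headers restored, the chromosome names in the .vcf and in the FASTA should match again, which is what the later interval lookup depends on.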

Could anyone suggest a more efficient way to retrieve nucleotide FASTA sequences for translation using the positions of the SNPs + indels? I am quite new to bioinformatics.
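To make the reading-frame concern concrete, this is a rough Python sketch of what I think a frame-aware window would look like, compared with padding the interval by a fixed number of bases. It is only a toy under strong assumptions: the sequence is an already spliced, forward-strand transcript, the coding start (`cds_start`) is known from an annotation, and the variant is a SNP (an indel would shift the frame downstream). The names `in_frame_window`, `translate` and the example transcript are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def in_frame_window(cds_start, pos, flank_codons):
    """Return (start, end) of a window around pos snapped to codon boundaries.

    Coordinates are 0-based, end-exclusive. Assumes pos >= cds_start and
    no introns between cds_start and pos (i.e. a spliced transcript).
    The end may run past the sequence; slicing simply truncates it.
    """
    codon_index = (pos - cds_start) // 3
    first_codon = max(codon_index - flank_codons, 0)
    start = cds_start + 3 * first_codon
    end = cds_start + 3 * (codon_index + flank_codons + 1)
    return start, end


def translate(seq):
    """Translate from the first base; incomplete trailing codons are dropped."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


# Toy transcript: 2-base 5' UTR, then ATG GCC AAA TTT GGG TGA (M A K F G *).
transcript = "GG" + "ATGGCCAAATTTGGGTGA"
cds_start = 2
snp_pos = 9  # middle base of the AAA codon

start, end = in_frame_window(cds_start, snp_pos, flank_codons=1)
print(translate(transcript[start:end]))              # AKF (in frame)
print(translate(transcript[snp_pos - 3:snp_pos + 4]))  # PN (fixed padding, out of frame)
```

The second print is what I suspect is happening in my step 4: a fixed number of flanking bases starts the window in the middle of a codon, so the translated peptide does not match the real protein.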

Soon I would also like to use novel transcript data to predict novel proteins.

Thank you very much.

Workflow Advice • 72 views
