Deleted: Input on Protein Prediction from NGS data?
2.9 years ago
Ared445 ▴ 60

I have RNA-seq and exome data from various tumor samples. This has been my workflow so far:

  1. Align to reference using STAR
  2. Call SNPs + indels using Samtools -> VarScan2 (7,000-50,000 variants per sample)
  3. Generate tumor reference using .vcf + GATK FastaAlternateReferenceMaker
  4. Use Bedtools to extend SNP + indel intervals and pull sequences in fasta format
  5. Translate fastas to protein

I run into two challenges. 1) The GATK tool alters the chromosome labels, so when I later pull sequences using the .vcf file, some percentage of the variants are not retrieved. I have been trying to fix this manually. 2) I suspect I may be altering the protein sequence by adding nucleotides to each side of the interval without accounting for where the reading frame starts.
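For the first challenge, a minimal sketch of restoring the contig names in the alternate reference before pulling intervals. It assumes the rewritten headers look like `>1 chr1:1-248956422` (an index, then the original name with an optional coordinate suffix); that header format and the function name `restore_contig_names` are assumptions for illustration, so check a few headers in your own FASTA first. Contig names that themselves contain a colon would need a different pattern.

```python
import re

# Assumed header shape: ">1 chr1:1-248956422" (index, original name, optional range).
HEADER_RE = re.compile(r"^>\s*\d+\s+([^:\s]+)(?::\d+(?:-\d+)?)?\s*$")

def restore_contig_names(lines):
    """Yield FASTA lines with renumbered headers replaced by the original contig name.

    Headers that do not match the assumed pattern, and all sequence lines,
    are passed through unchanged.
    """
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            yield ">" + match.group(1) + "\n"
        else:
            yield line

if __name__ == "__main__":
    example = [">1 chr1:1-100\n", "ACGTACGT\n", ">2 chrX:1\n", "TTTT\n"]
    print("".join(restore_contig_names(example)), end="")
```

With the names restored, the contig column of the .vcf and the FASTA headers should match again, so the interval lookup no longer silently skips variants.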

Could anyone suggest a more efficient way to retrieve nucleotide FASTA sequences for translation using the positions of the SNPs and indels? I am quite new to bioinformatics.
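On the reading-frame concern, here is a rough sketch of extending a variant position by whole codons rather than by a fixed number of nucleotides. It assumes you already have the spliced coding sequence (CDS) of the affected transcript, starting at its first codon, and the variant position in 0-based CDS coordinates; the names `codon_aligned_window` and `translate` are hypothetical helpers, not part of any tool mentioned above. It only covers substitutions: an indel shifts the frame, so the mutated CDS would need to be translated through to the next stop codon instead.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(nt):
    """Translate from the first base; unknown codons become 'X', trailing bases are dropped."""
    nt = nt.upper()
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3))

def codon_aligned_window(cds, var_pos, flank_codons):
    """Return the CDS slice covering the variant's codon plus flank_codons on each side.

    The slice always starts and ends on a codon boundary, so translating it
    from its first base keeps the original reading frame.
    """
    codon_idx = var_pos // 3
    start = max(0, codon_idx - flank_codons) * 3
    end = min(len(cds) // 3, codon_idx + flank_codons + 1) * 3
    return cds[start:end]

if __name__ == "__main__":
    cds = "ATGGCCAAATTTGGGTAA"          # toy CDS: M A K F G *
    window = codon_aligned_window(cds, var_pos=7, flank_codons=1)
    print(window, translate(window))     # in-frame window around the variant
    print(translate(cds[5:11]))          # fixed nucleotide flank, out of frame
```

In the toy example the codon-aligned window translates to the expected residues around the variant, while a window cut at an arbitrary nucleotide offset yields unrelated amino acids, which is the effect suspected in challenge 2.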

I would also like to use novel transcript data to predict novel proteins soon.

Thank you very much.

Workflow Advice • 253 views
This thread is not open. No new answers may be added