Question

Deleted: Input on Protein Prediction from NGS data?

4.1 years ago

Ared445 ▴ 60

I have RNA-seq and exome data from various tumor samples. This has been my workflow:

Align reads to the reference using STAR
Call SNPs and indels using Samtools -> VarScan2 (7,000-50,000 variants per sample)
Generate a tumor reference from the .vcf using GATK FastaAlternateReferenceMaker
Use Bedtools to extend the SNP and indel intervals and pull the sequences in FASTA format
Translate the FASTA sequences to protein

The two challenges I run into are: 1) The GATK tool alters the chromosome labels, so when I later pull sequences using the positions from the .vcf file, some percentage of variants is not retrieved. I have been trying to fix this manually. 2) I suspect I may be altering the protein sequence by adding nucleotides to each side of the interval without accounting for where the reading frame starts.
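To illustrate challenge 1, here is a minimal sketch of restoring the original chromosome labels in the altered FASTA. It assumes the altered reference keeps the contigs in the same order as the original reference, which is worth verifying (for example by comparing sequence lengths) before relying on it. The function name `restore_contig_names` and the example headers are hypothetical, not output copied from GATK.

```python
def restore_contig_names(fasta_lines, original_names):
    """Yield FASTA lines with each header replaced by the original contig name.

    Assumption: sequences appear in the same order as in the original reference.
    """
    index = 0
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if index >= len(original_names):
                raise ValueError("more sequences than original contig names")
            # Swap the altered label for the name used in the .vcf file
            yield ">" + original_names[index]
            index += 1
        else:
            yield line


# Toy example: made-up altered headers and the names expected by the .vcf
altered = [">1", "ACGT", "GGCC", ">2", "TTAA"]
original_names = ["chr1", "chr2"]
fixed = list(restore_contig_names(altered, original_names))
print("\n".join(fixed))
```

With matching labels, the intervals derived from the .vcf should resolve against the tumor reference without manual fixes.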

Could anyone suggest a more efficient way to retrieve nucleotide FASTA sequences for translation using the positions of the SNPs and indels? I am quite new to bioinformatics.
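To make the reading-frame concern concrete, here is a rough sketch in plain Python of snapping the extraction window to codon boundaries instead of extending by a fixed number of nucleotides. It assumes a single contiguous CDS on the forward strand, 0-based end-exclusive coordinates, and a SNP (an indel would shift the frame downstream of the variant). The names `translate` and `in_frame_window` and the toy sequence are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate codon by codon from position 0; a trailing partial codon is dropped."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def in_frame_window(cds_start, cds_end, variant_pos, flank_codons):
    """Return (start, end) around variant_pos, aligned to codons of the CDS."""
    codon_index = (variant_pos - cds_start) // 3
    start = cds_start + 3 * max(codon_index - flank_codons, 0)
    end = min(cds_start + 3 * (codon_index + flank_codons + 1), cds_end)
    return start, end


# Toy sequence: 2 nt of UTR, CDS ATG GCC AAG TTT GGA TAA, 2 nt of UTR
seq = "GG" + "ATGGCCAAGTTTGGATAA" + "CC"
cds_start, cds_end, variant_pos = 2, 20, 9

# Naive extension by 5 nt on each side starts mid-codon
naive = seq[variant_pos - 5:variant_pos + 6]
print(translate(naive))  # GQV (out of frame)

# Codon-aligned window with one flanking codon on each side
start, end = in_frame_window(cds_start, cds_end, variant_pos, flank_codons=1)
print(translate(seq[start:end]))  # AKF (matches the CDS translation)
```

The CDS start would have to come from a gene annotation; for spliced or reverse-strand transcripts the window would need to be computed in transcript coordinates rather than on the genome.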

I would also like to use novel transcript data to predict novel proteins soon.

Thank you very much.

Workflow Advice • 347 views
