How To Know The Amino Acid Switch That A Snp Caused
3
4
Entering edit mode
11.3 years ago
Abdel ▴ 150

Hello everybody ,exuse my horrible english.

I m writing a perl script to know what is the effect of snp on a given gene, my work plan was to extract all the exon sequences , translate them to amino acid ,concatenate , then doing the some thing with changing only the sequence of the exon where snp is located , finally to align the two amino acid sequence to have something like that...R490W...where R is the wild amino acid , 490 the postion an W the new amino acid.... My problem is that my obtained sequence is totally differente from those given on protein db (i used ncbi to verify), i know i m missing something in the translation step but i don t know what .

first part of the script (gene eg :MYBPC3):

use strict;

use warnings ;

use Bio::Das;

use Bio::Perl;

use Bio::DB::GenBank;

my $protein=""; my$das = Bio::Das->new(-source => 'http://genome.cse.ucsc.edu/cgi-bin/das/',-dsn=>'hg18');

my $snp=47310331; my$chr = "chr11";

my @exon_start=

(47309532,47309971,47310198,47310692,47310940,47311320,47311683,47312048,47313168,47314003,

47315517,47315816,47316646,47317450,47317777,47319129,47319264,47320117,47320704,47320956,4

7321147,47321618,47324333,47324753,47325548,47325777,47325983,47326550,47327900,47328140,47

328628,47329365,47330749);
my @exon_end=(47309843,47310008,47310385,47310829,47311100,47311460,47311879,47312137,47313336,47314138,

47315706,47315921,47316806,47317531,47317917,47319159,47319371,47320283,47320871,47321062,4

7321275,47321751,47324497,47324771,47325606,47325807,47326032,47326668,47328049,47328239,47

328742,47329632,47330829);

my $exon_count=33; for (my$i=0;$i<$exon_count;$i++){ my$segment = $das->segment("$chr\:$exon_start[$i]\,$exon_end[$i]");
my $dna =$segment->dna;

my $aa_seq = Bio::Perl::translate($dna);
$protein.=$aa_seq->seq;

}
print \$protein ;


My output:

VSSL*LHLSFIAQ*TLGRHSRPERPVPRHCFLRPPSFYPKDPGASFRSPVDQSVQHPLRTARQLPC*SPIAAQETHLSHIHPTVGRGFPNF

PPGSWHGAGIRLYPGHPQEPAWSLRPRTSRRHSHRASPCKLVALQT*MPPSKGQGFLISRVNTPCLLNMRKRASSPRSRPFLNQEILGPWG

YPGQHSRA*QCSPR*PSGSPGAG*SLGPRRSPGPYSWVAHR*CPGLGIKTGSLVVAALSLKPTIFWLKTRK**PLPMMSSGTTQWVRR*CS

KTVNHSPWSSCRLSALCTPRAPCCRHPGVASTPEPH*DPRRQSPGDPGEDLACPTTCSTSVALSSMFSMRTVTW*VPECTRRAARMNRMVS

VGLLRMLTSSPARGCPSLVQVT*GRGLPPGKG*EGSQAPRPSSEWSGAGAWAAEAVAVRRISCTVTGSVVVTGAPGPAILCARTRKSSRAP

VGRSFTSIDVCSVSPCRAATHSEQPSGQYSTL*PSRPPAPTRSGGRHLRETVVSETSSTARWVGSLGGPIGMKGWEAGLGLDMPMALTA*T

RISYTTPSIMRRAS*LSS*IRSKFSRIHR*LFFFLRSRM*PRMGCPPS*AGGSHCTVQESSPTLLILGAAGASGTPMTLTVRLTWSSPTGF

FTVTV*TPSSSFSAPSTVKMLRSLVVSTRTRPSVSHSPSCQTPTHRCHLCPLGHLGLAWLGPYSPA**PSARSQWEQGPQR*GRPDVAYFQ

ILI*DLTFEHRLVLCAHHQVCDALVHLQLLPCTMSSASAWPPLVHSA*CPASSSMASLMIRWCLCPSFLNRYLKVSSRVSSTPSFSPFDLR

PLLRYFTLKLHPLPHHHQLVLQGARDEHRGLPFTKSSVLHFSPPTTHW*AASSANEHWLMVRVRLAPMDSKMYLPAAHLDLLAILEPFDLS

VMVSQFHGQPDLVAFAHLVGRLQLLLKGPVLFFSSRLMPLSLFSMPRRSVTPYWKAMRSYSDGGACRRISHTSSSAGASSFESPRGPETLT

SFSAVS*SPESQCPHGYPDPPTTSSQARAAEG**EVQVSGAHGLPWTVRLKLEQSNLSLVDTSQR*LPVKAGWASVMCSSNR*TPCWRGRS

CRAAAGAGPPCCSGPPICP*TT*PQAASGGWRRPRGLRR*CCHPTVTSPSCGRITKRPMGSSGAPGVGPLRAAELDPLGLGALSPSSAAGA

GASPGAPVASAGAGAGASMGSAFPASMTLRSNLTLEEPAMTA*DPWSAGPTSRTVSVCRVPSVARPYLLLALMSLPPRCQRTFTPARSVSA

SNTAGLPAATSTDRGFLLKAEPGFFPGSGILRDVTPGTKQAQVTQRGT


the sequence provided by ncbi

translation="MPEPGKKPVSAFSKKPRSVEVAAGSPAVFEAETERAGVKVRWQR
MLAPAPAPAEATGAPGEAPAPAAELGESAPSPKGSSSAALNGPTPGAPDDPIGLFVMR
VYLFELHITDAQPAFTGSYRCEVSTKDKFDCSNFNLTVHEAMGTGDLDLLSAFRRTSL
AGGGRRISDSHEDTGILDFSSLLKKRDSFRTPRDSKLEAPAEEDVWEILRQAPPSEYE
EPPVLITRPLEDQLVMVGQRVEFECEVSEEGAQVKWLKDGVELTREETFKYRFKKDGQ
LSAKLHFMEVKIDFVPRQEPPKIHLDCPGRIPDTIVVVAGNKLRLDVPISGDPAPTVI
WQKAITQGNKAPARPAPDAPEDTGDSDEWVFDKKLLCETEGRVRVETTKDRSIFTVEG
AEKEDEGVYTVTVKNPVGEDQVNLTVKVIDVPDAPAAPKISNVGEDSCTVQWEPPAYD
GGQPILGYILERKKKKSYRWMRLNFDLIQELSHEARRMIEGVVYEMRVYAVNAIGMSR
PSPASQPFMPIGPPSEPTHLAVEDVSDTTVSLKWRPPERVGAGGLDGYSVEYCPEGCS
EWVAALQGLTEHTSILVKDLPTGARLLFRVRAHNMAGPGAPVTTTEPVTVQEILQRPR
LQLPRHLRQTIQKKVGEPVNLLIPFQGKPRPQVTWTKEGQPLAGEEVSIRNSPTDTIL
FIRAARRVHSGTYQVTVRIENMEDKATLVLQVVDKPSPPQDLRVTDAWGLNVALEWKP
FSDRAATTKEPVFIPRPGITYEPPNYKALDFSEAPSFTQPLVNRSVIAGYTAMLCCAV
RGSPKPKISWFKNGLDLGEDARFRMFSKQGVLTLEIRKPCPFDGGIYVCRATNLQGEA
RCECRLEVRVPQ"

snp perl translation exon • 3.7k views
2
Entering edit mode

If you prefer not to reinvent the wheel, you can use snpEff snpeff.sourceforge.net/ (and also ENSEMBL has a web based tool). Answering your question, you are missing 5'UTR, 3'UTR and strand information in your analysis

0
Entering edit mode

Hi Abdel. There are formatting options that you should use so that your question, and mostly your code, can be readable. This will ensure that more people are interested in helping you. Cheers

0
Entering edit mode

Merci bcp Pierre :)

0
Entering edit mode

geneName name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds

0
Entering edit mode

If you prefer not to reinvent the wheel, you can use snpEff http://snpeff.sourceforge.net/ ENSEMBL has a web based tool.

Answering your question, you are missing 5'UTR, 3'UTR and strand information in your analysis.

4
Entering edit mode
11.3 years ago

Rather than exons, I would use the mRNA sequences, and as Pierre has stated, know where to begin and end your translation - at the appropriate ATG (AUG) and termination codons.

Now, let's bring up the issue of linkage disequilibrium (LD). Let's say you have two SNPs in the mRNA of interest and both alter the translation (protein). To make things easy, we'll say that at SNP 1 (pos. 100 in the protein), 60% of the alleles encode isoleucine (Ile) and 40% encode threonine (Thr, changing ATT to ACT). SNP 2 is similar (at pos. 210 in the protein), where 60% of the alleles encode phenylalanine (Phe) and 40% encode tyrosine (Tyr, changing TTT to TAT). So, there are four different protein sequences, right? One with Ile @ 100 and Phe @ 210, one with Ile @ 100 and Tyr @ 210, one with Thr @ 100 and Phe @ 210 and finally one version with Thr @ 100 and Tyr @ 210? Maybe not. If these two variants are in very strong LD, where there is very little to no recombination between them, then we'd expect only two different versions of the translated protein to exist in a population. Those would be Ile @ 100 with Phe @ 210, and Thr @ 100 with Tyr @ 210. The allele frequencies and degree of LD tell me this. (This is a topic covered in other BioStar questions and responses.)

So, not all combinations of alleles seen in a database will translate into protein sequences likely encountered in a population.

3
Entering edit mode
11.3 years ago

using the exon boundaries is not enough.

You need to know where the translation stops and starts (that is too say, where is the ATG in your mRNA) .

As far as I understand your code, you're starting your translation from the 5' UTR .

And beware the strand of your exons. MYBPC3 is on the reverse strand.

0
Entering edit mode

In fact I'm using the ucsc hg18 refFlat table to know if my snp is exonic or not , if yes I can have this kin of information about the touched gene : http://genome.csdb.cn/cgi-bin/hgTables?hgsid=5&hgta_doSchemaDb=hg18&hgta_doSchemaTable=refFlat

0
Entering edit mode

geneName name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds

2
Entering edit mode
11.3 years ago
Kevin ▴ 640

If you are doing this for a whole bunch of SNPs, you may wish to try annovar

0
Entering edit mode

+1, annovar is great