Hi everyone,
I am working with a triple negative breast cancer cell line and am interested in using associated SNV data. This is my first time working with such data, so I wanted to see if I can get any guidance.
First, is COSMIC for the most part distinct from dbSNP? There are some SNVs that have COSMIC IDs, but no corresponding dbSNP ID.
Second, I want to make sure that I am being accurate when processing VCF files with VEP. It appears that consistent genome assembly is crucial to accurate results. For example, a specific TP53 mutation with the same transcript ID occurs at 17:7577099..7577099 in GRCh37 but 17:7673781..7673781 in GRCh38.
Lastly, I have used the ProteinSeqs plugin with VEP to identify mutated and reference protein sequences. However, for many entries, the "mutated" protein sequence exactly matches the reference.
Thanks!
Welcome to the world of bioinformatics, where there is an ever-growing number of databases with (hopefully) well-maintained cross-links to one another, none of which uniquely contains or identifies exactly the entity you need. There is considerable overlap between COSMIC and dbSNP. Whether a COSMIC variant also has a dbSNP ID depends largely on whether the mutation has only been observed as somatic or as both somatic and germline (and dbSNP doesn't restrict itself to germline variants either).
That's another challenge in the field: the number of reference genome assemblies and where various entities map on them. The field has mostly moved on to GRCh38, but GRCh37 is still widely used. Coordinates differ between the two, as in your TP53 example, but it should not make much of a difference as long as every annotation step matches the genome version you wish to use.
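One quick sanity check before running VEP is to see which assembly the VCF itself claims to use. Here is a minimal Python sketch, assuming your VCF header carries a `##reference=` line and/or `assembly=` fields on its `##contig=` lines (many do, but not all). The function name `assembly_hints` and the example header are made up for illustration.

```python
import re

def assembly_hints(vcf_lines):
    """Collect assembly-related metadata from the '##' header lines of a VCF.

    Hypothetical helper: it only reports what the header says and does
    not verify the coordinates against a reference genome.
    """
    hints = {"reference": None, "contig_assemblies": set()}
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # metadata lines end at the '#CHROM' column header
        line = line.strip()
        if line.startswith("##reference="):
            hints["reference"] = line.split("=", 1)[1]
        elif line.startswith("##contig="):
            match = re.search(r"assembly=([^,>]+)", line)
            if match:
                hints["contig_assemblies"].add(match.group(1))
    return hints

# Made-up header for illustration; with a real file, pass an open file handle.
example_header = [
    "##fileformat=VCFv4.2\n",
    "##reference=file:///refs/GRCh38.fa\n",
    "##contig=<ID=17,length=83257441,assembly=GRCh38>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
hints = assembly_hints(example_header)
print(hints["reference"], hints["contig_assemblies"])
```

If the header gives no hint at all, you will need to find out from whoever produced the file which assembly was used, and then make sure the VEP cache and the `--assembly` option you run with agree with it.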
Could they be synonymous changes, perhaps? That is, a change in the nucleotide/mRNA sequence that yields the same amino acid, and therefore the same protein sequence, owing to codon degeneracy?
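To make the codon degeneracy point concrete, here is a small Python sketch using the standard genetic code. The names `CODON_TABLE` and `is_synonymous` are just for illustration, and the codons are generic examples rather than anything taken from your data.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def is_synonymous(ref_codon, alt_codon):
    """Return True if both codons translate to the same amino acid."""
    return CODON_TABLE[ref_codon.upper()] == CODON_TABLE[alt_codon.upper()]

# CGT and CGC both encode arginine (R): the protein is unchanged.
print(is_synonymous("CGT", "CGC"))  # True
# CGG encodes arginine (R) but TGG encodes tryptophan (W): a missense change.
print(is_synonymous("CGG", "TGG"))  # False
```

In practice you don't need to do this by hand: the Consequence field in the VEP output should say `synonymous_variant` for such entries, so filtering on that column is a quick way to test this guess.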