I am trying to make sense of and clean up mutation data downloaded from COSMIC's FTP server. Regarding sample identification, their website says:
[5:E] Sample name,Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.
Does anybody know of a way to find out whether a sample appears multiple times in the database under different IDs? As far as I can tell there isn't one, and there also seems to be no way to estimate roughly how many samples might be duplicates.
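The only heuristic I can come up with is to look for sample names that are attached to more than one sample ID. Below is a rough sketch of that idea, not something I know to be reliable. The column names `Sample name` and `ID_sample` are my assumption about the export header, the function `find_shared_names` is just a name I made up, and the demo rows are invented rather than real COSMIC data.

```python
import csv
import io
from collections import defaultdict


def find_shared_names(rows, name_col="Sample name", id_col="ID_sample"):
    """Return sample names that occur with more than one sample ID.

    `rows` is any iterable of dicts (e.g. a csv.DictReader).
    The default column names are assumptions about the export header.
    """
    ids_by_name = defaultdict(set)
    for row in rows:
        name = row[name_col].strip()
        if name:  # skip rows without a sample name
            ids_by_name[name].add(row[id_col].strip())
    # Keep only names linked to at least two distinct IDs.
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


# Tiny invented table standing in for the real tab-separated export.
demo = io.StringIO(
    "Sample name\tID_sample\tID_tumour\n"
    "HeLa\t1001\t501\n"
    "HeLa\t1002\t502\n"
    "HeLa\t1001\t501\n"
    "PD1234a\t2001\t601\n"
)
candidates = find_shared_names(csv.DictReader(demo, delimiter="\t"))
print(candidates)
```

This would only give candidates, though. It misses the same sample entered under different names, and it could wrongly flag unrelated samples whose anonymised numbers happen to coincide across papers.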
Any suggestions would be great. Thanks!