Quality scores in ICGC simple somatic mutation file
6.4 years ago
tralynca ▴ 50

Hi all,

I've recently downloaded the simple somatic mutation (SSM) file for clear cell renal cell carcinoma (ccRCC) from the ICGC Data Repository, but I've been having some trouble interpreting the quality score column.

Below is a snippet of my data (.tsv file):

chromosome    chromosome_start    chromosome_end    chromosome_strand     mutation_type    reference_genome_allele    mutated_from_allele    mutated_to_allele    quality_score    probability    total_read_count

 1    224822287    224822287    1       single base substitution    T    T    G    223        46    26
 1    224822287    224822287    1       single base substitution    T    T    G    223        46    26

However, I'm not sure why the quality score is so high. For every entry the quality score is between 100 and 223. Some have said that Phred scores can in fact range from 0 to infinity (http://gatkforums.broadinstitute.org/discussion/4260/how-should-i-interpret-phred-scaled-quality-scores), while others say that scores in the 200 range probably mean that the signal was too low (http://seqanswers.com/forums/showthread.php?t=23770).
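
For reference, the arithmetic behind a Phred-scaled score is Q = -10 * log10(P), where P is the probability that the call is wrong. The sketch below only illustrates that conversion. It assumes the `quality_score` column is Phred-scaled, which the thread does not confirm, and the function names are made up for this example.

```python
import math

def phred_to_error_prob(q):
    """Probability that a call is wrong, given a Phred-scaled score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Scores from the usual sequencing range, plus the range seen in this file.
# Assumption: the file's scores really are Phred-scaled.
for q in (20, 30, 100, 223):
    print(f"Q{q}: error probability ~ {phred_to_error_prob(q):.3g}")
```

Under that reading, a score of 223 corresponds to an error probability of roughly 5e-23. A number that small most likely reflects the caller's statistical model rather than a realistic false-positive rate, so it is probably better treated as "passed the caller's threshold comfortably" than as a literal probability.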

The ICGC website describes the quality score column as that of the mutation call, not of the alignment etc. (http://docs.icgc.org/simple-somatic-mutations-ssm-primary-analysis-file-p).

The rest of the columns say that samtools pileup was used for the raw variant calls, along with other analysis tools such as GATK, Picard, VCFtools etc. For all calls, no verification with an orthogonal platform or biological validation was carried out.

Can anyone confirm whether this does in fact indicate high quality, or whether I should be looking out for something else?

Thanks in advance,

 

Tracey

quality score simple somatic mutation ICGC • 2.0k views
6.4 years ago
Ying W ★ 4.1k

Could you link to the ccRCC SSM file? I looked through the SSM file here: https://dcc.icgc.org/repository/release_18/Projects/RECA-CN and it looks like they used VarScan, but it doesn't show the quality scores that you pasted. The quality scores generated by VarScan are described in its documentation (http://varscan.sourceforge.net/somatic-calling.html#somatic-output). There was also a conversion process from VarScan output to VCF.


Thank you for your response, Ying, but I used the EU/FR data set since they carried out whole genome sequencing (https://dcc.icgc.org/repository/current/Projects/RECA-EU). They used samtools mpileup for variant calling. Thank you for going to the trouble of pasting the link to the VarScan documentation.

If you also have some experience with samtools, I would be happy to hear your thoughts on the quality scores.

 


To be honest, I'm not sure how samtools pileup/mpileup outputs quality values, or which one is being used for the SSM file. There are multiple posts on this website asking about samtools pileup/mpileup and quality values. To go back to your original question: I would assume that the high quality values mean the calls are good enough for your purposes, since they are being distributed; the lower-quality variants were probably filtered out. If you don't trust them, you would have to get the raw data and do the variant calling yourself (which requires authorization, since tumor/normal BAM files are protected patient data). I was under the impression that the data on the ICGC website will eventually include variant calls normalized through the same pipeline.
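
One rough way to probe the filtering guess is to look at the lowest score present in the file: a sharp floor (for example, nothing below 100) would be consistent with a cutoff having been applied before release. The following is a minimal sketch using only the standard library. The column name `quality_score` comes from the snippet in the question; the function name and the second data row are made up for illustration.

```python
import csv
import io

# Illustrative sample only: the first row mirrors the snippet in the question,
# the second row is hypothetical so the summary has a range to report.
SAMPLE_TSV = (
    "chromosome\tchromosome_start\tmutated_from_allele\tmutated_to_allele\tquality_score\n"
    "1\t224822287\tT\tG\t223\n"
    "1\t1000\tA\tC\t100\n"
)

def quality_summary(handle, column="quality_score"):
    """Return (count, min, max) of a numeric column in a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    scores = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:  # skip empty cells
            scores.append(float(value))
    if not scores:
        return 0, None, None
    return len(scores), min(scores), max(scores)

summary = quality_summary(io.StringIO(SAMPLE_TSV))
print(summary)
```

For the real download, the same function could be called on an opened file, e.g. `with open(path, newline="") as fh: quality_summary(fh)`, where `path` is wherever the SSM file was saved. A hard minimum would only suggest a filter; it would not say which of the pipeline's tools produced the score.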


Hi Ying,

I've gone through the questions about samtools/mpileup but none of them seem to address the issue of the quality score. I did write to ICGC about two weeks ago and again today. I'm awaiting a response. Thank you.
