Question: High percentage of UTR mutations in RNAseq
4.6 years ago by
United States
CHANG40 wrote:

I noticed a high percentage of 3'UTR mutations in human tumor RNA-seq. Here is a paper (Fig. 3) which shows < 10% UTR mutations in its RNA-seq data.

Is such a high percentage of UTR mutations unusual? What could be some explanations? I greatly appreciate your feedback.

Here are my methods.
Sample Prep
Tumor sample RNA extracted with TRIzol (Invitrogen) and the RNeasy kit (Qiagen); sequenced on Illumina HiSeq 2000, 75 bp paired-end
GATK RNAseq Best Practice - STAR 2-pass with Gencode hg19 transcripts, Split'N'Trim, indel realignment, base recalibration.
Mutation calling
Mutect2 with default params, keeping only "PASS" mutations, annotated with SnpEff

Mutation Distributions from 32 tumor samples
3'Flank 10.20%
3'UTR 42.79%
5'Flank 1.60%
5'UTR 1.49%
Frame_Shift_Del 0.29%
Frame_Shift_Ins 3.99%
IGR 0.43%
In_Frame_Del 0.05%
In_Frame_Ins 0.16%
Intron 12.62%
Missense_Mutation 16.36%
Nonsense_Mutation 0.24%
Nonstop_Mutation 0.05%
Silent 7.86%
Splice_Site 0.37%
Targeted_Region 1.49%
Translation_Start_Site 0.03%

4.6 years ago by
Amitm2.0k wrote:


As far as I understand, the number of reads mapping to UTRs and introns can vary depending not only on the origin of the biological material, but also on the condition or treatment (if any), the library prep quality, and RNA integrity.

What you have done now is take all the variants identified and bin them into categories (coding, UTR, intronic, etc.). I would suggest first finding the callable bases for each class. Maybe use the GATK CallableLoci tool with separate BED files for CDS, intronic, and UTR regions to quantify the number of bases sequenced in each. Then calculate, for each category, the mutation load relative to the number of bases sequenced.

If for any reason, there were (comparatively) more reads coming from UTR region in your library prep than the paper quoted, you might see more mut. in UTR. But normalize by callable bases and things should clear up.
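One way to get the per-category denominator is to intersect the callable intervals with each category's BED intervals and count the shared bases. The sketch below is only an illustration in plain Python: the function names and the toy intervals are hypothetical, and it assumes both inputs have already been parsed into BED-style half-open `(chrom, start, end)` tuples.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping half-open (chrom, start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1]:
                # Overlapping or touching: extend the current interval.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged

def overlap_bases(callable_intervals, category_intervals):
    """Count bases that are both callable and inside the category (e.g. UTR)."""
    a_by_chrom = merge_intervals(callable_intervals)
    b_by_chrom = merge_intervals(category_intervals)
    total = 0
    for chrom, a in a_by_chrom.items():
        b = b_by_chrom.get(chrom, [])
        i = j = 0
        # Two-pointer sweep over the two sorted, non-overlapping lists.
        while i < len(a) and j < len(b):
            lo = max(a[i][0], b[j][0])
            hi = min(a[i][1], b[j][1])
            if lo < hi:
                total += hi - lo
            # Advance whichever interval ends first.
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
    return total

# Hypothetical toy intervals, not taken from the data in this thread.
callable_toy = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
utr_toy = [("chr1", 180, 250), ("chr1", 290, 400)]
utr_callable_bases = overlap_bases(callable_toy, utr_toy)
```

Running the same function once per category (CDS, UTR, intron) would give the callable base counts to divide the mutation counts by.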

On a side note, variants seen in RNA-seq are a mixture of a) expressed SNVs and b) RNA editing.

The latter is known to affect predominantly UTR regions. So it makes sense to find more variants in UTRs. But you should normalize to see the actual mutation load.

ADD COMMENTlink written 4.6 years ago by Amitm2.0k

RNA-editing isn't that prevalent in humans though. Some of the early papers from a few years ago that showed high levels of RNA editing were later shown to be very flawed. A high proportion of variants seen in RNA-seq are false positives and artefacts introduced during RNA -> cDNA conversion and PCR.

ADD REPLYlink written 4.6 years ago by DG7.2k

Agreed that RNA-editing detection from sequencing data is fraught with false positives.

But the main point of the question here was to determine the distribution of RNA-seq variants across CDS, UTR and intronic regions. In this context, normalization by callable bases is imperative.

ADD REPLYlink written 4.6 years ago by Amitm2.0k

Absolutely agree with you on that

ADD REPLYlink written 4.6 years ago by DG7.2k

Picard's RnaSeqMetrics shows that 43% of my reads are in coding regions and 27% in UTRs. It doesn't seem like mapping explains the majority of the mutations.

I'll try CallableLoci to see if that helps.

Do you think filtering for false calls in duplicated regions, in homopolymeric regions, or close to splice junctions would be helpful?

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by CHANG40

I intersected my mutation list with regions of callable loci. The following are the percentages of mutations that are callable. The percentages are high across mutation types.
3'UTR 99.37%
Missense_Mutation 98.91%
Silent 99.50%
Intron 94.8%

Can you elaborate on how to normalize the number of mutations by callable bases?

ADD REPLYlink written 4.6 years ago by CHANG40

Hi, the mutations you have called are from the GATK pipeline (MuTect), and CallableLoci is again from GATK. So of course all mutations identified would have been callable in the first place, and hence picked up by MuTect. Hence ~99% irrespective of UTR or missense.

But anyway, the Picard metrics you posted say that 43% of aligned bases in the sample are from CDS and 27% from UTR. What I meant by using CallableLoci was to use 3 separate BEDs, one each for CDS, UTR and introns, and calculate the effective length in each case. The idea being that there were probably more aligned bases in the UTR set.

As per Picard it doesn't seem to be so. Maybe what you are seeing is an actual biological effect. Not quite sure here.
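To make the normalization concrete: divide the number of mutations in each category by the number of callable bases in that category, and compare the resulting rates instead of the raw percentages. A minimal sketch follows; the function name is hypothetical and the counts and base totals are made-up placeholders, not values from this thread.

```python
def mutations_per_megabase(mutation_counts, callable_bases):
    """Return mutations per million callable bases for each category."""
    rates = {}
    for category, count in mutation_counts.items():
        bases = callable_bases.get(category, 0)
        if bases <= 0:
            # No callable sequence in this category: the rate is undefined.
            continue
        rates[category] = count * 1_000_000 / bases
    return rates

# Hypothetical placeholder numbers for illustration only.
example_counts = {"CDS": 300, "3'UTR": 450, "Intron": 120}
example_bases = {"CDS": 30_000_000, "3'UTR": 15_000_000, "Intron": 60_000_000}
example_rates = mutations_per_megabase(example_counts, example_bases)
```

If the 3'UTR rate stays well above the CDS rate after this step, the excess is not explained by how much UTR sequence was covered.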

ADD REPLYlink written 4.6 years ago by Amitm2.0k

I followed the paper's advice (linked in my post) and removed mismatches within the first 6 bases of 5ʹ read ends, which arise from random-hexamer priming. It cut down a good proportion of the 3' UTR mutations.

Before removing mutations in first 6bp of 5' read
3'UTR 1745 36.18%
Intron 570 11.82%
Missense_Mutation 1195 24.78%
Silent 604 12.52%

After removing mutations in first 6 bp of 5' read
3'UTR 477 18.72%
Intron 308 12.09%
Missense_Mutation 697 27.35%
Silent 364 14.29%
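The read-end filter above could be sketched roughly as follows. This is an assumption about one way to implement it, not the paper's or the poster's actual code: the function names are hypothetical, and each supporting observation is assumed to be available as `(query_pos, read_length, is_reverse)`, where `query_pos` is the 0-based offset of the mismatch in the read sequence as stored in the alignment. Reverse-strand reads are stored reverse-complemented, so their original 5' end sits at the right end of the stored sequence.

```python
def in_first_bases_of_5prime_end(query_pos, read_length, is_reverse, n_bases=6):
    """True if the mismatch lies within the first n_bases of the read's original 5' end."""
    if is_reverse:
        # Stored sequence is reverse-complemented: the 5' end is on the right.
        return query_pos >= read_length - n_bases
    return query_pos < n_bases

def supporting_reads_after_filter(observations, n_bases=6):
    """Count supporting reads whose mismatch is outside the 5' read-end window."""
    return sum(
        1
        for query_pos, read_length, is_reverse in observations
        if not in_first_bases_of_5prime_end(query_pos, read_length, is_reverse, n_bases)
    )

# Hypothetical observations for one candidate variant (75 bp reads).
toy_observations = [
    (2, 75, False),   # forward read, mismatch at offset 2: inside the window
    (40, 75, False),  # forward read, mid-read: kept
    (72, 75, True),   # reverse read, offset 72 is near its 5' end: inside the window
    (10, 75, True),   # reverse read, offset 10 is near its 3' end: kept
]
remaining_support = supporting_reads_after_filter(toy_observations)
```

A variant would then be dropped when too few supporting reads remain; the threshold is a judgment call. Note that hard-clipped bases are not part of the stored sequence, so offsets on clipped reads would need extra care.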

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by CHANG40