Question: High Duplication and low coverage RNASeq?
0
gravatar for Sharon
3 months ago by
Sharon320
Sharon320 wrote:

Hi All

If I have more than 40% are marked duplicate in my RNASeq and discarded by MuTect2 and another 40% not primary alignments.
This is how the duplication looks from fastqc: https://ibb.co/hEPoJH The figure says 43% seq will remain if deduplicated.

I make a coverage profile and it is here, the genome hg38 assembly is covered only by ~17%. https://ibb.co/d3pr5x This is a sample of the distribution I draw its curve:

0 2644954237
1 86037238
2 173531033
3 38889212
4 77790438
5 23056365
6 37055662
7 14282143
8 19123921
9 9226812
10 10713451

All these bases (2644954237) are covered by zero reads ! I mean I count how many reads cover each base in the reference. I feel this is very low coverage and very high duplication, is this normal with RNASeq or can be explained?

ADD COMMENTlink modified 3 months ago by Devon Ryan80k • written 3 months ago by Sharon320

Have you read this post from author of FastQC?

If data is from a patterned FC then you should start looking to see if you can discern if the duplication is optical or due to excessive amplification. You can use clumpify.sh from BBMap as noted in this post for that: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

genome hg38 assembly is covered only by ~17%

I don't remember what part of genome is coding but since this RNAseq that number may be appropriate.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax49k

No, I did not, thanks for sharing with me, will try BBMap now. Thanks a lot GenoMax.

ADD REPLYlink written 3 months ago by Sharon320

Remember to use original raw data when you try to estimate duplication sources. Trimmed/otherwise manipulated will throw off your analysis.

ADD REPLYlink written 3 months ago by genomax49k

Ok, thanks. You also mean 17% coverage only in RNASeq is normal?

ADD REPLYlink written 3 months ago by Sharon320

Can you confirm that by your observation 17% of the whole genome (out of ~3 gb) is covered by at least one read in this dataset?

ADD REPLYlink written 3 months ago by genomax49k

Yes, 17% of the bases of the genome is covered by at least one read.

ADD REPLYlink written 3 months ago by Sharon320
2
gravatar for Devon Ryan
3 months ago by
Devon Ryan80k
Freiburg, Germany
Devon Ryan80k wrote:

17% coverage actually seems a bit high, though I guess I've never actually checked this. Keep in mind that protein-coding genes only occupy ~2% of the genome and they're most of what you're sequencing (presuming you're doing standard poly-A enrichment). 40% non-primary alignments is high for RNAseq data, at least for what I deal with. A high duplication rate like you're seeing is expected. After all, you're going to have VERY high coverage of highly expressed genes and many of those reads will be falsely labeled as PCR duplicates. For what it's worth, this is why one doesn't mark duplicates in RNAseq data.

Be careful when doing variant calling from RNAseq data. Variant callers aren't typically written to consider things like allele-specific expression, >1000x differences in expected coverage, or RNA editing.

ADD COMMENTlink written 3 months ago by Devon Ryan80k
1

1-2% coding genes and 8-9% regulatory content (not sure if all of this would be expressed) seems to be the recent estimate I could find.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax49k

Thanks GenoMax so much. I think it becomes more clear now :)

ADD REPLYlink modified 3 months ago • written 3 months ago by Sharon320

Thanks Devon so much. Appreciated. Do you mean I should not remove duplicates in variant calling RNASeq? Also Do you have any recommendation on how "careful", I mean I am following GATK pipleine for somatic mutations, do u recommend something else? I am very new to this. Thanks

ADD REPLYlink written 3 months ago by Sharon320
1

I think GATK had a beta RNAseq variant pipeline for a while, have a look at that.

ADD REPLYlink written 3 months ago by Devon Ryan80k
1

While you should not remove PCR duplicates if this is data from HiSeq 3000/4000 or NovaSeq you should check for optical duplicates. Those may need to be removed, if you have a lot of them.

ADD REPLYlink written 3 months ago by genomax49k

Thanks Devon and GenoMax so much, very much appreciated :)

ADD REPLYlink written 3 months ago by Sharon320
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1847 users visited in the last hour