Why are an extremely large number of reads mapped to only a few chromosomes in RNA-seq data?
3 months ago
biock ▴ 60

Hi, I'm analyzing several RNA-seq samples downloaded from ENCODE. In some BAM files, an extremely large number of reads map to just a few chromosomes or unlocalized/unplaced scaffolds. For example, in ENCFF754JEN there are about 68M, 35M, and 34M reads on chr21, chr22_KI270733v1_random, and chrUn_GL000220v1, respectively, while chr1, the largest chromosome, has only about 19M reads.

Why are the reads distributed so unevenly across chromosomes? What should I do to avoid getting biased results from these BAM files? Thanks!

$ samtools idxstats ENCFF754JEN.bam | cut -f 1,3 | sort -k2,2nr
chr21   68288192
chr22_KI270733v1_random 35290528
chrUn_GL000220v1    33978632
chr1    18807868
chr6    11533310
chr11   11474490
chr19   11069604
chr12   9934778
chr2    9306572
chr17   9227112
chr7    8790550
chr16   8079172
ENCODE RNA-seq • 316 views
3 months ago

My first thought is that you have rRNA contamination. Chr21 and chr22 both carry rRNA genes. For what it's worth, they are also highly heterochromatic and contain a lot of repetitive DNA. Retrotransposons such as Alu elements can be transcribed, so they could conceivably be in the RNA population as well.

However, the first thing to check is rRNA. Normally that gets mostly eliminated during library preparation, but if there was a problem at that step then most of your reads would be expected to be rRNA, just because it constitutes the overwhelming majority of the total RNA population.
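One rough way to put a number on this is to sum the mapped reads on the suspect contigs and compare them with the total. The sketch below is only an illustration: `contig_fraction` is a hypothetical helper name, and it assumes its input is the tab-separated output of `samtools idxstats` (name, length, mapped reads, unmapped reads).

```shell
# Hypothetical helper: share of mapped reads that fall on the named contigs.
# Reads `samtools idxstats` output on stdin; contig names are the arguments.
contig_fraction() {
    awk -F '\t' -v names="$*" '
        BEGIN {
            # Build a lookup table of the contigs of interest.
            n = split(names, list, " ")
            for (i = 1; i <= n; i++) wanted[list[i]] = 1
        }
        $1 != "*" {
            # Column 3 is the number of mapped reads on this contig.
            total += $3
            if ($1 in wanted) hits += $3
        }
        END {
            # Print: reads on selected contigs, all mapped reads, percentage.
            if (total > 0) printf "%.0f\t%.0f\t%.1f%%\n", hits, total, 100 * hits / total
        }
    '
}
```

For the file in the question it could be called as `samtools idxstats ENCFF754JEN.bam | contig_fraction chr22_KI270733v1_random chrUn_GL000220v1`. Whole-contig counts are only a first indicator: chr21 also carries ordinary genes, so confirming rRNA there would need a look at where on the chromosome the reads pile up.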

If it's not rRNA I'd be interested in knowing what it is.


chrUn_GL000220v1 contains a complete 45S rRNA gene, so I would guess you are right on point.

