Question

Why are an extremely large number of reads mapped to only a few chromosomes in RNA-seq data?

18 months ago

biock ▴ 60

Hi, I'm analyzing several RNA-seq samples downloaded from ENCODE. In some BAM files, I found that an extremely large number of reads were mapped to a few chromosomes or scaffolds. For example, in ENCFF754JEN there are about 68M, 35M, and 34M reads on chr21, chr22_KI270733v1_random, and chrUn_GL000220v1, respectively, while the largest chromosome, chr1, has only about 19M reads.

Why are the reads distributed so unevenly across chromosomes? What should I do to avoid getting biased results from these BAM files? Thanks!

$ samtools idxstats ENCFF754JEN.bam | cut -f 1,3 | sort -k2,2nr
chr21   68288192
chr22_KI270733v1_random 35290528
chrUn_GL000220v1    33978632
chr1    18807868
chr6    11533310
chr11   11474490
chr19   11069604
chr12   9934778
chr2    9306572
chr17   9227112
chr7    8790550
chr16   8079172
...
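
To put these counts in proportion, the same `samtools idxstats` output can be converted into each contig's share of all mapped alignments. The following is a minimal sketch, not part of the original command: `contig_fractions` is a hypothetical helper name, and it assumes the standard four-column, tab-separated idxstats layout (name, length, mapped, unmapped).

```shell
# Hypothetical helper: read `samtools idxstats` output on stdin and print
# contig name, mapped count, and percentage of all mapped alignments,
# sorted from the largest share to the smallest.
contig_fractions() {
    awk -F'\t' '
        { name[NR] = $1; mapped[NR] = $3; total += $3 }
        END {
            if (total == 0) exit
            for (i = 1; i <= NR; i++)
                printf "%s\t%s\t%.2f\n", name[i], mapped[i], 100 * mapped[i] / total
        }
    ' | LC_ALL=C sort -t "$(printf '\t')" -k3,3nr
}

# Example usage (requires an indexed BAM):
#   samtools idxstats ENCFF754JEN.bam | contig_fractions | head
```

Note that idxstats counts alignment records, so reads with several reported alignments may contribute more than once.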

ENCODE RNA-seq • 806 views

updated 18 months ago by benformatics 3.9k • written 18 months ago by biock ▴ 60

score 4 · Answer 1 · 2022-10-25

My first thought is that you have rRNA contamination. Chr21 and chr22 both carry rRNA genes. For what it's worth, they are also highly heterochromatic and contain a lot of repetitive DNA. Retrotransposons such as Alu elements can be transcribed, so they could conceivably be present in the RNA population.

However, the first thing to check is rRNA. Normally it is mostly eliminated during library preparation, but if something went wrong at that step, most of your reads would be expected to be rRNA, simply because rRNA makes up the overwhelming majority of total RNA.
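
One way to run that check is to count the alignments that overlap annotated rRNA genes and compare the number with the total. The sketch below makes several assumptions: the annotation is a GENCODE-style GTF whose attribute column contains `gene_type "rRNA"` or `gene_type "Mt_rRNA"`, the GTF uses the same contig names as the BAM, and `gtf_rrna_to_bed` and `annotation.gtf` are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: read a GENCODE-style GTF on stdin and print a
# three-column BED (chrom, start, end) of genes annotated as rRNA.
gtf_rrna_to_bed() {
    awk -F'\t' '
        $0 !~ /^#/ && $3 == "gene" && $9 ~ /gene_type "(rRNA|Mt_rRNA)"/ {
            # GTF coordinates are 1-based inclusive; BED is 0-based half-open
            printf "%s\t%d\t%d\n", $1, $4 - 1, $5
        }
    '
}

# Example usage (file names are placeholders):
#   gtf_rrna_to_bed < annotation.gtf > rRNA.bed
#   samtools view -c -L rRNA.bed ENCFF754JEN.bam   # alignments overlapping rRNA genes
#   samtools view -c ENCFF754JEN.bam               # all alignments, for comparison
```

If the first count is a large fraction of the second, rRNA is the likely explanation. Both counts are alignment records rather than unique reads, so multi-mapped reads can inflate them.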

If it's not rRNA I'd be interested in knowing what it is.