How to interpret one region of gene with much higher RNASeq coverage than other regions?
3.1 years ago
CephBirk ▴ 20

Hello all,

I have a 3 kb gene sequence to which I am aligning my 150 bp RNA-Seq reads with bowtie2. There are no known introns in the gene. The first 90% of the sequence has relatively consistent coverage of 50-200 reads, but the last 300 bp or so has about 10x the coverage of the rest of the gene.

My first suspicion was that this sequence may be duplicated elsewhere in the genome, so that reads from another genomic region are spuriously aligning to the gene I'm looking at. However, BLASTing this 300 bp sequence against the genome or against NCBI's full database returns matches only to my gene of interest.

The full gene has 41% GC, while the last 300 bp have 36% GC. That doesn't seem different enough to cause such an effect...
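
For anyone who wants to repeat the GC comparison in windows along the gene rather than only whole gene vs. last 300 bp, here is a minimal Python sketch. The function names, the window and step sizes, and the toy sequence are illustrative assumptions and are not taken from the original analysis.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous A/C/G/T bases (N and others are ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int = 300, step: int = 100):
    """Return a list of (start, gc_fraction) for each full window along seq."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence only; substitute the real 3 kb gene sequence here.
    toy_gene = "ATGC" * 200 + "ATAT" * 50 + "GCAT" * 25
    for start, gc in windowed_gc(toy_gene, window=300, step=100):
        print(f"{start}\t{gc:.2f}")
```

If the GC profile is flat apart from a modest dip at the 3' end, GC bias alone is an unlikely explanation for a 10x jump in coverage.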

What are other likely explanations when you see this kind of heterogeneity in coverage?

Looking forward to learning from you all.

RNA-Seq • 780 views

What library kit was used, and do you see this bias for every gene?

The library was poly-A selected, and the RIN value was only 4. So is this degradation?

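That RIN value is really low, so I suspect that the RNA is fairly degraded. If it is, and you are poly-A selecting, you may be losing the original 5' portion of many RNA molecules: only fragments that still carry the poly-A tail are captured, so coverage piles up toward the 3' end of each transcript.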

A good way to check this would be to align your reads to the entire genome as GenoMax suggested (which is good practice anyway), and then check whether this pattern persists for other genes.
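
One way this check could be sketched in Python: for each gene, divide the mean depth over the 3' tail by the mean depth over the rest of the transcript, then look at how the ratio is distributed across genes. The function name, the 10% tail fraction, and the toy depth values below are assumptions for illustration; real per-base depths would come from the genome alignment (for example, parsed from a per-position depth table).

```python
from statistics import mean, median


def three_prime_ratio(depths, tail_fraction=0.1):
    """Mean depth of the 3' tail divided by mean depth of the rest.

    depths: per-base depths ordered 5'->3' along the transcript
            (reverse the list first for genes on the minus strand).
    """
    n_tail = max(1, int(len(depths) * tail_fraction))
    if len(depths) <= n_tail:
        raise ValueError("transcript too short for the chosen tail fraction")
    body, tail = depths[:-n_tail], depths[-n_tail:]
    body_mean = mean(body)
    if body_mean == 0:
        # No coverage outside the tail; the ratio is undefined or unbounded.
        return float("inf") if mean(tail) > 0 else float("nan")
    return mean(tail) / body_mean


if __name__ == "__main__":
    # Hypothetical per-base depths for three genes (toy numbers).
    coverage = {
        "geneA": [100] * 9 + [1000],
        "geneB": [40] * 18 + [350, 420],
        "geneC": [60] * 10,
    }
    ratios = {gene: three_prime_ratio(d) for gene, d in coverage.items()}
    for gene, ratio in ratios.items():
        print(f"{gene}\t{ratio:.1f}")
    print("median ratio:", round(median(ratios.values()), 1))
```

A median ratio well above 1 across many genes would point to a library-wide 3' bias (consistent with degradation plus poly-A selection), whereas a high ratio in only this one gene would point to something gene-specific.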

Are you aligning against a reduced reference (i.e. just this gene, or some small subset of genes)? If the data is from the whole genome/transcriptome, then you should align to the full genome/transcriptome.

GenoMax, yes I am aligning to just a few genes. I will try with the whole genome. So that I can learn from this, would you mind helping me understand the problem with this strategy?

NGS aligners are designed to be greedy. They will try to align reads to the best of their ability. If your data is from the whole genome and you use a reduced reference, there is always a possibility that reads will be aligned to regions they did not originate from, simply because of sequence homology (e.g. think of a motif that is common).
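
A toy Python sketch of that effect follows. It is not how bowtie2 or any real aligner works internally (those use indexed seed-and-extend search with scoring and mapping qualities); it only brute-forces the best ungapped placement by mismatch count. All sequences, names, and the mismatch limit are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read, reference, max_mismatches=3):
    """Return (name, position, mismatches) of the best ungapped placement, or None.

    reference: dict mapping sequence name -> sequence.
    Ties keep the first placement found.
    """
    best = None
    for name, seq in reference.items():
        for pos in range(len(seq) - len(read) + 1):
            mm = hamming(read, seq[pos:pos + len(read)])
            if mm <= max_mismatches and (best is None or mm < best[2]):
                best = (name, pos, mm)
    return best


# Hypothetical sequences: the read truly comes from other_locus, but the
# gene of interest carries a similar motif that differs by one base.
gene_of_interest = "ACGTACGATTGACGAGTAGG"
other_locus = "GGCACTTGACCAGTACCTA"
read = "TTGACCAGTA"

reduced_reference = {"gene_of_interest": gene_of_interest}
full_reference = {"gene_of_interest": gene_of_interest, "other_locus": other_locus}

if __name__ == "__main__":
    print("reduced:", best_hit(read, reduced_reference))
    print("full:   ", best_hit(read, full_reference))
```

With only the gene in the reference, the read is placed on the gene with one mismatch because nothing better is available; with both sequences present, it goes to its exact match at the other locus.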

This makes good sense, and thank you for taking the time to help me understand the best practices and why they are valuable. This is why I had BLASTed portions of this gene against the whole genome, to see if there was an area of high sequence homology. Since nothing matched well enough when BLASTing with default parameters, is it still likely that the NGS aligners would pick up something from elsewhere in the genome?

Could you please describe your problem in a bit more detail:

What organism are you dealing with? Is it a eukaryote?

Does your gene have shorter isoforms?

What kind of library prep was used to generate the data: FFPE-based, targeted, 3'Seq, whole transcriptome?
