Question

How do you people tell if a gene is really present in the sample after mapping the reads on the gene?

0

Entering edit mode

7.4 years ago

hed.robin ▴ 30

I have a set of gene sequences.
And some sets of reads.
I mapped these reads on the genes that I have and now I'm wondering about the best way to tell whether a gene is really present in the samples that I'm analyzing.
We use read_count /(gene_length*number_of_reads_in_sample) formula to compare the genes, but I'm not sure this is good enough, because for some reason the reads might only map on a certain region of a gene.

So what do you do to check if a gene is really there?

UPD: It is not RNA-seq, I have metagenomic DNA reads, so I want to determine physical presence of a gene, not the expression.

Bowtie sequencing gene • 2.0k views

ADD COMMENT • link updated 7.4 years ago by Matteo Schiavinato ★ 3.6k • written 7.4 years ago by hed.robin ▴ 30

score 2 · Answer 1 · 2016-11-18

If we are talking RNA-seq I'm pretty convinced these days that there is no such thing as "really present" or "really not present".

If you sequence deep enough you'll find that all the genome is transcribed at some level. Genes are not like lights that are either off or one, they are like ambient light levels - some days it is brighter, some days darker, but there is rarely no light, even if it is just star light, or reflected light.
All RNAseq experiments will be contaminated with some level of genomic DNA. Its more or less impossible to guarentee that your sample is completely devoid of this, hoever hard you treat it with DNAse.
As you say, different parts of the gene can be expressed in different samples, or perhaps the signal you are seeing if not from the structure you are expecting.

So the genuine answer to the question "How do you people tell if a gene is really present in the sample after mapping the reads on the gene" is that we can't. However, people continue to want an answer to this question so here are some rules of thumb:

The measurement you are talking (reads/(length*library_size)) sounds very similar to FPKM to me. Many people have used and FPKM of one as an expression cut off. Transcripts per million (TPM) is probably better than FPKM. Given that an average human cell around 200,000 transcripts in it, a TPM of 5 would equate to about 1 transcript per cell.

If you really care, you could select matched intergeneic regions the same size and structure as each of the genes at random and calculate the distribution of expression values for those and look for the genes that did stand out form this negative set. - I know someone who did this and it was not a simple piece of work - you should probably really care about presence/absence before undertaking this.

score 1 · Answer 2 · 2016-11-21

1

Entering edit mode

7.4 years ago

Matteo Schiavinato ★ 3.6k

If you have DNA Reads, then it can happen that your sequencing experiment didn't cover entirely the gene region (for multiple reasons).

You should first check how many reads map on your gene sequence regardless of where they map. This gives you a rough indication of the presence of the gene or at least something very similar (it is very unlikely that reads align to a wrong gene, there must be sequence similarity at least. If there aren't any, then you know it' s not there.

Once you have extracted those reads mapping on your gene of interest, you should check the mapping positions. Since you have a multifasta with gene sequences, you should see a mapping position distribution from 1 to the last (or similar). If you see that the distribution is weird (f.e. only located in one single part) then you know that maybe only that part of the gene is there.

ADD COMMENT • link 7.4 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Well lets say I got some bit numbers in terms of reads count, and then I realize that they are mapping not just a part of a gene, but a very very small portion of a gene that is even less than a read.

So I am wondering, is there any common way to count all of this, like, in one formula? Or I have to invent that myself?

ADD REPLY • link 7.4 years ago by hed.robin ▴ 30

0

Entering edit mode

even less than a read

I am not sure that I understood this point. How can it be "less than a read" when it's a read that you map? Are you mapping locally?

ADD REPLY • link 7.4 years ago by Matteo Schiavinato ★ 3.6k

1

Entering edit mode

yep it's a situation that can happen on the edge of a sequence, if you have quite a short sequence and long reads, so you allow local mapping.

I've solved my problem with filtering by coverage breadth and average coverage depth per base.

ADD REPLY • link 7.4 years ago by hed.robin ▴ 30

score 0 · Answer 3 · 2016-11-18

Hi hed.robin,

First let's clear out something: I guess you are using RNASeq reads to quantify gene expression, am I right? If not, please specify which reads you are using and for which purpose.

I have a set of gene sequences. And some sets of reads.

Second: your "set of genes" is a multifasta containing all the gene sequences, or the transcript sequences? This changes things.

So what do you do to check if a gene is really there?

If you want to check gene expression, aligning reads against the transcript sequences or against the reference genome will generate a BAM file (or PSL if you use old programs) that you can then sort by genomic coordinate and give to a quantification algorithm (for example, HTSeq or Cufflinks). These programs count how many reads map on the gene region which is a first number that tells you if the gene is expressed (I guess you mean expressed when you say "in there").

We use read_count /(gene_length*number_of_reads_in_sample)

This sounds fair enough to me as a normalization rule. Bioinformatics is a dirty science in this context, you can always have situations like the one you described, but you have to look for the big picture. If you want to understand better why people use these normalization calculations, here is a very good article imho:

http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

for some reason the reads might only map on a certain region of a gene

My PhD project is forcing me to deal with this problem. If you know which genes might show this behaviour, I strongly recommend you to do a couple of things: either running cuffdiff providing a GFF3 list of exons as "genes" (I know it's against "the rules", but works) and see for the genes you are interested in, if the expression is only on one exon and not on all the gene. Another way is to use JBrowse ( http://jbrowse.org/ ) to graphically assess where your reads map. You can easily install it in your /var/www directory and load tracks. But maybe try the first one before, it's quick and dirty but gives you an idea.