GTF2/GFF3 "feature" types and expression analysis
2
2
Entering edit mode
10.0 years ago
ndejay ▴ 30

Hi,

I aligned a few samples using STAR to the genome provided in the Illumina iGenomes UCSC hg19 bundle (here) -- I used the provided gene feature (gtf2) file as is. Now, my motive is to calculate the gene and isoform expression levels using bedtools multicov (at the same time).

Use of the gtf2 file produces a file containing read counts per exon. I wish to compute gene and isoform read counts too, so I converted the gtf2 file to a gff3 file using using gtf2gff3 script from SO/GAL (here).

My first question is: Is it OK if the alignment is performed with gtf2 file but counted for reads using the gff3 file, keeping in mind that the gff3 file was converted from the gtf2 file?

My second question follows

I have read both these resources (here and here) but do not understand the differences between:

  • exon vs CDS
  • transcript vs mRNA

I know that with the process I described, it is possible to retrieve gene read count by selecting only the lines where feature=gene from the bedtools multicov output. What must I do for isoforms? I am confused by the semantics.

Thanks ahead of time and let me know if my post was not clear enough.

gtf RNA-Seq counts bedtools gff • 4.7k views
ADD COMMENT
1
Entering edit mode

Is there any particular reason you're trying to do this with bedtools? Using htseq-count or featureCounts would make your life easier for gene-level analyses. Also, you generally don't want to count reads for looking at transcript expression level, but rather use an expectation-maximization alogrithm (normally in the form of cufflinks, rsem, eXpress, etc.) to assign full and fractional reads, since otherwise you'll get almost 0 unique reads (or you end up double/triple/etc. counting half your reads).

ADD REPLY
0
Entering edit mode

I was hoping to obtain gene and isoform levels in a single step. In my experience htseq-count has proven to be quite slow so I decided to do things differently, that is using bedtools, and getting the most out of it. Why isn't counting reads sufficient? I am planning to analyze all of this using programs like DESeq. I'd have at least two files per sample : gene-only read counts and isoform-only read counts.

ADD REPLY
1
Entering edit mode

The slowness of htseq-count can be remedied by simply using featureCounts instead (it's MUCH faster). I'm sure you can use bedtools for looking at gene-level read counts if you muck around enough, but that's not exactly the path of least resistance. While you can attempt to look at isoform usage in DESeq (or DESeq2) the output would be completely meaningless and you would have just wasted large amounts of your time. Count-based methods are simply useless in looking at isoform usage (though you CAN look at exon usage, see the DEXseq package).

ADD REPLY
0
Entering edit mode

Thanks for your words of wisdom. I will look at featureCounts and DEXSeq. Sorry for being a pain, but can you explain (or point me to resources explaining) why count-based methods won't work for isoform usage? Why won't bedtools do the trick? I currently have a the read counts of every gene as per the gff file -- is it related to the ambiguity or multimapping or because htseq-count is more stringent on its counting?

ADD REPLY
1
Entering edit mode

It's related to the multiple assignment issue (you might mean this by "multiple mapping", though that has a slightly different meaning in this context). I vaguely remember some of this being discussed in the cufflinks2 paper from Lior Pachter. The general issue is that isoforms overlap almost completely, so by using simple count-based method you will always drastically over/under-estimate actual levels and thereby tank your power. The image below (from the aforementioned paper and Lior's blog, which is a must-read) gives an example of a few count-based methods and how they lend themselves to producing wrong answers.

ADD REPLY
2
Entering edit mode
10.0 years ago

You were almost at the right place, the primary purpose of the website that hosts the GFF3 definition is to define the terminology (ontology) for each term. If you go to the browser and enter exon and CDS you will get the exact specification for each.

http://www.sequenceontology.org/browser/obob.cgi

Exon: A region of the transcript sequence within a gene which is not removed from the primary RNA transcript by RNA splicing.

CDS: A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.

As for conversions, the GTF2 and GTF3 files differ only in the information contained in the 9th column.

ADD COMMENT
0
Entering edit mode

Thanks for the prompt reply Istvan! It appears that mRNA is a child of mature_transcript, which in turn is a child of transcript. Does this mean that in order to get transcript-level read counts, I have to select both mRNA and transcripts?

ADD REPLY
1
Entering edit mode

You can read the descrition of the nodes from transcript to mRNA to see the attributes that change and thus make the terms different. But during read counting we use the transcripts only to group exons that express together so usually it does not matter which naming is used.

Long story short you would only need to use one not both. Moreover most often it does not matter which one you choose, it matters later when you try to connect to other sources of information.

ADD REPLY
2
Entering edit mode
10.0 years ago

Regarding mRNA vs. transcript, since Istvan already mentioned exons and coding sequences (CDS), mRNAs are a subset of transcripts that encode proteins. There are many other types of transcripts (tRNAs, miRNAs, lncRNAs, etc.) that are not protein-coding. Depending on the type of library that was sequenced, you may expect to find mostly mRNAs or mostly other forms.

ADD COMMENT

Login before adding your answer.

Traffic: 1983 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6