How to interpret the exon numbering for the minus strand?
4.0 years ago

Hi, I just downloaded the hg38.ncbiRefSeq.gtf annotation file from UCSC and I'm looking at the first few lines of it.

chr1    ncbiRefSeq  exon    11874   12227   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018.2"; exon_number "1"; exon_id "NR_046018.2.1"; gene_name "DDX11L1";
chr1    ncbiRefSeq  exon    12613   12721   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018.2"; exon_number "2"; exon_id "NR_046018.2.2"; gene_name "DDX11L1";
chr1    ncbiRefSeq  exon    13221   14409   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018.2"; exon_number "3"; exon_id "NR_046018.2.3"; gene_name "DDX11L1";

So, since the gene DDX11L1 is on the plus strand, I interpret this so that its exon1 spans from position 11874-12227, its exon2 from 12613-12721 and its exon3 from 13221-14409, correct? Meaning that the bases in between these positions correspond to the introns 1 (12228-12612) and 2 (12722-13220), right?
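The intron arithmetic above can be checked with a minimal Python sketch. It assumes GTF coordinates are 1-based and inclusive (as in the lines quoted above); the helper name `introns_between` is made up for illustration and is not part of any library.

```python
# Exon coordinates copied from the DDX11L1 lines above (1-based, inclusive).
exons = [(11874, 12227), (12613, 12721), (13221, 14409)]

def introns_between(exons):
    """Return the 1-based inclusive gaps between consecutive exons."""
    ordered = sorted(exons)  # sort by genomic start position
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]

introns = introns_between(exons)
print(introns)  # [(12228, 12612), (12722, 13220)]
```

Because the helper sorts by genomic position, it gives the same intervals for a minus-strand transcript; only the intron numbering in transcript order would differ.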

So far, so good. But now when I look at a gene that is located on the minus strand, for example WASH7P (the very next gene in the file):

chr1    ncbiRefSeq  exon    14362   14829   .   -   .   gene_id "WASH7P"; transcript_id "NR_024540.1"; exon_number "1"; exon_id "NR_024540.1.1"; gene_name "WASH7P";
chr1    ncbiRefSeq  exon    14970   15038   .   -   .   gene_id "WASH7P"; transcript_id "NR_024540.1"; exon_number "2"; exon_id "NR_024540.1.2"; gene_name "WASH7P";
chr1    ncbiRefSeq  exon    15796   15947   .   -   .   gene_id "WASH7P"; transcript_id "NR_024540.1"; exon_number "3"; exon_id "NR_024540.1.3"; gene_name "WASH7P";

(I just show the first 3 exons here)

I understand that the positions shown here actually give the end and then the start of the exon, because it's on the minus strand, right? So, position 14362 is the LAST base of exon1 and 14829 is the FIRST base of exon1, correct? For exon2 the FIRST base is 15038 and the LAST one is 14970, right?
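To make that reading concrete, here is a small Python sketch. It assumes the usual GTF convention that the start column is never larger than the end column, whatever the strand; the function name `transcribed_ends` is hypothetical and only used for illustration.

```python
def transcribed_ends(start, end, strand):
    """Return (first_base, last_base) in the direction of transcription.

    GTF stores start <= end on the reference (+) strand, so for a '-'
    feature the first transcribed base is the value in the end column.
    """
    if strand == "+":
        return start, end
    if strand == "-":
        return end, start
    raise ValueError(f"unexpected strand: {strand!r}")

# The WASH7P line labeled exon_number "1" and the first DDX11L1 exon.
wash7p_first = transcribed_ends(14362, 14829, "-")
ddx11l1_first = transcribed_ends(11874, 12227, "+")
print(wash7p_first)   # (14829, 14362)
print(ddx11l1_first)  # (11874, 12227)
```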

So, in the mRNA resulting from joining all the exons here, wouldn't what is called "exon1" in WASH7P actually be the LAST exon in the mRNA? Not the first one? Why is it called exon ONE in the file? Wouldn't the exon order be reversed for genes located on the minus strand if their START and END positions are reversed? On the minus strand, I would expect exon n+1 to be located UPSTREAM of exon n, not downstream? This is confusing af.

Am I interpreting anything wrong?

4.0 years ago

Yes, in this file the exon marked exon_number "1" is in fact the last exon of the transcript. Coordinates are always written relative to the + strand (start before end), and we reverse complement things afterwards if they're on the - strand. The last exon is numbered as the first because of how these files are made: exons are written from the lowest coordinate to the highest, so things like the exon number are relative to the + strand too. Don't read the exon_number field in this file as transcript order; it has no biological meaning here and is mostly there because it is convenient when programming.
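If you need exon numbers in transcript order (5' to 3'), you can derive them from the coordinates and the strand instead of trusting exon_number. A minimal sketch in Python, assuming the exons of one transcript have already been collected as (start, end) tuples; the function name `number_in_transcript_order` is hypothetical, not from any library:

```python
def number_in_transcript_order(exons, strand):
    """Map transcript-order rank (1 = 5'-most exon) to (start, end).

    On the '+' strand the 5'-most exon has the lowest coordinates;
    on the '-' strand it has the highest, so the sort is reversed.
    """
    ordered = sorted(exons, reverse=(strand == "-"))
    return {rank: exon for rank, exon in enumerate(ordered, start=1)}

# Only the three WASH7P exons quoted in the question. The transcript has
# more exons, so these ranks are relative to this subset only.
wash7p = [(14362, 14829), (14970, 15038), (15796, 15947)]
ranks = number_in_transcript_order(wash7p, "-")
print(ranks[1])  # (15796, 15947)
print(ranks[3])  # (14362, 14829)
```

With the full exon list of a minus-strand transcript, the exon that the file labels exon_number "1" ends up with the highest rank, which matches the reasoning above.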
