Question

Transcript regions in annotation data

6.6 years ago

usman.enam • 0

Hi,

I am trying to create a representation of how RNA-binding proteins align to the transcript regions of mRNA (5' UTR, CDS, 3' UTR) using data from ENCODE. To do this, I am aligning the data with an annotation file (borrowed from someone else), and I am confused by the directionality, and hence by which region each protein aligns to.

For example, here are 2 lines from the annotation:

#name       chrom  strand  txStart  txEnd   cdsStart  cdsEnd
uc031pju.2  chr1   +       925740   944581  925941    944153
uc001abz.5  chr1   -       944203   959290  944693    959240

So a read that is from the CDS would obviously be anywhere between 925941 and 944153 for the plus-strand transcript and between 944693 and 959240 for the minus-strand transcript.

But are the following statements correct? And if not, can you correct them?

In the case of the plus strand:
1) a read that is less than cdsStart but greater than txStart is 5' UTR
2) a read that is greater than cdsEnd but less than txEnd is 3' UTR

In the case of the minus strand:
1) a read that is less than cdsStart but greater than txStart is 3' UTR
2) a read that is greater than cdsEnd but less than txEnd is 5' UTR
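
To make the rule I am asking about concrete, here is a minimal Python sketch of it. It rests on assumptions I have not confirmed: that the coordinates are always given on the forward strand of the genome (so txStart < txEnd even for minus-strand transcripts), that intervals are 0-based and half-open as in UCSC-style tables, and that a transcript with cdsStart equal to cdsEnd is non-coding. The function name `classify_region` is made up for illustration, and introns are ignored because the lines above carry no exon columns.

```python
def classify_region(pos, strand, tx_start, tx_end, cds_start, cds_end):
    """Label a genomic position relative to one transcript.

    Assumes forward-strand, 0-based, half-open coordinates
    (an assumption, not something the annotation file states).
    Returns "5' UTR", "CDS", "3' UTR", "noncoding", or None.
    """
    # Position lies outside the transcript entirely.
    if not (tx_start <= pos < tx_end):
        return None
    # Assumed convention: an empty CDS interval marks a non-coding transcript.
    if cds_start == cds_end:
        return "noncoding"
    if cds_start <= pos < cds_end:
        return "CDS"
    # True when the position has a lower genomic coordinate than the CDS.
    before_cds = pos < cds_start
    if strand == "+":
        return "5' UTR" if before_cds else "3' UTR"
    # On the minus strand the transcript is read from txEnd down to txStart,
    # so the UTR labels swap.
    return "3' UTR" if before_cds else "5' UTR"


# The two annotation lines from above as (strand, txStart, txEnd, cdsStart, cdsEnd).
plus_tx = ("+", 925740, 944581, 925941, 944153)
minus_tx = ("-", 944203, 959290, 944693, 959240)

print(classify_region(925800, *plus_tx))
print(classify_region(944300, *minus_tx))
```

If these assumptions hold, the same low-coordinate flank (txStart to cdsStart) would be labelled 5' UTR for the plus-strand transcript and 3' UTR for the minus-strand one, which is exactly what I would like confirmed.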

I guess I am confused about what the position numbers in the annotation really mean. Where do they come from?

Thanks in advance

sequence genome e-clip annotation • 1.0k views