I performed RNA sequencing using a poly(A) 3' tagging/sequencing approach, so I expect the sensible reads to map only to the 3' end of the transcripts in my sample. I want to subset the GENCODE gene definitions to include only the 3' UTR plus 1 kb of exon. What is the best way to do this? I had the following ideas:
1) "grep" the GENCODE definition files for "UTR" lines, then find exons whose coordinates are immediately adjacent, and keep going backwards (to get more neighboring exons) until I get my 1 kb.
2) "grep" the GENCODE files for "stop_codon" lines, then keep collecting exons whose coordinates are immediately adjacent to the "stop_codon" coordinates, until I get my 1 kb.
3) Find the "transcript" lines of the GENCODE file, try to match them to the "UTR" lines, then select the last 1 kb of the "transcript" definitions (and add on the coordinates for the UTR).
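All three ideas end with the same step: walking backwards from the 3' end over a transcript's exons until a fixed number of exonic bases has been collected. Here is a minimal sketch of that step, assuming the exons of one transcript have already been pulled out of the GTF as 1-based inclusive (start, end) pairs. The function name `three_prime_window` and its arguments are hypothetical, not part of any GENCODE tool. Introns are skipped, so the 1 kb is counted in transcript space rather than genomic space.

```python
def three_prime_window(exons, strand, n=1000):
    """Return the exon pieces covering the last n exonic bases at the 3' end.

    exons:  list of (start, end) tuples for ONE transcript,
            1-based inclusive coordinates as in a GTF file.
    strand: "+" or "-".
    n:      number of exonic bases to keep (hypothetical default of 1 kb).
    """
    # Walk exons starting from the 3' end: that is the highest
    # coordinate on "+" and the lowest coordinate on "-".
    ordered = sorted(exons, reverse=(strand == "+"))
    remaining = n
    kept = []
    for start, end in ordered:
        if remaining <= 0:
            break
        length = end - start + 1
        if length <= remaining:
            # The whole exon fits inside the window.
            kept.append((start, end))
            remaining -= length
        else:
            # Partial exon: keep only the part nearest the 3' end.
            if strand == "+":
                kept.append((end - remaining + 1, end))
            else:
                kept.append((start, start + remaining - 1))
            remaining = 0
    # Report the pieces in ascending genomic order.
    return sorted(kept)


exons = [(100, 199), (300, 399), (500, 1099)]
print(three_prime_window(exons, "+", n=650))
print(three_prime_window(exons, "-", n=650))
```

If the "exon" lines already span the UTR (which is my understanding of the GENCODE GTF, but worth checking on a few transcripts), then "UTR + 1 kb" could be obtained by calling this with n set to the 3' UTR length plus 1000, instead of stitching UTR and exon coordinates together separately.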
Besides figuring out the best way to get these 3' end coordinates, I also have the following questions:
1) Should all transcript definitions have a "UTR" line in the GENCODE definition files?
2) Should all "UTR" definitions have adjacent "exons"?
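The first question can also be checked empirically against the annotation file itself. Below is a rough sketch, assuming a standard 9-column, tab-separated GTF where the feature type is in column 3 and the attribute column contains `transcript_id "..."`. The helper names `features_by_transcript` and `transcripts_without` are made up for illustration.

```python
import re
from collections import defaultdict

# Attribute pattern assumed from the GTF convention: transcript_id "ENST...";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def features_by_transcript(gtf_lines):
    """Map each transcript_id to the set of feature types seen for it."""
    seen = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match:  # "gene" lines carry no transcript_id and are skipped
            seen[match.group(1)].add(fields[2])
    return seen


def transcripts_without(seen, feature="UTR"):
    """Return the sorted transcript IDs that have no line of the given type."""
    return sorted(t for t, feats in seen.items() if feature not in feats)
```

Passing an open file handle for the GTF to `features_by_transcript` and then calling `transcripts_without(seen, "UTR")` would list the transcripts that have no "UTR" line at all, which shows how common that case is before committing to one of the approaches above.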