Question

Map RNA-Seq reads in Genomic Regions of Interest (eg 5'UTR)

0

Entering edit mode

7.3 years ago

Walter Baumann ▴ 10

Hi,

I'm a self-learner in RNA-Seq Data analysis so sorry in case this question is basic.

I have a Bam file with an Alignment of RNA-Seq data and the human genome (GRCh38.p7; from Gencode). Now I want to search those reads that map in the 5'UTR and 3'UTR of each transcript.

I already read a lot, that one can calculate the Ranges of the UTRs oneself in R (with GenomicRanges; there was a post before, unfortunately the example was not fully reproducible) and you can download the file from the UCSC. I guess here I can download the bed file containing those information:

http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=579355771_AZVl05hD8wxt47Spc1c92Baapc0G&clade=mammal&org=Human&db=hg38&hgta_group=genes&hgta_track=mrna&hgta_table=0&hgta_regionType=genome&position=chr9%3A133252000-133280861&hgta_outputType=bed&hgta_outFileName=

But how can I procede then? Do I then have to continue to work with bedtools?

Thanks a lot!

RNA-Seq rna-seq R next-gen sequencing • 3.4k views

ADD COMMENT • link 7.3 years ago by Walter Baumann ▴ 10

3

Entering edit mode

I want to search those reads that map in the 5'UTR and 3'UTR of each transcript.

It's unclear what the desired output is of this analysis. Do you want to filter your bam file? Get the read IDs of those in the UTR? Count reads in UTRs?

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

I want to get the Read IDs of those in the UTR (two seperate files, one for 5' the other for 3'), but also count the reads. That map in each region.

ADD REPLY • link 7.3 years ago by Walter Baumann ▴ 10

score 1 · Answer 1 · 2017-02-15

For counting, you would need a gtf of UTRs, which you can probably obtain easily from UCSC or Ensembl (maybe by downloading the entire gtf and filtering out the UTRs). Then you would use one of the commonly used count-tools such as featureCounts or htseq-counts to generate counts per element.

To get the read IDs themselves you would filter your bamfile using bedtools with the same gtf, and then slice out the read IDs (probaby just using cut).

score 0 · Answer 2 · 2017-02-15

Additional information: This is a R-function I found on biostars. However, the last two lines give me the output (see below). Everything works well, however, the last line gives this error:

Error in .pcompare_GenomicRanges(x, y) : 
  the 2 objects to compare have seqlevels in incompatible orders

This here is the open source code that seems to generate this error, unfortunately I don't understand why this error is produced.

However, I think downloading the file on the UCSC browser is much more convenient, if the output is the same.

# The gtf file was downloaded here: http://www.gencodegenes.org/releases/current.html, Release 25 (GRCh38.p7)

# Just transcripts
genes <- read.table("/path/to/file.gtf", sep = '\t', skip = 5, stringsAsFactors = FALSE)
whichCodingTranscripts <- genes[, 3] == "transcript" & grepl("transcript_type protein_coding", genes[, 9], fixed = TRUE)
proteinTranscripts <- genes[whichCodingTranscripts, ]

# + or - 
strands <- proteinTranscripts[, 7]

# Split the entries in the ninth column after each ";" 
allFeaturesTranscripts <- gsub("transcript_id ", '', sapply(strsplit(genes[, 9], "; "), '[', 2))

# Return a vector of all Transcript names
proteinTranscriptsNames <- allFeaturesTranscripts[whichCodingTranscripts]

# Both lines now will search for the protein coding part of all transcripts 
whichCDS <- genes[, 3] == "CDS" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsCDS <- genes[whichCDS, ]



# contains information on all CDS per transcript 

transcriptsCDS <- split(GRanges(transcriptsCDS[, 1], IRanges(transcriptsCDS[, 4], transcriptsCDS[, 5]), transcriptsCDS[, 7]),
                        factor(allFeaturesTranscripts[whichCDS], levels = proteinTranscriptsNames))


# Both functions search for the first and the last CDS on the transcript; pick either first or last CDS per transcript  
## 
firstCDS <- mapply(function(CDS, strand) {
  if(strand == '+') {CDS[1]} else {CDS[length(CDS)]}}, transcriptsCDS, strands, SIMPLIFY = FALSE)
lastCDS <-  mapply(function(CDS, strand) {
  if(strand == '+') {CDS[length(CDS)]} else {CDS[1]}}, transcriptsCDS, strands, SIMPLIFY = FALSE)

firstCDS <- GRangesList(firstCDS)
lastCDS <- GRangesList(lastCDS)


# Now repeat the "workflow described above for UTRs
## The knowledge on the CTR is needed for 
whichUTR <- genes[, 3] == "UTR" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsUTR <- genes[whichUTR, ]
transcriptsUTR <- split(GRanges(transcriptsUTR[, 1], IRanges(transcriptsUTR[, 4], transcriptsUTR[, 5]), transcriptsUTR[, 7]),
                        factor(allFeaturesTranscripts[whichUTR], levels = names(firstCDS)))


# Here are the problems
transcriptsUTR5 <- mapply(function(UTR, CDS, strand){        
  if(strand == '+') {UTR[UTR < CDS[1]]} else {UTR[UTR > CDS[length(CDS)]]}}, 
  transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)


transcriptsUTR3 <- mapply(function(UTR, CDS, strand){        
  if(strand == '+') UTR[UTR > CDS[length(CDS)]] else UTR[UTR < CDS[1]]}, 
  transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)