Question: Rsamtools-Countoverlaps
1
gravatar for Cdiez
7.2 years ago by
Cdiez150
Cdiez150 wrote:

Hi, I'm a beginner analyzing genomic data and using R and I have a doubt that I don't know how to solve. I'm using Rsamtools to analyze mRNAseq_data but once I have the reads covering each gene I would like to add the information about their chromosome, start and stop of each gene. I have downloaded the gene information from the UCSC browser as a RangeList following this command line:

library(GenomicFeatures)
txdb = makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene')

I have tried to coerce the RangeList into a data frame in order to get this "extra data" but the number of observations changes so I don't know how to extract from the RangeList the information I need of each gene.

Right now I have this data frame:

Gene          Reads

AGR37373.F06    5

AGR37273.F04    0

And I would like to have this:

Gene          Chr     Start    Stop   Reads

AGR37373.F06    1      20       150     5

AGR37273.F04    1      300      410     0

I'd be very grateful if someone could give me any advice about how to solve this problem.

Thanks!

mrna R samtools • 3.1k views
ADD COMMENTlink modified 7.2 years ago by Michael Dondrup44k • written 7.2 years ago by Cdiez150

Did you try to merge the two dataframes? read ?merge and merge on the the geneid.

ADD REPLYlink written 7.2 years ago by Michael Dondrup44k
1
gravatar for Ido Tamir
7.2 years ago by
Ido Tamir4.9k
Austria
Ido Tamir4.9k wrote:

My guess is you do something like:

library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene')
aligns <- readBamGappedAlignments("my.bam")
counts <- countOverlaps(transcripts(txdb), aligns)
names(counts) <- elementMetadata(tx)[,"tx_name"]

so change the last line to and following:

 elementMetadata(txdb)[,"my.counts"] <- counts
 df <- as.data.frame(txdb)

and voila you have a data frame with all genomic positions and metadata.

However I would suggest you to look at htseq-count because it has multiple resolution rules for situations that a simple count does not answer. call it from R with system("htseq-count ...") and read in the resulting data.frame.

ADD COMMENTlink written 7.2 years ago by Ido Tamir4.9k
0
gravatar for Cdiez
7.2 years ago by
Cdiez150
Cdiez150 wrote:

Hi! Thanks a lot for your useful comments!

I have tried the two suggested methods but I have not had any successful result :(

  1. Yes, you are right (Ido Tamir), I was trying this command line:

library(GenomicFeatures) txdb <- makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene') aligns <- readBamGappedAlignments("my.bam") counts <- countOverlaps(transcripts(txdb), aligns) counts.dframe <- data.frame(counts, stringsAsFactors=FALSE) rownames(counts.dframe) <- names(txdb)

Following this command line I obtained the number of reads overlapping each gene:

CC209858.4_FG002 214 CC209858.4_FG003 0 CC209858.4_FG004 17 CC209859.2_FG006 0 CC209859.2_FG008 0

Now I wanted to add an extra column with the information stored in the Metadata(txdb). I tried the two suggested options suggested command lines but I got an error:

rownames(counts.dframe) <- elementMetadata(txbd)[,"tx_name"]

Error in elementMetadata(txbd)[, "tx_name"] : selecting cols: cannot subset by character when names are NULL

elementMetadata(txbd)[,"my.counts"] <- count.dframe

Error in [<-(*tmp*, , "my.counts", value = list(counts = c(1L, : replacing cols: cannot subset by character when names are NULL

  1. About trying to merge both files. The problem to merge the two files(counts and genes) using the gene ID is that I have to coerce into data.frame both files, and when I do it the number of rows of the txbd.dataframe doesn't correspond with the number of observations this file has as GRangeList (I don't get why..). Then, I can not merge the two datafrmes.

i.e:

Original files Files coerced into data frame

counts (interger[110185]) counts.dframe (110185 obs. of 1 variables) txbd (GRangesList of length 110185) txbd.dframe (392978 obs. of 8 variables)

Then, I don't know how I can access to the information store in the metaData and use it to complete the information about the reads overlapping each gene. I just thinking about to export the files and write and script in Perl to do it, but should be a way to do it in R (I guess...).

Thanks!

ADD COMMENTlink written 7.2 years ago by Cdiez150

elementMetadata(txbd)[,"my.counts"] <- count.dframe

should be - I did not convert the counts vector into a data.frame:

elementMetadata(txbd)[,"my.counts"] <- count.dframe$Reads

I don't think countOverlaps is appropriate. Try htseq.

ADD REPLYlink written 7.2 years ago by Ido Tamir4.9k
0
gravatar for Michael Dondrup
7.2 years ago by
Bergen, Norway
Michael Dondrup44k wrote:

The number of rows doesn't have to be equal

Try:

merge(counts, meta, all=TRUE, by.x=1, by.y=1) # set by.x by.y to the column number with the Geneids

That way you will get a full outer join like in SQL with NAs inserted in rows that don't in exist in one of the files.

ADD COMMENTlink written 7.2 years ago by Michael Dondrup44k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1853 users visited in the last hour