Rsamtools-Countoverlaps
3
1
Entering edit mode
12.7 years ago
Cdiez ▴ 150

Hi, I'm a beginner analyzing genomic data and using R and I have a doubt that I don't know how to solve. I'm using Rsamtools to analyze mRNAseq_data but once I have the reads covering each gene I would like to add the information about their chromosome, start and stop of each gene. I have downloaded the gene information from the UCSC browser as a RangeList following this command line:

library(GenomicFeatures)
txdb = makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene')

I have tried to coerce the RangeList into a data frame in order to get this "extra data" but the number of observations changes so I don't know how to extract from the RangeList the information I need of each gene.

Right now I have this data frame:

Gene          Reads

AGR37373.F06    5

AGR37273.F04    0

And I would like to have this:

Gene          Chr     Start    Stop   Reads

AGR37373.F06    1      20       150     5

AGR37273.F04    1      300      410     0

I'd be very grateful if someone could give me any advice about how to solve this problem.

Thanks!

mrna r samtools • 5.0k views
ADD COMMENT
0
Entering edit mode

Did you try to merge the two dataframes? read ?merge and merge on the the geneid.

ADD REPLY
1
Entering edit mode
12.7 years ago
Ido Tamir 5.2k

My guess is you do something like:

library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene')
aligns <- readBamGappedAlignments("my.bam")
counts <- countOverlaps(transcripts(txdb), aligns)
names(counts) <- elementMetadata(tx)[,"tx_name"]

so change the last line to and following:

elementMetadata(txdb)[,"my.counts"] <- counts
df <- as.data.frame(txdb)

and voila you have a data frame with all genomic positions and metadata.

However I would suggest you to look at htseq-count because it has multiple resolution rules for situations that a simple count does not answer. call it from R with system("htseq-count ...") and read in the resulting data.frame.

ADD COMMENT
0
Entering edit mode
12.7 years ago
Cdiez ▴ 150

Hi! Thanks a lot for your useful comments!

I have tried the two suggested methods but I have not had any successful result :(

  1. Yes, you are right (Ido Tamir), I was trying this command line:

library(GenomicFeatures) txdb <- makeTranscriptDbFromUCSC(genome='hg19',tablename='ensGene') aligns <- readBamGappedAlignments("my.bam") counts <- countOverlaps(transcripts(txdb), aligns) counts.dframe <- data.frame(counts, stringsAsFactors=FALSE) rownames(counts.dframe) <- names(txdb)

Following this command line I obtained the number of reads overlapping each gene:

CC209858.4_FG002 214 CC209858.4_FG003 0 CC209858.4_FG004 17 CC209859.2_FG006 0 CC209859.2_FG008 0

Now I wanted to add an extra column with the information stored in the Metadata(txdb). I tried the two suggested options suggested command lines but I got an error:

rownames(counts.dframe) <- elementMetadata(txbd)[,"tx_name"]

Error in elementMetadata(txbd)[, "tx_name"] : selecting cols: cannot subset by character when names are NULL

elementMetadata(txbd)[,"my.counts"] <- count.dframe

Error in [<-(*tmp*, , "my.counts", value = list(counts = c(1L, : replacing cols: cannot subset by character when names are NULL

  1. About trying to merge both files. The problem to merge the two files(counts and genes) using the gene ID is that I have to coerce into data.frame both files, and when I do it the number of rows of the txbd.dataframe doesn't correspond with the number of observations this file has as GRangeList (I don't get why..). Then, I can not merge the two datafrmes.

i.e:

Original files Files coerced into data frame

counts (interger[110185]) counts.dframe (110185 obs. of 1 variables) txbd (GRangesList of length 110185) txbd.dframe (392978 obs. of 8 variables)

Then, I don't know how I can access to the information store in the metaData and use it to complete the information about the reads overlapping each gene. I just thinking about to export the files and write and script in Perl to do it, but should be a way to do it in R (I guess...).

Thanks!

ADD COMMENT
0
Entering edit mode
elementMetadata(txbd)[,"my.counts"] <- count.dframe

should be - I did not convert the counts vector into a data.frame:

elementMetadata(txbd)[,"my.counts"] <- count.dframe$Reads

I don't think countOverlaps is appropriate. Try htseq.

ADD REPLY
0
Entering edit mode
12.7 years ago
Michael 54k

The number of rows doesn't have to be equal

Try:

merge(counts, meta, all=TRUE, by.x=1, by.y=1) # set by.x by.y to the column number with the Geneids

That way you will get a full outer join like in SQL with NAs inserted in rows that don't in exist in one of the files.

ADD COMMENT

Login before adding your answer.

Traffic: 1952 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6