Question: SummarizedExperiment nrow differs from ncol
0
gravatar for Parham
5.5 years ago by
Parham1.4k
Sweden
Parham1.4k wrote:

I am trying to optimize GAGE workflow (1) for my samples. I only have one control and one replicate. I changed parameter "each" from 4 to 1 and I got an error. How should I change it so that it fit?

> grp.idx <- rep(c("experiment", "control"), each=1)
> coldat=DataFrame(grp=factor(grp.idx))
> dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design = ~ grp)
Error in validObject(.Object) : 
  invalid class “SummarizedExperiment” object: 'colData' nrow differs from 'assays' ncol

 

Cheers!

 

summarizedexperiment deseq2 • 4.3k views
ADD COMMENTlink written 5.5 years ago by Parham1.4k
2

What's the output of ncol(cnts)?

ADD REPLYlink written 5.5 years ago by Devon Ryan93k
1

> ncol(cnts)
[1] 8

 

ADD REPLYlink written 5.5 years ago by Parham1.4k
1

Right, so 8 samples with only 2 having a group assignment. The number of rows of coldat needs to equal the number of samples (i.e., the number of columns in cnts). Otherwise, you end up with samples with no group assignment and DESeq2 has no clue what you really want to do (thus, the error).

ADD REPLYlink written 5.5 years ago by Devon Ryan93k

I see! Then I think I am doing it totally wrong may be! I thought this workflow is something general with basic commands that can be applied to my data so I can get start with! But it turned out more complicated than it seemed! I managed to map my genes with Trapnel et al 2012 protocol for tophat and cufflinks and was quiet successful and I was looking for something like that to learn with, but this one is very confusing to me. 

I haven't changed anything in the workflow and tried it as it was, and surprisingly there are 8 samples?! Shouldn't it be two based on one "experiment" and one "control"? How do I change rows of coldat?! 

Thanks Devon, you helped a lot so far. 

 

ADD REPLYlink written 5.5 years ago by Parham1.4k
1

It is a general workflow, but nothing can be so general that you give incorrect sample-group assignments (or wrong file names, or other things of that sort).

I would have to know how you made cnts to know why it has more columns than the two you expected. There should be one column for each sample, so either you actually do have more samples or you did something wrong when creating that matrix (that'd be my guess). You're making coldat from grp.idx, so just change how you make that. Having said that, don't just randomly change things until the error goes away. The values need to accurately describe the samples, or else you end up with some control samples being in the experiment group (and vice versa) and then the results are meaningless.

You might want to get a book on using R, since knowing the basic concepts will make your life vastly easier (most example workflows assume you know at least the basics).

ADD REPLYlink written 5.5 years ago by Devon Ryan93k

All right, I see your point. Now I got a book on R and I am trying to get a better understanding of it. Since I am a biology grad student and with very little knowledge on statistics and bioinfs, its very useful to get tips from people like you to show the "how to" path. However, now I think I get a better output from the scripts, at least until some points. I am including all the scripts I wrote for my experiment. I have two conditions and one replicate each! Named them pfh1 and pfh1-de! 

> library(GenomicFeatures)

> library(biomaRt)

> txdb<-makeTranscriptDbFromBiomart(biomart="fungi_mart_22", dataset="spombe_eg_gene", host="fungi.ensembl.org")

> exByGn <- exonsBy(txdb, "gene")

> library(Rsamtools)

> fls <- list.files("tophat_all_2/", pattern="bam$", full.names=T)

> bamfls <- BamFileList(fls)

> flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE)

> param <- ScanBamParam(flag=flag)

> library(GenomicAlignments)

> gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, single.end=FALSE, param=param)

> gnCnt

class: SummarizedExperiment 
dim: 7017 2 
exptData(0):
assays(1): counts
rownames(7017): SPAC1002.01 SPAC1002.02 ... SPSNRNA.06 SPSNRNA.07
rowData metadata column names(0):
colnames(2): pfh1.bam pfh1-de.bam
colData names(0)

> pfh1.cnts=assay(gnCnt)

> require(gageData)

from gnCnt it seems that it sorts my data correctly, having 7017 gene (expected) in 2 columns (conditions). Am I right?

From this step I don't understand the concept of loading hnrnp data from gageData! Is that something I have to skip it for my analysis? That was an amateur mistake! Because the output of dim(pfh1.cnts) is [1] 7017    2!  

I generally don't understand these following scripts! If you can kindly explain them please:

hnrnp.cnts=assay(gnCnt)

require(gageData)
data(hnrnp.cnts)
cnts=hnrnp.cnts

 

Thanks for reading all this!

Cheers!

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Parham1.4k
1

Yeah, it looks like gnCnt has the correct form now. The last three lines just load an example dataset so that you can actually run things in R without having your own data. There's no reason for you to bother with those.

ADD REPLYlink written 5.5 years ago by Devon Ryan93k

Thanks, what is happening here?

> sel.rn=rowSums(cnts) != 0
> cnts=cnts[sel.rn,]

 

ADD REPLYlink written 5.5 years ago by Parham1.4k
1

The first line can be broken into two parts: rowSums(cnts) != 0, which compares the sum of each row to zero and returns True if the sum > 0; and then sel.rn =, which just saves the vector of True/False values. The purpose of that is to just see which rows aren't all zeros. Then cnts=cnts[sel.rn,] subsets cnts to contain only those rows.

ADD REPLYlink written 5.5 years ago by Devon Ryan93k

I see, but then it seems I get another problem, when I run it on my data! I lose all genes! All of them have zero values?! 

> cnts=assay(gnCnt)
> dim(cnts)
[1] 7017    2
> sel.rn=rowSums(cnts) !=0
> cnts=cnts[sel.rn,]
> dim(cnts)
[1] 0 2

 

 

How can I avoid this?

 

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Parham1.4k

Check that the chromosome names of the BAM files match those of txdb. If they do, look at things in IGV or another genome browser.
 

ADD REPLYlink written 5.5 years ago by Devon Ryan93k

How do I check the chromosome names?

ADD REPLYlink written 5.5 years ago by Parham1.4k
1

From the command line: samtools view -H foo.bam (change foo.bam to the name of a BAM file). For that genome, the chromosome should be I, II, III, MT and AB325691.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Devon Ryan93k

Yes it looks it has the correct headers! What should I seek for in IGV? 

@HD    VN:1.0    SO:coordinate
@SQ    SN:AB325691    LN:20000
@SQ    SN:I    LN:5579133
@SQ    SN:II    LN:4539804
@SQ    SN:III    LN:2452883
@SQ    SN:MT    LN:19431
@SQ    SN:MTR    LN:20128
@PG    ID:TopHat    VN:2.0.9    CL:/usr/bin/tophat -p 8 -o tophat_out_tag4 /home/parham/pombe/annotations/Schizosaccharomyces_pombe_Ensembl_EF2/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/Bowtie2Index/genome /home/parham/pombe/uppmax_data/131128_SN866_0274_BC2N9CACXX/Sample_S0001_tag4/S0001_tag4_TGACCA_L004_R1_001.fastq

 

ADD REPLYlink written 5.5 years ago by Parham1.4k

I also checked IGV. However I didn't know what you exactly mean but I randomly checked some genes and their mapped reads, which looked fine! Is there any thing specific I have to look on?

ADD REPLYlink written 5.5 years ago by Parham1.4k

There's nothing specific. The goal it just to eye-ball a couple genes and see if they actually have reads mapping to them. If they do and the counts you're getting with summarizeOverlaps() to even remotely match, then something is going wrong at that step.
 

ADD REPLYlink written 5.5 years ago by Devon Ryan93k

Do you have any idea what is going wrong?! 

ADD REPLYlink written 5.5 years ago by Parham1.4k

Not without seeing the data. You can always post the BAM file somewhere and I or someone else can just have a look.

ADD REPLYlink written 5.5 years ago by Devon Ryan93k

All right! I can do that but I don't know where do people upload! If you could provide some links please! 

ADD REPLYlink written 5.5 years ago by Parham1.4k

Dropbox, google drive, just google for "large file send"

ADD REPLYlink written 5.5 years ago by Devon Ryan93k

Hi Devon,

I have been having troubles with getting gage working and was following this thread to sort the issue. My chromosomes have the wrong headers. Is there an easy way to change these? 

 

cheers in advance

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by justin.mitchell0

Hi Justin,

Please post things like this as new questions, they're easier to track and will receive quicker responses that way.

SummarizedExperiment objects inherit the seqlevels() accessor. So you can just change all of the names in one relatively simple command, like seqlevels(se) <- sprintf("chr%s", as.character(seqlevels(se))). Something along those lines should work for you.

ADD REPLYlink written 5.2 years ago by Devon Ryan93k
1

You have single-end data, so you need this

flag <- scanBamFlag(isNotPrimaryRead=FALSE)

instead of this

flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE)

Also, this

gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, single.end=FALSE, param=param)

should instead be this

gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, param=param)

 

ADD REPLYlink written 5.5 years ago by Devon Ryan93k

Thanks a million Devon! It is fine now!

ADD REPLYlink written 5.5 years ago by Parham1.4k

Devon may I ask how you looked up for single-end/pair-end in the BAM file please? 

ADD REPLYlink written 5.1 years ago by Parham1.4k

I don't recall exactly what I did 4 months ago. If I had the BAM file, then I just looked at the flag fields. If not, I see that the header in one of your previous comments has the tophat command used, and that's only using single-end reads, so perhaps I went off of that (though I vaguely remember just looking at the dataset).
 

ADD REPLYlink written 5.1 years ago by Devon Ryan93k

Yes, mine was a stupid question! I shared the BAM file here and you just checked something and said that I have single-end data! So do I have to convert BAM to SAM and check for flags?! If so is there any place explaining the flags? 

ADD REPLYlink written 5.1 years ago by Parham1.4k
1

Check here or the SAM specification if you really want more details.

ADD REPLYlink written 5.1 years ago by Devon Ryan93k
1

BTW, rep(c("experiment", "control"), each=1) is the same as c("experiment", "control").

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Devon Ryan93k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1620 users visited in the last hour