Question: For loop error for merging HTSeq counts
0
gravatar for Aynur
5 weeks ago by
Aynur10
Aynur10 wrote:

Hello,

Please help me with the following code. I start with what I have done and where did I get wrong.

I think there must be something wrong, so I am not getting my all HTSeq data merged into one.

As a beginner, I followed this tutorial

samples <- c("CON-1","CON-2","A-1","A-2", "IL6-1","IL6-2","LIF-1","LIF-2","OS-1","OS-2")  

# A function to read one of the count files produced by HTSeq

read.sample <- function( sample.name ) {
    file.name <- paste( sample.name, "_htseq_counts.txt", sep="" )
    result <- read.delim( file.name, col.names=c("gene", "count"), sep="\t", 
                          colClasses=c("character", "numeric" ) )
}

# Read the first sample
sample.1 <- read.sample(samples[1])

Then I got the expected result as

head(sample.1)
1 ENSMUSG00000000003      0
2 ENSMUSG00000000028    854
3 ENSMUSG00000000031 822918
4 ENSMUSG00000000037      1
5 ENSMUSG00000000049      3

nrow(sample.1)
55405

All these are correct.

# to make sure the first and second samples have the same number of rows and the same genes in each row
nrow(sample.1) == nrow(sample.2)
[1] TRUE

all(sample.1$gene == sample.2$gene)
[1] TRUE

Both are TRUE as expected.

# to merge 
all.data <- sample.1
all.data <- cbind(sample.1, sample.2$count)
for (c in 3:length(samples)) {
    temp.data <- read.sample(samples[c])
    all.data <- cbind(all.data, temp.data$count)
}

# all data merged    
head(all.data)
gene  count sample.2$count.   
1 ENSMUSG00000000003      0              0
2 ENSMUSG00000000028    854            937
3 ENSMUSG00000000031 822918          81745
4 ENSMUSG00000000037      1              2
5 ENSMUSG00000000049      3              3

So obviously the merging loop did not do the job. Please help me.

Please help me with for loop to merge. I can continue with DESeq2 afterward.

Thanks. For clarification, sorry for all the inconvience. I joined this forum few days ago, and it is been so helpful. I am also very new to this coding, and I will try to make a better post afterward. After discouraged from using the cuffdiff pipeline, I had to make a switch to STAR-HTSeq- DESeq2 pipeline.

sequencing rna-seq R • 206 views
ADD COMMENTlink modified 5 weeks ago by karl.stamm3.8k • written 5 weeks ago by Aynur10
1

Sorry, I don't know how to enter a table here, so my output table is shown as lines here.

Use the code button (the button with 101010) to format large code blocks, it can also be used to provide better formatting for tables. As far as I am aware, the current markdown parser does not have table support.

edit: by the way, the tutorial you are following is from 2014, it is a lifetime ago in terms of tools used and best practices. If I am not mistaken DESeq has been deprecated in favour of DESeq2, for example.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by h.mon31k

User pinged me on another thread. please use the 101010 button to improve the layout of your post.

ADD REPLYlink written 5 weeks ago by Kevin Blighe65k

Thank you for pointing out the old website, but I can continue with DESeq2. My problem is about looping, even before using DESeq. Can you please help me with my for loop to merge HTSeq into one file, then I can continue with DESeq2. From other tutorials, I could not create the DESeq2 object, but this one was easier to follow so far, but my for loop did not work. Please help.

ADD REPLYlink written 5 weeks ago by Aynur10
1

Your post is almost unreadable, as code, output and text are intermingled. I tried to clean up a bit, but I gave up, as the underlying post is so messy and I was afraid of messing up even more.

ADD REPLYlink written 5 weeks ago by h.mon31k

Sorry for the inconvenience. Thank you for pointing out. Let me try to make the post readable first.

ADD REPLYlink written 5 weeks ago by Aynur10

I just done some formatting, see if I didn't mess things up.

ADD REPLYlink written 5 weeks ago by h.mon31k

Ah I just did some too. Formatting was a freaking disaster.

ADD REPLYlink written 5 weeks ago by RamRS30k

Please stop editing your post. Add comments instead - you're butchering all the cleaning up that many others are doing.

I'm closing your post until it is proper and final.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by RamRS30k

Why you are not using featureCounts?

ADD REPLYlink written 5 weeks ago by Arup Ghosh2.7k

I just learned this HTSeq and made all files. Thats why.

ADD REPLYlink written 5 weeks ago by Aynur10
1

You spoke about HTSeq, then started with STAR, and then when someone was willing to hold your hand through the process of kallisto you switched to that and now you're back to HTSeq. Pick one workflow and stick to it please.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by RamRS30k

Hello akashagri19!

I'm closing this post until it is cleaned up. Once you're done adding all necessary content, let me know and I'll re-open the post.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by RamRS30k

Hello, Can you please open it now? Thanks.

ADD REPLYlink written 5 weeks ago by Aynur10
3
gravatar for karl.stamm
5 weeks ago by
karl.stamm3.8k
United States
karl.stamm3.8k wrote:

You've found a place where R does not behave like a 'normal' programming language. You're writing for another language and picking up similar R commands, but it doesn't work like that.

cbind() is a columnar bind, and will probably give you trouble when the rows are out of order or data is missing. Better to use merge(). cbind() looks like it is working because all your sample files are formatted exactly the same, but that won't be the case next time. merge() is safer because it looks for the same geneID. And has rules for filling in missing values or duplicating, your preference.

Your first problem is the for loop and what's called lexical scoping. here all.data is defined outside the loop, then reassigned inside a {} which creates a local variable akin to temp.data.

all.data <- cbind(sample.1, sample.2$count)
for (c in 3:length(samples)) {
    temp.data <- read.sample(samples[c])
    all.data <- cbind(all.data, temp.data$count)
}

So all.data is set once and then reset inside using the other 'all.data'. It's clearer if I rewrite what youve got like this:

x <- cbind(sample.1, sample.2$count)
for (i in 3:length(samples)) {
   z <- read.sample(samples[i])
   y <- cbind(x, z$count) 
}

You see, y never gets more than 2 columns, and x was never touched.

Two solutions. You could refer to the outer all.data by using the environment assignment operator or "<<-" to go up one level (see the assign() function). This is not good practice: you'll keep running into problems using this bandaid. Better to avoid the for() and fancy variable scoping by using vectorized functions:

all.data <- do.call(cbind, 
                    lapply(samples, read.sample) )

That's going to list-apply read.sample to 'samples' and get you a list of matrices, which is given to do.call to create a big wide cbind().

Be sure to check the help docs using "?" on each command I've mentioned please.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by karl.stamm3.8k
1

And if you want to install more third party packages, plyr and dplyr, you'll get access to "bind_cols" which does the same as my do.call construct above, but better. cbind still will fail when the rows arent perfectly matched but dplyr::bind_rows and bind_cols will intelligently create missing entries. Again, your preference for the behavior.

ADD REPLYlink written 5 weeks ago by karl.stamm3.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1138 users visited in the last hour