Question

For loop error for merging HTSeq counts

0

Entering edit mode

3.7 years ago

Aynur ▴ 60

Hello,

Please help me with the following code. I start with what I have done and where did I get wrong.

I think there must be something wrong, so I am not getting my all HTSeq data merged into one.

As a beginner, I followed this tutorial

samples <- c("CON-1","CON-2","A-1","A-2", "IL6-1","IL6-2","LIF-1","LIF-2","OS-1","OS-2")  

# A function to read one of the count files produced by HTSeq

read.sample <- function( sample.name ) {
    file.name <- paste( sample.name, "_htseq_counts.txt", sep="" )
    result <- read.delim( file.name, col.names=c("gene", "count"), sep="\t", 
                          colClasses=c("character", "numeric" ) )
}

# Read the first sample
sample.1 <- read.sample(samples[1])

Then I got the expected result as

head(sample.1)
1 ENSMUSG00000000003      0
2 ENSMUSG00000000028    854
3 ENSMUSG00000000031 822918
4 ENSMUSG00000000037      1
5 ENSMUSG00000000049      3

nrow(sample.1)
55405

All these are correct.

# to make sure the first and second samples have the same number of rows and the same genes in each row
nrow(sample.1) == nrow(sample.2)
[1] TRUE

all(sample.1$gene == sample.2$gene)
[1] TRUE

Both are TRUE as expected.

# to merge 
all.data <- sample.1
all.data <- cbind(sample.1, sample.2$count)
for (c in 3:length(samples)) {
    temp.data <- read.sample(samples[c])
    all.data <- cbind(all.data, temp.data$count)
}

# all data merged    
head(all.data)
gene  count sample.2$count.   
1 ENSMUSG00000000003      0              0
2 ENSMUSG00000000028    854            937
3 ENSMUSG00000000031 822918          81745
4 ENSMUSG00000000037      1              2
5 ENSMUSG00000000049      3              3

So obviously the merging loop did not do the job. Please help me.

Please help me with for loop to merge. I can continue with DESeq2 afterward.

Thanks. For clarification, sorry for all the inconvience. I joined this forum few days ago, and it is been so helpful. I am also very new to this coding, and I will try to make a better post afterward. After discouraged from using the cuffdiff pipeline, I had to make a switch to STAR-HTSeq- DESeq2 pipeline.

R RNA-Seq rna-seq sequencing • 1.7k views

ADD COMMENT • link updated 3.7 years ago by karl.stamm 4.1k • written 3.7 years ago by Aynur ▴ 60

1

Entering edit mode

Sorry, I don't know how to enter a table here, so my output table is shown as lines here.

Use the code button (the button with 101010) to format large code blocks, it can also be used to provide better formatting for tables. As far as I am aware, the current markdown parser does not have table support.

edit: by the way, the tutorial you are following is from 2014, it is a lifetime ago in terms of tools used and best practices. If I am not mistaken DESeq has been deprecated in favour of DESeq2, for example.

ADD REPLY • link 3.7 years ago by h.mon 35k

0

Entering edit mode

User pinged me on another thread. please use the 101010 button to improve the layout of your post.

ADD REPLY • link 3.7 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you for pointing out the old website, but I can continue with DESeq2. My problem is about looping, even before using DESeq. Can you please help me with my for loop to merge HTSeq into one file, then I can continue with DESeq2. From other tutorials, I could not create the DESeq2 object, but this one was easier to follow so far, but my for loop did not work. Please help.

ADD REPLY • link 3.7 years ago by Aynur ▴ 60

1

Entering edit mode

Your post is almost unreadable, as code, output and text are intermingled. I tried to clean up a bit, but I gave up, as the underlying post is so messy and I was afraid of messing up even more.

ADD REPLY • link 3.7 years ago by h.mon 35k

0

Entering edit mode

Sorry for the inconvenience. Thank you for pointing out. Let me try to make the post readable first.

ADD REPLY • link 3.7 years ago by Aynur ▴ 60

0

Entering edit mode

I just done some formatting, see if I didn't mess things up.

ADD REPLY • link 3.7 years ago by h.mon 35k

0

Entering edit mode

Ah I just did some too. Formatting was a freaking disaster.

ADD REPLY • link 3.7 years ago by Ram 43k

0

Entering edit mode

Please stop editing your post. Add comments instead - you're butchering all the cleaning up that many others are doing.

I'm closing your post until it is proper and final.

ADD REPLY • link 3.7 years ago by Ram 43k

0

Entering edit mode

Why you are not using featureCounts?

ADD REPLY • link 3.7 years ago by Arup Ghosh 3.2k

0

Entering edit mode

I just learned this HTSeq and made all files. Thats why.

ADD REPLY • link 3.7 years ago by Aynur ▴ 60

1

Entering edit mode

You spoke about HTSeq, then started with STAR, and then when someone was willing to hold your hand through the process of kallisto you switched to that and now you're back to HTSeq. Pick one workflow and stick to it please.

ADD REPLY • link 3.7 years ago by Ram 43k

0

Entering edit mode

Hello akashagri19!

I'm closing this post until it is cleaned up. Once you're done adding all necessary content, let me know and I'll re-open the post.

ADD REPLY • link 3.7 years ago by Ram 43k

0

Entering edit mode

Hello, Can you please open it now? Thanks.

ADD REPLY • link 3.7 years ago by Aynur ▴ 60

score 3 · Answer 1 · 2020-08-13

You've found a place where R does not behave like a 'normal' programming language. You're writing for another language and picking up similar R commands, but it doesn't work like that.

cbind() is a columnar bind, and will probably give you trouble when the rows are out of order or data is missing. Better to use merge(). cbind() looks like it is working because all your sample files are formatted exactly the same, but that won't be the case next time. merge() is safer because it looks for the same geneID. And has rules for filling in missing values or duplicating, your preference.

Your first problem is the for loop and what's called lexical scoping. here all.data is defined outside the loop, then reassigned inside a {} which creates a local variable akin to temp.data.

all.data <- cbind(sample.1, sample.2$count)
for (c in 3:length(samples)) {
    temp.data <- read.sample(samples[c])
    all.data <- cbind(all.data, temp.data$count)
}

So all.data is set once and then reset inside using the other 'all.data'. It's clearer if I rewrite what youve got like this:

x <- cbind(sample.1, sample.2$count)
for (i in 3:length(samples)) {
   z <- read.sample(samples[i])
   y <- cbind(x, z$count) 
}

You see, y never gets more than 2 columns, and x was never touched.

Two solutions. You could refer to the outer all.data by using the environment assignment operator or "<<-" to go up one level (see the assign() function). This is not good practice: you'll keep running into problems using this bandaid. Better to avoid the for() and fancy variable scoping by using vectorized functions:

all.data <- do.call(cbind, 
                    lapply(samples, read.sample) )

That's going to list-apply read.sample to 'samples' and get you a list of matrices, which is given to do.call to create a big wide cbind().

Be sure to check the help docs using "?" on each command I've mentioned please.