Question

htseq-counts output merge into one matrix ??

3

Entering edit mode

9.8 years ago

bioinforupesh2009.au ▴ 140

Dear all,

I just need a little help to merge my all features counts into one matrix. I have counted features using htseq-counts and now want to merge into one file like....

ID c1 c2 c3..............t1 t2 t2.......etc

The problem is that I have 36 sample and corresponding counts so I am lazy to write ...

paste *.txt | cut -d -f1,2,4,6,8...............................72 #(because each file contain two column i.e ID and counts)

and moreover, apart from laziness, one problem also happened here that while pasting its not in sequence like:

paste 1.txt 2.txt 3.txt............

Help me please...

rna-seq • 24k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by bioinforupesh2009.au ▴ 140

2

Entering edit mode

Finally got a solution to this problem

#!/bin/bash
FILES=$(ls -t -v *.txt | tr '\n' ' ');
awk 'NF > 1{ a[$1] = a[$1]"\t"$2} END {for( I in a ) print I a[i]}' $FILES > merged.tmp

This woks like you asked.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by EagleEye 7.6k

0

Entering edit mode

Thank you man for your kind reply but I have already done, I am sorry for the delay. Was just needed to change the file name like 1.txt -> 01.txt, 2.txt -> 02.txt....................... etc and after I used your suggested awk code and worked fine.

By the way thanks a million.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by bioinforupesh2009.au ▴ 140

Ram · Answer 1 · 2014-10-22

3

Entering edit mode

9.8 years ago

EagleEye 7.6k

This could be the simple way for as many as files:

awk 'NF > 1{ a[$1] = a[$1]"\t"$2} END {for( I in a ) print I a[i]}' read_counts/*.gene_counts.htseq > merged.htseq

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by EagleEye 7.6k

1

Entering edit mode

This outputs: Merge htseq files into one based on geneNames. Sample output

ENSG00000251243.1    0    0    0    0    0
ENSG00000240637.2    0    0    0    0    0
ENSG00000227366.1    0    0    0    0    0
ENSG00000104903.4    665    216    884    3254    689
ENSG00000228626.1    0    0    0    0    1

Forgot to mention, the order of pasted column will be in FileName order.

Like if you do listing using unix command ls -l read_counts/

You will get the order in which it has merged.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by EagleEye 7.6k

0

Entering edit mode

Dear Santhilal,

Your code is working as mine too...but the main problem what I mentioned on question is.......ITS NOT IN SEQUENCE.

I NEED...

ID     C1 C1 C3.............UPTO 36

your code or mine work as terminal shows the directory by hit of ls

10.txt
11.txt
31.txt
12.txt
13.txt
4.txt
3.txt
1.txt
20.txt
...
....
....
36.txt

However, I tried by

for I in $(seq 1 36); do awk 'NF > 1{ a[$1] = a[$1]"\t"$2} END {for( I in a ) print I a[i]}' $i.txt > merged.htseq; done

but its not working?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by bioinforupesh2009.au ▴ 140

0

Entering edit mode

you can try this

ls -l -v | awk '{print $9}'

and then AWK. I don't know whether it will work. But you can try it.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by EagleEye 7.6k

0

Entering edit mode

Thanks but while using this only its work but not with advised final awk code.

I used...but not worked for me.

ls -l -v | awk '{print $9}' | awk 'NF > 1{ a[$1] = a[$1]"\t"$2} END {for( I in a ) print I a[i]}' *.txt > merged.htseq

however, I have done by my way

Thanks anyway....very kind of you. :)

by the way

Wishing you a very happy and prosperous Diwali!!

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by bioinforupesh2009.au ▴ 140

0

Entering edit mode

Could please provide your working solution?

ADD REPLY • link 6.9 years ago by Ric ▴ 430

score 3 · Answer 2 · 2018-06-07

3

Entering edit mode

6.1 years ago

Arindam Ghosh ▴ 520

I managed to do this by R. The script is available here. It works for me. Please try and check if it works fine.

ADD COMMENT • link 6.1 years ago by Arindam Ghosh ▴ 520

Ram · Answer 3 · 2014-10-22

1

Entering edit mode

9.8 years ago

Devon Ryan 104k

I wrote this a while ago. It can also change the file labels.

Note that it's so easy to just do this in R that there's really little reason to manually merge files like this.

ADD COMMENT • link 9.8 years ago by Devon Ryan 104k

1

Entering edit mode

and if you use DESeq2 you can simply give the file names in input ( see DESeq2 vignette 1.2.4 HTseq input )

ADD REPLY • link 9.8 years ago by Nicolas Rosewick 11k

1

Entering edit mode

Exactly, and even if you're using something like edgeR, it's simple enough to just use the same method employed by DESeq2 (namely, lapply() reading files and then lapply a function to get the counts out).

ADD REPLY • link 9.8 years ago by Devon Ryan 104k

1

Entering edit mode

i prefer first by terminal or bash.

Thanks by the for your valuable answer.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.8 years ago by bioinforupesh2009.au ▴ 140

score 0 · Answer 4 · 2016-07-28

0

Entering edit mode

8.0 years ago

bmahesh07 • 0

I found a useful R script (htseq-combine_all.R script) for merging HT-seq counts at the following resource http://wiki.bits.vib.be/index.php/NGS_RNASeq_DE_Exercise.4

ADD COMMENT • link 8.0 years ago by bmahesh07 • 0

0

Entering edit mode

I tried the the above R script (htseq-combine_all.R script) but I ran into an error

ADD REPLY • link 6.9 years ago by Ric ▴ 430

0

Entering edit mode

You can use my R script on https://github.com/AAlhendi1707/htseq-merge/blob/master/htseq-merge_all.R

ADD REPLY • link 5.4 years ago by Ahmed Alhendi ▴ 240

score 0 · Answer 5 · 2019-02-08

0

Entering edit mode

5.5 years ago

oriolebaltimore ▴ 190

Is it considered best practice to sum up reads across replicates or to run HTSeq on merged BAM File after ccombing fastqs. Thanks Adrian

ADD COMMENT • link 5.5 years ago by oriolebaltimore ▴ 190

1

Entering edit mode

No, you should use hts-seq on each replicates otherwise there is no meaning of replicates. By the way, you could use "featureCounts tool" for each bam at a time. Then, further you don't need to merge as its give you a matrix of count for each sample of bam.

Hope this helps Good luck.

ADD REPLY • link 5.5 years ago by unique379 ▴ 120

score 0 · Answer 6 · 2019-03-20

0

Entering edit mode

5.4 years ago

Ahmed Alhendi ▴ 240

This is easily done using R script! You can use my R script on https://github.com/AAlhendi1707/htseq-merge/blob/master/htseq-merge_all.R

For more details about how to generate and combine HTSeq outputs into one read count matrix in R https://ahmedalhendi0.wordpress.com/2019/03/20/combine-htseq-outputs-into-one-read-count-matrix-in-r/

ADD COMMENT • link 5.4 years ago by Ahmed Alhendi ▴ 240

0

Entering edit mode

Hello, I followed your script and ran it on Linux, but I got an error message. My input files were in the same directory where I run R script, but it said cant read files. How can I fix it?

Thx .

ADD REPLY • link 4.0 years ago by Aynur ▴ 60

0

Entering edit mode

Recheck on path and file format.

ADD REPLY • link 4.0 years ago by Arindam Ghosh ▴ 520

0

Entering edit mode

Hello Dr.Alhendi, I am totally new to this bioinformatics, so please bear with my stupidity here, if any. Thanks. I did check again. Here is how I used the Rscript. 1. I could not use it on Mac OS terminal. 2. I cant use it on Mac OS R. 3. I tried to use it on Linux cluster as follows.

`Rscript htseq-merge_all.R /kuacc/users/akashgari19/HTSeq  HTSeq_Merged.txt`

R script is located on the same directory. My HTSeq out files are located on this directory:

`/kuacc/users/akashgari19/HTSeq`

My files are named below : CON-1_HTSeq.txt treatment-1_HTSeq.txt ....... total 10 files.

`head CON-1_HTSeq.txt.                                                                     ENSMUSG00000000001   3141 
 ENSMUSG00000000003 0.                                                                  ENSMUSG00000000028  854

ENSMUSG00000000031 822918 ENSMUSG00000000037 1 ENSMUSG00000000049 3 ENSMUSG00000000056 3897 ENSMUSG00000000058 1399 ENSMUSG00000000078 1766 ENSMUSG00000000085 706` When I run the script I got the following errors. [1] "############################" [1] "## Files to be merged are: ##" "CON-1_HTSeq.txt" "CON-2_HTSeq.txt" ........

file read START ######### Aug_12_2020_13_18_31_+03" Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file '/kuacc/users/akashgari19/HTSeq/CON-1_HTSeq.cnt': No such file or directory. Execution halted

Yes, I do not have files with .cnt extension.
Please help me, how can I solve this. Thanks.

ADD REPLY • link 4.0 years ago by Aynur ▴ 60

score 0 · Answer 7 · 2020-06-05

I decided to revive the thread adding my short tidiverse-based solution in R. Assuming we are in R session in the directory with htseq-counts files:

library(tidyverse)
counts_files_pattern = "\\.counts\\.txt$"
counts_table <-
  list.files(pattern = counts_files_pattern) %>%
  #apply an import function to the listed files
  map(function(x) { 
        print(x)
        read_table2(x,
                #rename second column in accordance with the file name
                col_names = c("ID", x))
      }
      ) %>%
  #merge all tables into one
  reduce(left_join, by = "ID") %>% 
  # clean up column names
  rename_all(gsub,
             pattern = counts_files_pattern,
             replacement = '') %>%
  #move ID column to row names
  column_to_rownames("ID")

score 0 · Answer 8 · 2021-08-27

This is the shortest code I can possibly write. It removes those feature descriptions of HTSeq counts and imports a neat and clean matrix from combining all the samples.

library(dplyr)
path <- "C:/Users/Rohit Satyam/Desktop/RNASeq workshop/backup/"
files <- list.files(path,".txt", full.names = T)
## making a list of data frames with row and column names while removing last five lines from HTSeq that contains the features and ambiguous read infromation
dfList <- lapply(files, function(x) {
  read.csv(x, nrows = length(count.fields(x))-5, sep = "\t", header = F, row.names = 1, col.names = c("genes",tools::file_path_sans_ext(basename(x))))})
## combining all the dataframes into single dataframe
matrix <- bind_cols(dfList)