Question: Creating Count Matrix
lamia_203 wrote:

I have to create a count table with:

Gene ID   Sample1  Sample2  Sample3
ENSG...   297      0        0

So far I have one file per sample. Each file contains four columns: the gene ID, followed by three columns of read counts. I want to take the second column from each sample's file and build a count matrix in which every column is labeled with its sample name. How can I make sure each column carries the name of its respective sample?

Thanks

linux rna-seq • 3.2k views
modified 21 months ago by ahaswer • written 21 months ago by lamia_203

What language or method would you like to use? This can be done in the shell, R, Python, Perl, and so on. What are you familiar with?

written 21 months ago by seidel

Sorry I didn't specify. I'm familiar with R and Linux, but I would prefer to build the count table in Linux since there are a lot of samples (about 3200).

written 21 months ago by lamia_203

seidel wrote:

One suggestion using R would be:

# only return file names with a given pattern
dir(pattern="ReadsPerGene.out.tab")

# save the results to a variable
files <- dir(pattern="ReadsPerGene.out.tab")

# read each file and append its second column (the counts) to the matrix
counts <- c()
for( i in seq_along(files) ){
  x <- read.table(file=files[i], sep="\t", header=FALSE, as.is=TRUE)
  counts <- cbind(counts, x[,2])
}

# set the row names from the gene IDs of the last file read
# (assumes every file lists the same genes in the same order)
rownames(counts) <- x[,1]
# set the column names based on input file names, with pattern removed
colnames(counts) <- sub("_ReadsPerGene.out.tab","",files, fixed=TRUE)

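This example assumes each sample's results are in a file whose name contains ReadsPerGene.out.tab, as produced by the STAR aligner. Note that STAR's files begin with four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous), which this code keeps as rows of the matrix.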

modified 21 months ago • written 21 months ago by seidel

ahaswer wrote:

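If you are using Linux, you can also use paste and awk in a terminal, like so:

paste sample1 sample2 sample3 | awk 'BEGIN {OFS="\t"; FS="\t"} {print $1, $2, $4, $6}' > count_matrix

This concatenates the listed tables horizontally and extracts the chosen columns. It works for tab-delimited files; if the delimiter is different, set it in OFS and FS. The fields $2, $4, $6 are the count columns when each file has two columns (gene ID and count); with four columns per file, the second column of each sample sits at $2, $6, $10 instead.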
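To see which fields the command picks, here is a minimal sketch on made-up data. It assumes each sample file has exactly two tab-separated columns (gene ID and count); the gene IDs and counts are hypothetical, and everything runs in a scratch directory.

```shell
# Work in a scratch directory so nothing real is overwritten (hypothetical data)
tmp=$(mktemp -d)
cd "$tmp"

# Three two-column sample files: gene ID <TAB> count
printf 'geneA\t297\ngeneB\t5\n' > sample1
printf 'geneA\t0\ngeneB\t7\n'   > sample2
printf 'geneA\t0\ngeneB\t9\n'   > sample3

# paste joins the files line by line, giving six fields per row:
#   $1=ID  $2=count1  $3=ID  $4=count2  $5=ID  $6=count3
paste sample1 sample2 sample3 |
  awk 'BEGIN {OFS="\t"; FS="\t"} {print $1, $2, $4, $6}' > count_matrix

cat count_matrix
```

Each output row is the gene ID from the first file followed by one count per sample. Nothing checks that the gene IDs in $3 and $5 match $1, so the files must list the same genes in the same order.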

written 21 months ago by ahaswer

Since you have a lot of samples (I guess you keep them in a separate directory), it is much more convenient not to list the desired columns by hand. Instead, open a terminal in the samples directory and run the following (assuming the directory contains only the sample files):

paste * | awk 'BEGIN {OFS="\t"; FS="\t"}; {j=$1; for (i=2;i<=NF;i+=2) {j=j FS $i} print j}' > count_matrix

It works like the command above, without having to list tons of columns ;)
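Neither command writes a header row, so the sample names asked for in the question are not in the output. A minimal sketch of one way to add them, assuming the sample files end in .txt (a hypothetical naming scheme), have two tab-separated columns each, and list the genes in the same order:

```shell
# Scratch directory with two made-up two-column sample files
tmp=$(mktemp -d)
cd "$tmp"
printf 'geneA\t297\ngeneB\t5\n' > s1.txt
printf 'geneA\t0\ngeneB\t7\n'   > s2.txt

{
  # Header: "Gene ID", then each file name with its .txt extension removed
  printf 'Gene ID'
  for f in *.txt; do printf '\t%s' "${f%.txt}"; done
  printf '\n'
  # Body: gene ID from the first file, then every second field (the counts)
  paste *.txt |
    awk 'BEGIN {OFS="\t"; FS="\t"} {j=$1; for (i=2;i<=NF;i+=2) {j=j FS $i} print j}'
} > count_matrix

cat count_matrix
```

The shell expands *.txt in the same order for the for loop and for paste, so the header names line up with the count columns. Because the output file does not end in .txt, it is not picked up if the command is run again.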

written 21 months ago by ahaswer

This does use all the samples, thank you. How does the for loop over i work? When I ran this code, count_matrix contained two columns for each sample rather than one.

Thanks

modified 21 months ago • written 21 months ago by lamia_203

Maybe there are duplicate files in the directory, or you ran the command twice (on a second run, * also matches the count_matrix file written by the first run)? Also check that the columns inside the sample files are separated by a single delimiter, not repeated ones. You can also make paste more selective if, for instance, all sample files share the same extension:

paste *.txt | awk 'BEGIN {OFS="\t"; FS="\t"}; {j=$1; for (i=2;i<=NF;i+=2) {j=j FS $i} print j}' > count_matrix

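The loop: i starts at the second field and, as long as i is less than or equal to the number of fields NF (i.e. columns), each iteration appends field i to j, preceded by the field separator, and then increases i by 2. The step of 2 assumes each file has two columns (gene ID and count). With four columns per file, as described in the question, a step of 2 picks two columns per sample (fields 2 and 4 of each file); use i+=4 to keep only the second column of each sample.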
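The step size in the loop has to match the number of columns per file. A minimal sketch on made-up four-column files (gene ID plus three count columns, as described in the question) comparing a step of 2 with a step of 4; the helper name pick, the awk variable step, and all data are hypothetical:

```shell
# Scratch directory with two made-up four-column sample files
tmp=$(mktemp -d)
cd "$tmp"
printf 'geneA\t10\t11\t12\n' > s1.txt
printf 'geneA\t20\t21\t22\n' > s2.txt

# pick N: keep field 1, then field 2 and every N-th field after it
pick() {
  paste *.txt |
    awk -v step="$1" 'BEGIN {OFS="\t"; FS="\t"} {j=$1; for (i=2;i<=NF;i+=step) {j=j FS $i} print j}'
}

step2=$(pick 2)   # fields 2, 4, 6, 8: two columns per sample
step4=$(pick 4)   # fields 2, 6: only the second column of each sample

printf 'step 2: %s\nstep 4: %s\n' "$step2" "$step4"
```

With a step of 2 each sample contributes its second and fourth columns, which matches the two columns per sample reported above; a step of 4 leaves one column per sample.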

modified 21 months ago • written 21 months ago by ahaswer