Question: Create table of gene expression for multiple samples
0
gravatar for microbiotaiota
6 weeks ago by
microbiotaiota20 wrote:

I have 60 samples and my PI wants a table (excel spreadsheet) containing all expressed genes in 60 samples. I have used hisat and stringtie. I don't know how to create such a table unless I code something and my coding skills are rudimentary. Or that I manually extract the expression values and paste them into an excel file 60 times using vlookup.

Is there a tool that I can use for this?

Additional information: I have geneabundance.tab files and .gtf files from stringtie.

Geneabundance.tab

Gene ID Gene Name   Reference   Strand  Start   End Coverage    FPKM    TPM
ENSG00000228794.9   LINC01128   chr1    +   860226  868202  0.000000    0.000000    0.000000
ENSG00000230368.2   FAM41C  chr1    -   868071  876903  0.000000    0.000000    0.000000
ENSG00000234711.1   TUBB8P11    chr1    +   873292  874349  0.108696    0.021078    0.030533

.gtf file

chr1    HAVANA  transcript  860226  866720  .   +   .   gene_id "ENSG00000228794.9"; transcript_id "ENST00000671208.1"; ref_gene_name "LINC01128"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    HAVANA  exon    860226  861273  .   +   .   gene_id "ENSG00000228794.9"; transcript_id "ENST00000671208.1"; exon_number "1"; ref_gene_name "LINC01128"; cov "0.0";
rna-seq stringtie hisat • 263 views
ADD COMMENTlink modified 6 weeks ago by Kevin Blighe65k • written 6 weeks ago by microbiotaiota20

You will need to tell us what format the data you currently have is.

ADD REPLYlink written 6 weeks ago by Joe17k

I have geneabundance.tab and .gtf files from stringtie (updated in original post)

ADD REPLYlink written 6 weeks ago by microbiotaiota20

Just show us a snippet of each file. The files should be tabular/delimited anyway most likely. What are you defining as 'expressed' in this data?

ADD REPLYlink written 6 weeks ago by Joe17k

Hopefully it is clear, there aren't commas in the real file, added them to make it a bit more clearer. By expressed the PI wants anything 0 and above (so all genes, sorry for the confusion).

ADD REPLYlink written 6 weeks ago by microbiotaiota20

Please use the formatting bar (especially the code option) to present your post better.
code_formatting

You should not need to add anything that is not present in the original data in that case.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by genomax89k

Thank you, fixed it up now

ADD REPLYlink written 6 weeks ago by microbiotaiota20

You can either use featurecounts or multiBamSummary from deeptools to generate the gene expression table.

ADD REPLYlink written 6 weeks ago by Gjain5.5k

Thanks, I'll give it a crack

ADD REPLYlink written 6 weeks ago by microbiotaiota20

Are all the tab files in the same folder? If not, how are the files stored? It's important to be specific since the code would need to loop through the files.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by rpolicastro1.7k

I have both, I thought that having each file in the same directory might make it easier. I'll be sure to clarify in future posts.

ADD REPLYlink written 6 weeks ago by microbiotaiota20

See if the answer here is of any help:

A: Unable to create file for tximport

ADD REPLYlink written 6 weeks ago by h.mon31k
2
gravatar for Kevin Blighe
6 weeks ago by
Kevin Blighe65k
Kevin Blighe65k wrote:

In R, some general guidance:

1,obtain a list of all files

myfiles <- list.files(...)

2, use lapply() to read all files into a list

mydata <- lapply(myfiles, function(x) {
  read.table(x, sep = '\t', dec = '.',
    header = TRUE, row.names = 1, stringsAsFactors = FALSE)
})
names(mydata) <- myfiles

3, summarise the data into a single matrix

mydata_tpm <- lapply(mydata, function(x) data.frame(TPM = x[['TPM']]))
mat <- do.call(cbind, mydata_tpm)

I trust that you can fill in everything else.

Kevin

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by Kevin Blighe65k

By the way, you can pass arguments that would go to the function (argument: FUN) directly to lapply:

lapply(myfiles, read.table, sep = '\t', dec = '.', header = TRUE, row.names = 1, stringsAsFactors = FALSE)
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by RamRS30k

Didn't know that - thanks!

ADD REPLYlink written 6 weeks ago by Kevin Blighe65k

Thank you! I have given it a try, however I have run into 2 issues.

Firstly, when running the below I get duplicate 'row.names' are not allowed and I think the this comes down to row.names=1. I checked the files to see if there were any duplicate Gene IDs and there weren't (cat file.tab | cut -d $'\t' -f 1 | sort | uniq -d). I changed it to row.names=NULL and it worked. I think the issue is because there's an extra column in the header or first row.

mydata <- lapply(myfiles, function(x) {read.table(x, sep = '\t', dec = '.', header = TRUE, row.names = 1, stringsAsFactors = FALSE)})

Secondly, mat <- do.call(cbind, mydata_tpm) produces the below error.

Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 60625, 60622, 60621, 60620, 60626, 60624, 60618, 60623, 60619, 60627

After having a look at the number of lines in each geneabundance.tab file shows that they have different number of lines. Additionally, I assume that since there are different number of genes then there are different genes in each .tab file. I will have to account for that then somehow.

Sorry if this is frazzled, I have spent the last couple hours trying to find the solution.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by microbiotaiota20
2
gravatar for rpolicastro
6 weeks ago by
rpolicastro1.7k
rpolicastro1.7k wrote:

Try this. It accounts for the files having different numbers of rows.

library("data.table")
library("purrr")

myfiles <- setNames(list.files(...), basename(list.files(...))

mydata <- imap(myfiles, function(x, y) {
  DT <- fread(x)
  setnames(DT, old="TPM", new=str_c("TPM_", y))
  DT[, c("Gene ID", "Reference", "Strand", "Start", "End", "Coverage") := NULL]
  return(copy(DT))
})

mydata <- reduce(mydata, merge, by="Gene Name", all=TRUE)
ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by rpolicastro1.7k

This works as well! Thank you.

I had to change mydata to myfiles.

mydata <- imap(myfiles, function(x, y) {
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by microbiotaiota20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1127 users visited in the last hour