Create table of gene expression for multiple samples
2
0
Entering edit mode
2.1 years ago

I have 60 samples and my PI wants a table (excel spreadsheet) containing all expressed genes in 60 samples. I have used hisat and stringtie. I don't know how to create such a table unless I code something and my coding skills are rudimentary. Or that I manually extract the expression values and paste them into an excel file 60 times using vlookup.

Is there a tool that I can use for this?

Additional information: I have geneabundance.tab files and .gtf files from stringtie.

Geneabundance.tab

Gene ID Gene Name   Reference   Strand  Start   End Coverage    FPKM    TPM
ENSG00000228794.9   LINC01128   chr1    +   860226  868202  0.000000    0.000000    0.000000
ENSG00000230368.2   FAM41C  chr1    -   868071  876903  0.000000    0.000000    0.000000
ENSG00000234711.1   TUBB8P11    chr1    +   873292  874349  0.108696    0.021078    0.030533


.gtf file

chr1    HAVANA  transcript  860226  866720  .   +   .   gene_id "ENSG00000228794.9"; transcript_id "ENST00000671208.1"; ref_gene_name "LINC01128"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    HAVANA  exon    860226  861273  .   +   .   gene_id "ENSG00000228794.9"; transcript_id "ENST00000671208.1"; exon_number "1"; ref_gene_name "LINC01128"; cov "0.0";

RNA-Seq stringtie hisat • 900 views
0
Entering edit mode

You will need to tell us what format the data you currently have is.

0
Entering edit mode

I have geneabundance.tab and .gtf files from stringtie (updated in original post)

0
Entering edit mode

Just show us a snippet of each file. The files should be tabular/delimited anyway most likely. What are you defining as 'expressed' in this data?

0
Entering edit mode

Hopefully it is clear, there aren't commas in the real file, added them to make it a bit more clearer. By expressed the PI wants anything 0 and above (so all genes, sorry for the confusion).

0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better.

You should not need to add anything that is not present in the original data in that case.

0
Entering edit mode

Thank you, fixed it up now

0
Entering edit mode

You can either use featurecounts or multiBamSummary from deeptools to generate the gene expression table.

0
Entering edit mode

Thanks, I'll give it a crack

0
Entering edit mode

Are all the tab files in the same folder? If not, how are the files stored? It's important to be specific since the code would need to loop through the files.

0
Entering edit mode

I have both, I thought that having each file in the same directory might make it easier. I'll be sure to clarify in future posts.

0
Entering edit mode

See if the answer here is of any help:

A: Unable to create file for tximport

2
Entering edit mode
2.1 years ago

In R, some general guidance:

## 1,obtain a list of all files

myfiles <- list.files(...)


## 2, use lapply() to read all files into a list

mydata <- lapply(myfiles, function(x) {
read.table(x, sep = '\t', dec = '.',
header = TRUE, row.names = 1, stringsAsFactors = FALSE)
})
names(mydata) <- myfiles


## 3, summarise the data into a single matrix

mydata_tpm <- lapply(mydata, function(x) data.frame(TPM = x[['TPM']]))
mat <- do.call(cbind, mydata_tpm)


I trust that you can fill in everything else.

Kevin

0
Entering edit mode

By the way, you can pass arguments that would go to the function (argument: FUN) directly to lapply:

lapply(myfiles, read.table, sep = '\t', dec = '.', header = TRUE, row.names = 1, stringsAsFactors = FALSE)

0
Entering edit mode

Didn't know that - thanks!

0
Entering edit mode

Thank you! I have given it a try, however I have run into 2 issues.

Firstly, when running the below I get duplicate 'row.names' are not allowed and I think the this comes down to row.names=1. I checked the files to see if there were any duplicate Gene IDs and there weren't (cat file.tab | cut -d \$'\t' -f 1 | sort | uniq -d). I changed it to row.names=NULL and it worked. I think the issue is because there's an extra column in the header or first row.

mydata <- lapply(myfiles, function(x) {read.table(x, sep = '\t', dec = '.', header = TRUE, row.names = 1, stringsAsFactors = FALSE)})


Secondly, mat <- do.call(cbind, mydata_tpm) produces the below error.

Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 60625, 60622, 60621, 60620, 60626, 60624, 60618, 60623, 60619, 60627


After having a look at the number of lines in each geneabundance.tab file shows that they have different number of lines. Additionally, I assume that since there are different number of genes then there are different genes in each .tab file. I will have to account for that then somehow.

Sorry if this is frazzled, I have spent the last couple hours trying to find the solution.

2
Entering edit mode
2.1 years ago

Try this. It accounts for the files having different numbers of rows.

library("data.table")
library("purrr")

myfiles <- setNames(list.files(...), basename(list.files(...))

mydata <- imap(myfiles, function(x, y) {
setnames(DT, old="TPM", new=str_c("TPM_", y))
DT[, c("Gene ID", "Reference", "Strand", "Start", "End", "Coverage") := NULL]
return(copy(DT))
})

mydata <- reduce(mydata, merge, by="Gene Name", all=TRUE)

0
Entering edit mode

This works as well! Thank you.

I had to change mydata to myfiles.

mydata <- imap(myfiles, function(x, y) {