Create table of gene expression for multiple samples
2
0
Entering edit mode
8 months ago

I have 60 samples and my PI wants a table (excel spreadsheet) containing all expressed genes in 60 samples. I have used hisat and stringtie. I don't know how to create such a table unless I code something and my coding skills are rudimentary. Or that I manually extract the expression values and paste them into an excel file 60 times using vlookup.

Is there a tool that I can use for this?

Additional information: I have geneabundance.tab files and .gtf files from stringtie.

Geneabundance.tab

Gene ID Gene Name   Reference   Strand  Start   End Coverage    FPKM    TPM
ENSG00000228794.9   LINC01128   chr1    +   860226  868202  0.000000    0.000000    0.000000
ENSG00000230368.2   FAM41C  chr1    -   868071  876903  0.000000    0.000000    0.000000
ENSG00000234711.1   TUBB8P11    chr1    +   873292  874349  0.108696    0.021078    0.030533

.gtf file

chr1    HAVANA  transcript  860226  866720  .   +   .   gene_id "ENSG00000228794.9"; transcript_id "ENST00000671208.1"; ref_gene_name "LINC01128"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    HAVANA  exon    860226  861273  .   +   .   gene_id "ENSG00000228794.9"; transcript_id "ENST00000671208.1"; exon_number "1"; ref_gene_name "LINC01128"; cov "0.0";
RNA-Seq stringtie hisat • 434 views
ADD COMMENT
0
Entering edit mode

You will need to tell us what format the data you currently have is.

ADD REPLY
0
Entering edit mode

I have geneabundance.tab and .gtf files from stringtie (updated in original post)

ADD REPLY
0
Entering edit mode

Just show us a snippet of each file. The files should be tabular/delimited anyway most likely. What are you defining as 'expressed' in this data?

ADD REPLY
0
Entering edit mode

Hopefully it is clear, there aren't commas in the real file, added them to make it a bit more clearer. By expressed the PI wants anything 0 and above (so all genes, sorry for the confusion).

ADD REPLY
0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better.
code_formatting

You should not need to add anything that is not present in the original data in that case.

ADD REPLY
0
Entering edit mode

Thank you, fixed it up now

ADD REPLY
0
Entering edit mode

You can either use featurecounts or multiBamSummary from deeptools to generate the gene expression table.

ADD REPLY
0
Entering edit mode

Thanks, I'll give it a crack

ADD REPLY
0
Entering edit mode

Are all the tab files in the same folder? If not, how are the files stored? It's important to be specific since the code would need to loop through the files.

ADD REPLY
0
Entering edit mode

I have both, I thought that having each file in the same directory might make it easier. I'll be sure to clarify in future posts.

ADD REPLY
0
Entering edit mode

See if the answer here is of any help:

A: Unable to create file for tximport

ADD REPLY
2
Entering edit mode
8 months ago

In R, some general guidance:

1,obtain a list of all files

myfiles <- list.files(...)

2, use lapply() to read all files into a list

mydata <- lapply(myfiles, function(x) {
  read.table(x, sep = '\t', dec = '.',
    header = TRUE, row.names = 1, stringsAsFactors = FALSE)
})
names(mydata) <- myfiles

3, summarise the data into a single matrix

mydata_tpm <- lapply(mydata, function(x) data.frame(TPM = x[['TPM']]))
mat <- do.call(cbind, mydata_tpm)

I trust that you can fill in everything else.

Kevin

ADD COMMENT
0
Entering edit mode

By the way, you can pass arguments that would go to the function (argument: FUN) directly to lapply:

lapply(myfiles, read.table, sep = '\t', dec = '.', header = TRUE, row.names = 1, stringsAsFactors = FALSE)
ADD REPLY
0
Entering edit mode

Didn't know that - thanks!

ADD REPLY
0
Entering edit mode

Thank you! I have given it a try, however I have run into 2 issues.

Firstly, when running the below I get duplicate 'row.names' are not allowed and I think the this comes down to row.names=1. I checked the files to see if there were any duplicate Gene IDs and there weren't (cat file.tab | cut -d $'\t' -f 1 | sort | uniq -d). I changed it to row.names=NULL and it worked. I think the issue is because there's an extra column in the header or first row.

mydata <- lapply(myfiles, function(x) {read.table(x, sep = '\t', dec = '.', header = TRUE, row.names = 1, stringsAsFactors = FALSE)})

Secondly, mat <- do.call(cbind, mydata_tpm) produces the below error.

Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 60625, 60622, 60621, 60620, 60626, 60624, 60618, 60623, 60619, 60627

After having a look at the number of lines in each geneabundance.tab file shows that they have different number of lines. Additionally, I assume that since there are different number of genes then there are different genes in each .tab file. I will have to account for that then somehow.

Sorry if this is frazzled, I have spent the last couple hours trying to find the solution.

ADD REPLY
2
Entering edit mode
8 months ago

Try this. It accounts for the files having different numbers of rows.

library("data.table")
library("purrr")

myfiles <- setNames(list.files(...), basename(list.files(...))

mydata <- imap(myfiles, function(x, y) {
  DT <- fread(x)
  setnames(DT, old="TPM", new=str_c("TPM_", y))
  DT[, c("Gene ID", "Reference", "Strand", "Start", "End", "Coverage") := NULL]
  return(copy(DT))
})

mydata <- reduce(mydata, merge, by="Gene Name", all=TRUE)
ADD COMMENT
0
Entering edit mode

This works as well! Thank you.

I had to change mydata to myfiles.

mydata <- imap(myfiles, function(x, y) {
ADD REPLY

Login before adding your answer.

Traffic: 2007 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6