Hi all,
I think this is probably a very simple question, but I have encountered some problems with data analysis when I have multiple conditions in one data set. Below I present one example of a DEXSeq analysis, but the same problem also occurs when doing a DESeq analysis with more than one condition in the data set.
In my workflow I read all the files into R
BaseDir = "/home/USER/dexseq"
countFiles = list.files(BaseDir, pattern="^MW.*\\.txt$", full.names=TRUE)
countFiles
[1] "/home/USER/dexseq/MW10.SE.txt"
[2] "/home/USER/dexseq/MW11.SE.txt"
[3] "/home/USER/dexseq/MW12.PE.PE.txt"
[4] "/home/USER/dexseq/MW12.SE.txt"
[5] "/home/USER/dexseq/MW13.SE.txt"
[6] "/home/USER/dexseq/MW14.SE.txt"
...
[21] "/home/USER/dexseq/MW7.SE.txt"
[22] "/home/USER/dexseq/MW8.PE.PE.txt"
[23] "/home/USER/dexseq/MW8.SE.txt"
[24] "/home/USER/dexseq/MW9.SE.txt"
My metadata file, though, is sorted by condition:
sample.Names ShortName condition libraryType
MW1.SE.txt MW1 ES singleEnd
MW8.PE.PE.txt MW8 ES PairedEnd
MW8.SE.txt MW8 ES singleEnd
MW16.SE.txt MW16 ES singleEnd
MW19.SE.txt MW19 ES singleEnd
MW7.SE.txt MW7 EB9 singleEnd
MW15.PE.PE.txt MW15 EB9 PairedEnd
MW15.SE.txt MW15 EB9 singleEnd
MW6.SE.txt MW6 EB8 singleEnd
...
MW10.SE.txt MW10 EB4 singleEnd
MW9.SE.txt MW9 EB3 singleEnd
MW18.SE.txt MW18 EB3 singleEnd
MW21.SE.txt MW21 EB3 singleEnd
MW17.SE.txt MW17 EB2 singleEnd
MW20.SE.txt MW20 EB2 singleEnd
When I try to compare, for example, my control (ES) with EB2, I do the following:
library(readr)  # provides read_tsv()
metaData <- read_tsv("metadata.txt")
metaData <- metaData[order(metaData$condition, decreasing = TRUE),]
EB2.ES <- subset(metaData, subset = metaData$condition %in% c("EB2", "ES"))
sampleTable <- data.frame(row.names= EB2.ES$sample.Names, condition= EB2.ES$condition, lib.type=EB2.ES$libraryType)
> sampleTable
condition lib.type
MW16.SE.txt ES singleEnd
MW19.SE.txt ES singleEnd
MW1.SE.txt ES singleEnd
MW8.PE.PE.txt ES PairedEnd
MW8.SE.txt ES singleEnd
MW17.SE.txt EB2 singleEnd
MW20.SE.txt EB2 singleEnd
counts1<- countFiles[basename(countFiles) %in% row.names(sampleTable)]
counts1
[1] "/home/USER/dexseq/MW16.SE.txt"
[2] "/home/USER/dexseq/MW17.SE.txt"
[3] "/home/USER/dexseq/MW19.SE.txt"
[4] "/home/USER/dexseq/MW1.SE.txt"
[5] "/home/USER/dexseq/MW20.SE.txt"
[6] "/home/USER/dexseq/MW8.PE.PE.txt"
[7] "/home/USER/dexseq/MW8.SE.txt"
dxd1 = DEXSeqDataSetFromHTSeq(
counts1,
sampleData=sampleTable,
design= ~ sample + exon + condition:exon,
flattenedfile=flattenedFile )
As you can see, the order of the samples in sampleTable, which I would like to use as sampleData, is not identical to the order of the count files in counts1.
Is there an automatic way to ensure that these two objects contain the same files in the same order?
Can I somehow subset the complete list of count files and/or the metadata to make sure that they still refer to the same files?
Thanks,
Assa
The whole point of making a sample table is to not run into this problem (e.g., by using the DESeqDataSetFromHTSeqCount function in DESeq2). If you're going to create the count matrix yourself, then you're responsible for ensuring that it's in the appropriate order.
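For the DEXSeq call in the question, where the count files and the sample table are passed separately, one way to line the two up is match(). Below is a minimal sketch in base R, assuming the row names of sampleTable are exactly the base names of the count files. The vectors are stand-ins built from the file names shown in the question (MW9.SE.txt is included only as an example of a file that should be dropped), not the real data.

```r
# Stand-in for the full list of count files (assumption: same directory as in the question)
countFiles <- file.path("/home/USER/dexseq",
                        c("MW16.SE.txt", "MW17.SE.txt", "MW19.SE.txt", "MW1.SE.txt",
                          "MW20.SE.txt", "MW8.PE.PE.txt", "MW8.SE.txt", "MW9.SE.txt"))

# Stand-in for the subsetted sample table, in the order shown in the question
sampleTable <- data.frame(
  row.names = c("MW16.SE.txt", "MW19.SE.txt", "MW1.SE.txt", "MW8.PE.PE.txt",
                "MW8.SE.txt", "MW17.SE.txt", "MW20.SE.txt"),
  condition = c("ES", "ES", "ES", "ES", "ES", "EB2", "EB2"),
  lib.type  = c("singleEnd", "singleEnd", "singleEnd", "PairedEnd",
                "singleEnd", "singleEnd", "singleEnd")
)

# For each sample name, find the position of the matching file in countFiles
idx <- match(rownames(sampleTable), basename(countFiles))

# Stop if any sample in the table has no count file
stopifnot(!anyNA(idx))

# Subset and reorder the files in one step, following the sample table
counts1 <- countFiles[idx]

# Final check: same files, same order
stopifnot(identical(basename(counts1), rownames(sampleTable)))
```

counts1 and sampleTable can then be passed to DEXSeqDataSetFromHTSeq as in the question. Unlike %in%, which keeps the files in the order of countFiles, indexing with match() follows the order of the sample table, and the stopifnot() checks fail loudly if the two ever disagree.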