Question

How tximport work with gencode transcripts?

1

Entering edit mode

5.8 years ago

Sharon ▴ 600

Hi All

How to use tximport with gencode?

I used Gencode transcripts for mouse looks like this:

>ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|RP23-271O17.1-001|RP23-271O17.1|1070|TEC|
 AAGGAAAGAGGATAACACTTGAAATGTAAATAAAGAAAATACCTAATAAAAATAAATAAA
 AACATGCTTTCAAAGGAAATAAAAAGTTGGATTCAAAAATTTAACTTTTGCTCATTTGGT

Salmon quantification is like this;

Name    Length  EffectiveLength TPM     NumReads
ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|RP23-271O17.1-001|RP23-271O17.1|1070|TEC|   1070    489.786 0.536463        7.4771
ENSMUST00000082908.1|ENSMUSG00000064842.1|-|-|Gm26206-201|Gm26206|110|snRNA|    110     28      0       0

I use this code to generate tx2gene:

library(TxDb.Mmusculus.UCSC.mm10.ensGene) 
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene)
columns(txdb)
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "GENEID") 
tx2gene <- df[, 2:1]
head(tx2gene)
write.csv(tx2gene, 'MouseEnstx2gene.csv')

The MouseEnstx2gene.csv looks like this and this causes problem in further analysis:

"","TXNAME","GENEID"
"1","ENSMUST00000000001","ENSMUSG00000000001"
"2","ENSMUST00000000003","ENSMUSG00000000003"

I feel there is a tiny thing I am missing here that could fix this as it is not the same format as Salmon quantification ! Any hint? I just tried ignoreAfterBar from tximport and it doesn't work !

Error in tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE) : 
  unused argument (ignoreAfterBar = TRUE)

Thanks

RNA-Seq R • 4.6k views

ADD COMMENT • link updated 2.7 years ago by Kevin Blighe 87k • written 5.8 years ago by Sharon ▴ 600

score 3 · Accepted Answer · 2018-06-22

Hi Sharon,

I'm not sure that you even need the ignoreAfterBar parameter. If using the tx2gene parameter, though, then the first column of your tx2gene object has to be the exact transcript name as in the Salmon files. The second column, then, is what you wan to convert these to.

For example, if you save the Salmon transcripts to a character vector (here, salmonCounts is just any file output by Salmon):

transcriptsSalmon <- as.character(salmonCounts$Name)

head(transcriptsSalmon, 10)                                                                                                                                          
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|
ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|
ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|
ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|RP11-34P13.4-001|FAM138A|1187|lincRNA|
ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|RP11-34P13.4-002|FAM138A|590|lincRNA|
ENST00000606857.1|ENSG00000268020.3|OTTHUMG00000185779.1|OTTHUMT00000471235.1|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|

We can then parse this and extract whatever information we want, information which will eventually be used as the rownames of our future txi object created by tximport via tx2gene:

tx2gene <- data.frame(
  transcriptsSalmon,
  do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,6])
colnames(tx2gene) <- c("transcript_id", "gene_id")
head(tx2gene, 10)

transcript_id                                                                          gene_id
ENST00000456328.2|...|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|              DDX11L1
ENST00000450305.2|...|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene| DDX11L1
ENST00000488147.1|...|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|             WASH7P
ENST00000619216.1|...|MIR6859-1-201|MIR6859-1|68|miRNA|                                MIR6859-1
ENST00000473358.1|...|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|                       RP11-34P13.3
ENST00000469289.1|..|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|                        RP11-34P13.3
ENST00000607096.1|...|MIR1302-2-201|MIR1302-2|138|miRNA|                               MIR1302-2
ENST00000417324.1|...|RP11-34P13.4-001|FAM138A|1187|lincRNA|                           FAM138A
ENST00000461467.1|...|RP11-34P13.4-002|FAM138A|590|lincRNA|                            FAM138A
ENST00000606857.1|...|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|             OR4G4P

The pitfall of using just the gene name is that it is not unique, so, will result in issues. You could also merge fields:

tx2gene <- data.frame(transcriptsSalmon,
  paste(do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,1],
  do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,6], sep="_"))
colnames(tx2gene) <- c("transcript_id", "gene_id")
tx2gene$gene_id[1:20]

 [1] ENST00000456328.2_DDX11L1      ENST00000450305.2_DDX11L1     
 [3] ENST00000488147.1_WASH7P       ENST00000619216.1_MIR6859-1   
 [5] ENST00000473358.1_RP11-34P13.3 ENST00000469289.1_RP11-34P13.3
 [7] ENST00000607096.1_MIR1302-2    ENST00000417324.1_FAM138A     
 [9] ENST00000461467.1_FAM138A      ENST00000606857.1_OR4G4P      
[11] ENST00000642116.1_OR4G11P      ENST00000492842.2_OR4G11P     
[13] ENST00000641515.2_OR4F5        ENST00000335137.4_OR4F5       
[15] ENST00000466430.5_RP11-34P13.7 ENST00000477740.5_RP11-34P13.7
[17] ENST00000471248.1_RP11-34P13.7 ENST00000610542.1_RP11-34P13.7
[19] ENST00000453576.2_RP11-34P13.7 ENST00000495576.1_RP11-34P13.8

Does that make sense?

Kevin