Question: How tximport work with gencode transcripts?
0
gravatar for Sharon
4 months ago by
Sharon340
Sharon340 wrote:

Hi All

How to use tximport with gencode?

I used Gencode transcripts for mouse looks like this:

>ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|RP23-271O17.1-001|RP23-271O17.1|1070|TEC|
 AAGGAAAGAGGATAACACTTGAAATGTAAATAAAGAAAATACCTAATAAAAATAAATAAA
 AACATGCTTTCAAAGGAAATAAAAAGTTGGATTCAAAAATTTAACTTTTGCTCATTTGGT

Salmon quantification is like this;

Name    Length  EffectiveLength TPM     NumReads
ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|RP23-271O17.1-001|RP23-271O17.1|1070|TEC|   1070    489.786 0.536463        7.4771
ENSMUST00000082908.1|ENSMUSG00000064842.1|-|-|Gm26206-201|Gm26206|110|snRNA|    110     28      0       0

I use this code to generate tx2gene:

library(TxDb.Mmusculus.UCSC.mm10.ensGene) 
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene)
columns(txdb)
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "GENEID") 
tx2gene <- df[, 2:1]
head(tx2gene)
write.csv(tx2gene, 'MouseEnstx2gene.csv')

The MouseEnstx2gene.csv looks like this and this causes problem in further analysis:

"","TXNAME","GENEID"
"1","ENSMUST00000000001","ENSMUSG00000000001"
"2","ENSMUST00000000003","ENSMUSG00000000003"

I feel there is a tiny thing I am missing here that could fix this as it is not the same format as Salmon quantification ! Any hint? I just tried ignoreAfterBar from tximport and it doesn't work !

Error in tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE) : 
  unused argument (ignoreAfterBar = TRUE)

Thanks

rna-seq R • 413 views
ADD COMMENTlink modified 4 months ago by Kevin Blighe32k • written 4 months ago by Sharon340
2
gravatar for Kevin Blighe
4 months ago by
Kevin Blighe32k
Republic of Ireland
Kevin Blighe32k wrote:

Hi Sharon,

I'm not sure that you even need the ignoreAfterBar parameter. If using the tx2gene parameter, though, then the first column of your tx2gene object has to be the exact transcript name as in the Salmon files. The second column, then, is what you wan to convert these to.

For example, if you save the Salmon transcripts to a character vector (here, salmonCounts is just any file output by Salmon):

transcriptsSalmon <- as.character(salmonCounts$Name)

head(transcriptsSalmon, 10)                                                                                                                                          
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|
ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|
ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|
ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|RP11-34P13.4-001|FAM138A|1187|lincRNA|
ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|RP11-34P13.4-002|FAM138A|590|lincRNA|
ENST00000606857.1|ENSG00000268020.3|OTTHUMG00000185779.1|OTTHUMT00000471235.1|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|

We can then parse this and extract whatever information we want, information which will eventually be used as the rownames of our future txi object created by tximport via tx2gene:

tx2gene <- data.frame(transcriptsSalmon, do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,6])
colnames(tx2gene) <- c("transcript_id", "gene_id")
head(tx2gene, 10)

transcript_id                                                                          gene_id
ENST00000456328.2|...|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|              DDX11L1
ENST00000450305.2|...|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene| DDX11L1
ENST00000488147.1|...|RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|             WASH7P
ENST00000619216.1|...|MIR6859-1-201|MIR6859-1|68|miRNA|                                MIR6859-1
ENST00000473358.1|...|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA|                       RP11-34P13.3
ENST00000469289.1|..|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA|                        RP11-34P13.3
ENST00000607096.1|...|MIR1302-2-201|MIR1302-2|138|miRNA|                               MIR1302-2
ENST00000417324.1|...|RP11-34P13.4-001|FAM138A|1187|lincRNA|                           FAM138A
ENST00000461467.1|...|RP11-34P13.4-002|FAM138A|590|lincRNA|                            FAM138A
ENST00000606857.1|...|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|             OR4G4P

The pitfall of using just the gene name is that it is not unique, so, will result in issues. You could also merge fields:

tx2gene <- data.frame(transcriptsSalmon,
  paste(do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,1],
  do.call(rbind, strsplit(transcriptsSalmon, "\\|"))[,6], sep="_"))
colnames(tx2gene) <- c("transcript_id", "gene_id")
tx2gene$gene_id[1:20]

 [1] ENST00000456328.2_DDX11L1      ENST00000450305.2_DDX11L1     
 [3] ENST00000488147.1_WASH7P       ENST00000619216.1_MIR6859-1   
 [5] ENST00000473358.1_RP11-34P13.3 ENST00000469289.1_RP11-34P13.3
 [7] ENST00000607096.1_MIR1302-2    ENST00000417324.1_FAM138A     
 [9] ENST00000461467.1_FAM138A      ENST00000606857.1_OR4G4P      
[11] ENST00000642116.1_OR4G11P      ENST00000492842.2_OR4G11P     
[13] ENST00000641515.2_OR4F5        ENST00000335137.4_OR4F5       
[15] ENST00000466430.5_RP11-34P13.7 ENST00000477740.5_RP11-34P13.7
[17] ENST00000471248.1_RP11-34P13.7 ENST00000610542.1_RP11-34P13.7
[19] ENST00000453576.2_RP11-34P13.7 ENST00000495576.1_RP11-34P13.8

Does that make sense?

Kevin

ADD COMMENTlink modified 4 months ago • written 4 months ago by Kevin Blighe32k
1

A comprehensive answer from you, as usual. Nice to have you around, Sir!

ADD REPLYlink written 4 months ago by ATpoint9.2k

Thanks bro. Have to empty the contents of my head before I'm too old to remember anything!

ADD REPLYlink written 4 months ago by Kevin Blighe32k

Hi Kevin Thanks so much and sorry for the late reply, was away. Yes, this makes sense. I understand where the problem came from but did not how to solve, am still learning R. I will try that. Thanks so much :)

ADD REPLYlink written 4 months ago by Sharon340
1

No problem. Trust that you are not working too much this weekend.

ADD REPLYlink written 4 months ago by Kevin Blighe32k
1

work doesn't leave me =D I just switched to something else :) :)

ADD REPLYlink written 4 months ago by Sharon340
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1571 users visited in the last hour