Question: Tximport mixing annotations
0
gravatar for Tania
10 months ago by
Tania120
Tania120 wrote:

Hi Everyone

I used the annotation from "GRCh37_latest_genomic.gff" to build tx2gene while my transcripts are from "GRCh37_latest_rna.fna.gz"

txdb <-makeTxDbFromGFF("~/Desktop/work/databases/GRCh37_latest_genomic.gff.gz")
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "GENEID")
tx2gene <- df[, 2:1]

I am getting this error, aren't those files correspond to each other?

Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) : 

  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.
  
rna-seq txdb tximport • 737 views
ADD COMMENTlink written 10 months ago by Tania120

Hi Tania,

have you tried this command on an unzipped GFF ? Cheers, Michael

ADD REPLYlink written 10 months ago by michael.ante2.6k

Tried that now, doesn't work too. Thanks though !

ADD REPLYlink written 10 months ago by Tania120

Looking at both of these files, the IDs from the FASTA file that you used are in RefSeq format. Thus, you may have to specify

df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "Name")

or

df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "transcript_id")

I have used tximport in the past but I always curate my own annotation reference. I always use the GENCODE transcripts for alignment and annotation, as these contain HGNC IDs in the FASTA headers.


head GRCh37_latest_rna.fna

NM_000014.5 Homo sapiens alpha-2-macroglobulin (A2M), transcript variant 1, mRNA

ATACAAGAGATGTGAGAAGCACCATAAAAGGCGTTGTGAGGAGTTGTGGGGGAGTGAGGGAGAGAAGAGG TTGAAAAGCTTATTAGCTGCTGTACGGTAAAAGTGAGCTCTTACGGGAATGGGAATGTAGTTTTAGCCCT CCAGGGATTCTATTTAGCCCGCCAGGAATTAACCTTGACTATAAATAGGCCATCAATGACCTTTCCAGAG AATGTTCAGAGACCTCAACTTTGTTTAGAGATCTTGTGTGGGTGGAACTTCCTGTTTGCACACAGAGCAG CATAAAGCCCAGTTGCTTTGGGAAGTGTTTGGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTC TGTTCTCCTCCAGCTCCTTCTTTCTGCAACATGGGGAAGAACAAACTCCTTCATCCAAGTCTGGTTCTTC TCCTCTTGGTCCTCCTGCCCACAGACGCCTCAGTCTCTGGAAAACCGCAGTATATGGTTCTGGTCCCCTC

grep -e "NM_000014.5" GRCh37_latest_genomic.gff | head

NC_000012.11 BestRefSeq mRNA 9220304 9268825 . - . ID=rna36219;Parent=gene24947;Dbxref=GeneID:2,Genbank:NM_000014.5,HGNC:HGNC:7,MIM:103950;Name=NM_000014.5;Note=The RefSeq transcript has 1 substitution compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=A2M;product=alpha-2-macroglobulin%2C transcript variant 1;transcript_id=NM_000014.5

NC_000012.11 BestRefSeq exon 9268360 9268825 . - . ID=id345908;Parent=rna36219;Dbxref=GeneID:2,Genbank:NM_000014.5,HGNC:HGNC:7,MIM:103950;Note=The RefSeq transcript has 1 substitution compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=A2M;product=alpha-2-macroglobulin%2C transcript variant 1;transcript_id=NM_000014.5

NC_000012.11 BestRefSeq exon 9265956 9266139 . - . ID=id345909;Parent=rna36219;Dbxref=GeneID:2,Genbank:NM_000014.5,HGNC:HGNC:7,MIM:103950;Note=The RefSeq transcript has 1 substitution compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=A2M;product=alpha-2-macroglobulin%2C transcript variant 1;transcript_id=NM_000014.5

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe28k

When I search for the transcripts manually in the quantifications , they do exist in tx2gene, but not sure why tximport did not see them. I think I will try re-annotating with Genecode and see if this will work.

Thanks

ADD REPLYlink written 10 months ago by Tania120
1

Okay, if you run into issues, then you can just download my own GENCODE tximport file from here.

I normally used this like:

TranscriptToGeneMap <- read.table("Library/TranscriptsTxImport.tsv", header=TRUE, sep="\t", skip=0, stringsAsFactors=FALSE)

#Read in the counts
#   txIn, input counts are transcript level?
#   txOut, output counts are transcript level?
txi <- tximport(files, type="kallisto", txIn=TRUE, txOut=FALSE, tx2gene=TranscriptToGeneMap, countsFromAbundance="no") #, reader=fread)

When I last used this, I was not aware of the makeTxDbFromGFF function, but it looks useful and may have been added since I first used tximport (2 year ago).

ADD REPLYlink written 10 months ago by Kevin Blighe28k
1

Thanks Kevin so much. I will try this and see hopefully

ADD REPLYlink written 10 months ago by Tania120

Hi @Kevin, The output looks like this, what is protein coding or rna associated with the gene. Do you have explanation? Sorry don't have too much biology background.

ADH7.protein_coding",-9.35112818723107,5.08564645523806,7.26455080231657e-24,8.23800060982699e-21
"TNR.protein_coding",6.76699157467661,5.56020033264359,5.98150908346311e-22,3.39151565032358e-19
"SCN3B.protein_coding",5.68195884233326,6.04515812873865,2.17776757060652e-21,8.23196141689263e-19
"CECR2.protein_coding",4.56376413767869,4.94697309522689,9.84755015258283e-21,2.79178046825723e-18
"SNORA70.snoRNA",2.46479498252603,8.222964993941,3.74758538408327e-08,1.41658727518348e-06
"RN7SKP106.misc_RNA",7.02192630891482,1.41996667770098,3.87278897378951e-08,1.41669119234752e-06
"CSRP1.protein_coding",-1.83461664661011,8.23257557071387,4.84852163343817e-08,1.71819485384965e-06
ADD REPLYlink modified 10 months ago • written 10 months ago by Tania120
1

Oh, well, the GENCODE files to which I pointed you contain all of the RNA genes identified by the ENCODE project, which totals up to ~200,000 different RNAs. Only ~21,000 of these RNAs are actually further translated into proteins; hence, the 'protein_coding' suffix. These protein coding RNA genes comprise the most well known genes, such as the HOX gene group, TP53, BRCA1, CD20, CD79A/MS4A1, CR2, etc.

All of the other genes do not encode for a protein but may still have functionality, i.e., non-coding RNAs (ncRNAs). Well-known ones include XIST, a non-coding RNA which in females helps to deactivate one X chromosome through epigenetic silencing. Many other ncRNAs are transcribed from the antisense strand of another RNA and may interfere with the expression of the sense RNA. TSIX, for example, is transcribed on the opposite strand of XIST.

There is further information on all 'biotypes' in GENCODE here: https://www.gencodegenes.org/gencode_biotypes.html

You will most likely not be interested in all of these. For example, there are ~40,000 pseudogenes alone.

ADD REPLYlink written 10 months ago by Kevin Blighe28k
1

excellent ! Thanks Kevin a lot

ADD REPLYlink written 10 months ago by Tania120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1324 users visited in the last hour