Question: bustools transcript to gene file
0
gravatar for Barry Digby
27 days ago by
Barry Digby290
National University of Ireland, Galway
Barry Digby290 wrote:

Hi,

I want to ask is it appropriate to generate a tx2gene file from an ENSEMBL cDNA reference file for bustools input?

I want to insert this code into a larger nextflow kallisto | bustools scRNA script so it will generate a tx2gene file for the analysis, flexible to different reference genomes.

#!/usr/bin/env nextflow

params.file = "Homo_sapiens.GRCh38.cdna.all.fa"
Channel
        .fromPath(params.file)
        .into{ tx2gene; kallisto_index }

process bash{
    publishDir "./", mode:'copy'

    input:
    file ' Homo_sapiens.GRCh38.cdna.all.fa' from tx2gene

    output:
    file "t2g.txt" into tx2gene_ch


    shell:
    $/
    cat Homo_sapiens.GRCh38.cdna.all.fa | awk '{if($1~/>/)print $1"\t"$4"\t"$7}' > t2g.txt;
    sed -i 's/>//g' t2g.txt; sed -i 's/gene://g' t2g.txt; sed -i 's/gene_symbol://g' t2g.txt
    /$
    '''
}

Perhaps not the cleanest code. The ouput looks like this:

ENST00000434970.2   ENSG00000237235.2   TRDD2
ENST00000415118.1   ENSG00000223997.1   TRDD1
ENST00000448914.1   ENSG00000228985.1   TRDD3
ENST00000631435.1   ENSG00000282253.1   TRBD1
ENST00000632684.1   ENSG00000282431.1   TRBD1
ENST00000390583.1   ENSG00000211923.1   IGHD3-10
ENST00000431440.2   ENSG00000232543.2   IGHD4-11
ENST00000632524.1   ENSG00000282455.1   IGHD7-27
ENST00000633009.1   ENSG00000282323.1   IGHD1-26
ENST00000634070.1   ENSG00000282724.1   IGHD6-25

Comparing this tx2gene file i generated vs. the one provided by kallisto, there are 23,373 more lines in the kallisto file. Any ideas why this might be the case?

edits: improved code, reformatted post

ADD COMMENTlink modified 21 days ago • written 27 days ago by Barry Digby290
2
gravatar for ATpoint
27 days ago by
ATpoint29k
Germany
ATpoint29k wrote:

Based on the release notes from kallisto https://github.com/pachterlab/kallisto-transcriptome-indices/releases the transcripts_to_genes.txt was made from the Ensembl GTF annotation file. I am 99.9% sure this contains all annotated transcripts in the genome, so also non-coding RNAs. You used Homo_sapiens.GRCh38.cdna.all.fa which does not contain certain ncRNAs (for this there is a separate fasta file available from Ensembl). I guess this explains the difference and why the file you downloaded has more transcripts. You could double-check by searching for some of the genes/transcripts only present in the downloaded file to see what they are.

I guess that in any case making these tables from the GTF is safest as these contain all annotated transcripts.

ADD COMMENTlink modified 27 days ago • written 27 days ago by ATpoint29k

Thank you for the clarification :)

ADD REPLYlink written 27 days ago by Barry Digby290
1

I am not sure this is 100% correct, I asked Emily_Ensembl to confirm this (at least towards what is in the GTF and cDNA/ncRNA fasta files), lets see if she agrees.

ADD REPLYlink modified 27 days ago • written 27 days ago by ATpoint29k
2

This is correct. GTF is all transcripts, cDNA FASTA is coding only. You could combine the cDNA FASTA with the ncRNA FASTA.

ADD REPLYlink written 26 days ago by Emily_Ensembl20k

Thank you ATpoint and Emily.

I am still a bit confused as to why 'kallisto assets' would provide a tx2gene file based on an all encompassing GTF file, when the reads are aligned to the cDNA reference? Or am I missing something obvious.

ADD REPLYlink written 26 days ago by Barry Digby290

Are they aligned to cDNA? I always quantify against the entire transcriptome (with salmon/alevin though). THere is no strict rule that only coding transcripts must be used. Differential regulation of ncRNAs might/is/can-be biologically meaningful as well. I would exclude small RNAs though before DEG as these are typically not well captured during RNA isolation and library prep in standard (sc)RNA-seq (fall outside of the range of fragments being properly captured = too small).

ADD REPLYlink modified 26 days ago • written 26 days ago by ATpoint29k

I just had a quick look at the salmon "getting started" page and they use Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz as the reference for the tutorial. I think we mean the same thing when we say transcriptome/cDNA :)

I've tried comparing the two files as you suggested using h.sapiens but the difference between the files doesn't make sense. I'll just stick to the kallisto provided files because I can't justify otherwise.

ADD REPLYlink written 26 days ago by Barry Digby290
1

There is a kallisto user group https://groups.google.com/forum/#!forum/kallisto-and-applications where you could ask for expert clarification, just fyi.

ADD REPLYlink written 26 days ago by ATpoint29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1527 users visited in the last hour