Question

bustools transcript to gene file

0

Entering edit mode

4.3 years ago

Barry Digby ★ 1.3k

Hi,

I want to ask is it appropriate to generate a tx2gene file from an ENSEMBL cDNA reference file for bustools input?

I want to insert this code into a larger nextflow kallisto | bustools scRNA script so it will generate a tx2gene file for the analysis, flexible to different reference genomes.

#!/usr/bin/env nextflow

params.file = "Homo_sapiens.GRCh38.cdna.all.fa"
Channel
        .fromPath(params.file)
        .into{ tx2gene; kallisto_index }

process bash{
    publishDir "./", mode:'copy'

    input:
    file ' Homo_sapiens.GRCh38.cdna.all.fa' from tx2gene

    output:
    file "t2g.txt" into tx2gene_ch


    shell:
    $/
    cat Homo_sapiens.GRCh38.cdna.all.fa | awk '{if($1~/>/)print $1"\t"$4"\t"$7}' > t2g.txt;
    sed -i 's/>//g' t2g.txt; sed -i 's/gene://g' t2g.txt; sed -i 's/gene_symbol://g' t2g.txt
    /$
    '''
}

Perhaps not the cleanest code. The ouput looks like this:

ENST00000434970.2   ENSG00000237235.2   TRDD2
ENST00000415118.1   ENSG00000223997.1   TRDD1
ENST00000448914.1   ENSG00000228985.1   TRDD3
ENST00000631435.1   ENSG00000282253.1   TRBD1
ENST00000632684.1   ENSG00000282431.1   TRBD1
ENST00000390583.1   ENSG00000211923.1   IGHD3-10
ENST00000431440.2   ENSG00000232543.2   IGHD4-11
ENST00000632524.1   ENSG00000282455.1   IGHD7-27
ENST00000633009.1   ENSG00000282323.1   IGHD1-26
ENST00000634070.1   ENSG00000282724.1   IGHD6-25

Comparing this tx2gene file i generated vs. the one provided by kallisto, there are 23,373 more lines in the kallisto file. Any ideas why this might be the case?

edits: improved code, reformatted post

bustools kallisto scRNA nextflow ensembl • 2.7k views

ADD COMMENT • link 4.2 years ago by Barry Digby ★ 1.3k

score 2 · Accepted Answer · 2020-01-21

2

Entering edit mode

4.3 years ago

ATpoint 82k

Based on the release notes from kallisto https://github.com/pachterlab/kallisto-transcriptome-indices/releases the transcripts_to_genes.txt was made from the Ensembl GTF annotation file. I am 99.9% sure this contains all annotated transcripts in the genome, so also non-coding RNAs. You used Homo_sapiens.GRCh38.cdna.all.fa which does not contain certain ncRNAs (for this there is a separate fasta file available from Ensembl). I guess this explains the difference and why the file you downloaded has more transcripts. You could double-check by searching for some of the genes/transcripts only present in the downloaded file to see what they are.

I guess that in any case making these tables from the GTF is safest as these contain all annotated transcripts.

ADD COMMENT • link 4.3 years ago by ATpoint 82k

0

Entering edit mode

Thank you for the clarification :)

ADD REPLY • link 4.3 years ago by Barry Digby ★ 1.3k

1

Entering edit mode

I am not sure this is 100% correct, I asked Emily_Ensembl to confirm this (at least towards what is in the GTF and cDNA/ncRNA fasta files), lets see if she agrees.

ADD REPLY • link 4.3 years ago by ATpoint 82k

2

Entering edit mode

This is correct. GTF is all transcripts, cDNA FASTA is coding only. You could combine the cDNA FASTA with the ncRNA FASTA.

ADD REPLY • link 4.3 years ago by Emily 23k

0

Entering edit mode

Thank you ATpoint and Emily.

I am still a bit confused as to why 'kallisto assets' would provide a tx2gene file based on an all encompassing GTF file, when the reads are aligned to the cDNA reference? Or am I missing something obvious.

ADD REPLY • link 4.3 years ago by Barry Digby ★ 1.3k

0

Entering edit mode

Are they aligned to cDNA? I always quantify against the entire transcriptome (with salmon/alevin though). THere is no strict rule that only coding transcripts must be used. Differential regulation of ncRNAs might/is/can-be biologically meaningful as well. I would exclude small RNAs though before DEG as these are typically not well captured during RNA isolation and library prep in standard (sc)RNA-seq (fall outside of the range of fragments being properly captured = too small).

ADD REPLY • link 4.3 years ago by ATpoint 82k

0

Entering edit mode

I just had a quick look at the salmon "getting started" page and they use Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz as the reference for the tutorial. I think we mean the same thing when we say transcriptome/cDNA :)

I've tried comparing the two files as you suggested using h.sapiens but the difference between the files doesn't make sense. I'll just stick to the kallisto provided files because I can't justify otherwise.

ADD REPLY • link 4.3 years ago by Barry Digby ★ 1.3k

1

Entering edit mode

There is a kallisto user group https://groups.google.com/forum/#!forum/kallisto-and-applications where you could ask for expert clarification, just fyi.

ADD REPLY • link 4.3 years ago by ATpoint 82k