bustools transcript to gene file
4.2 years ago
Barry Digby ★ 1.3k

Hi,

Is it appropriate to generate a tx2gene file from an Ensembl cDNA reference file as input for bustools?

I want to insert this code into a larger Nextflow kallisto | bustools scRNA-seq script so that it generates a tx2gene file for the analysis and stays flexible across reference genomes.

#!/usr/bin/env nextflow

params.file = "Homo_sapiens.GRCh38.cdna.all.fa"
Channel
        .fromPath(params.file)
        .into{ tx2gene; kallisto_index }

process bash{
    publishDir "./", mode:'copy'

    input:
    file 'Homo_sapiens.GRCh38.cdna.all.fa' from tx2gene

    output:
    file "t2g.txt" into tx2gene_ch

    shell:
    $/
    # Keep header lines and print transcript ID, gene ID and gene symbol (fields 1, 4, 7)
    awk '{if($1~/>/)print $1"\t"$4"\t"$7}' Homo_sapiens.GRCh38.cdna.all.fa > t2g.txt
    # Strip the ">" marker and the "gene:" and "gene_symbol:" prefixes
    sed -i 's/>//g; s/gene://g; s/gene_symbol://g' t2g.txt
    /$
}
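
One caveat with the awk approach is that it picks fields by position, so any header that lacks a `gene_symbol:` tag would put something else in column 3. Here is a minimal Python sketch that looks the tags up by name instead. It assumes the headers carry `gene:` and `gene_symbol:` tags (as the sed commands above imply); the function names `parse_header` and `fasta_to_t2g` are made up for illustration, and falling back to the gene ID when no symbol is present is my own choice, not something bustools requires.

```python
def parse_header(header):
    """Return (transcript_id, gene_id, gene_symbol) from one FASTA header line."""
    fields = header.lstrip(">").split()
    transcript_id = fields[0]
    tags = {}
    for field in fields[1:]:
        if field.startswith("description:"):
            break  # free text follows the description tag, so stop here
        key, sep, value = field.partition(":")
        if sep:  # skip bare words such as "cdna"
            tags[key] = value
    gene_id = tags.get("gene", "")
    # Assumption: reuse the gene ID when no gene symbol is annotated
    return transcript_id, gene_id, tags.get("gene_symbol", gene_id)


def fasta_to_t2g(fasta_path, out_path):
    """Write one tab-separated transcript/gene/symbol line per FASTA record."""
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                out.write("\t".join(parse_header(line.strip())) + "\n")
```

Calling `fasta_to_t2g("Homo_sapiens.GRCh38.cdna.all.fa", "t2g.txt")` on the decompressed FASTA should give the same three columns as the shell version for headers that have all the tags.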

Perhaps not the cleanest code. The output looks like this:

ENST00000434970.2   ENSG00000237235.2   TRDD2
ENST00000415118.1   ENSG00000223997.1   TRDD1
ENST00000448914.1   ENSG00000228985.1   TRDD3
ENST00000631435.1   ENSG00000282253.1   TRBD1
ENST00000632684.1   ENSG00000282431.1   TRBD1
ENST00000390583.1   ENSG00000211923.1   IGHD3-10
ENST00000431440.2   ENSG00000232543.2   IGHD4-11
ENST00000632524.1   ENSG00000282455.1   IGHD7-27
ENST00000633009.1   ENSG00000282323.1   IGHD1-26
ENST00000634070.1   ENSG00000282724.1   IGHD6-25

Comparing the tx2gene file I generated with the one provided by kallisto, the kallisto file has 23,373 more lines. Any ideas why this might be the case?

edits: improved code, reformatted post

bustools kallisto scRNA nextflow ensembl • 2.7k views
4.2 years ago
ATpoint 81k

According to the kallisto release notes (https://github.com/pachterlab/kallisto-transcriptome-indices/releases), transcripts_to_genes.txt was made from the Ensembl GTF annotation file. I am 99.9% sure the GTF contains all annotated transcripts in the genome, including non-coding RNAs. You used Homo_sapiens.GRCh38.cdna.all.fa, which does not contain certain ncRNAs (Ensembl provides a separate FASTA file for these). I guess this explains the difference and why the file you downloaded has more transcripts. You could double-check by looking up some of the genes/transcripts that are present only in the downloaded file to see what they are.
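
For the double-check, something like the following sketch would list the transcript IDs found in one table but not the other. It assumes both files have the transcript ID in the first whitespace-separated column; the helper names and the `strip_version` option (for the case where one file has versioned IDs and the other does not) are hypothetical.

```python
def read_transcript_ids(path, strip_version=False):
    """Collect the first column of a tx2gene table as a set of transcript IDs."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            transcript_id = line.split()[0]
            if strip_version:
                # "ENST00000434970.2" -> "ENST00000434970"
                transcript_id = transcript_id.split(".")[0]
            ids.add(transcript_id)
    return ids


def only_in_first(path_a, path_b, strip_version=False):
    """Return the sorted transcript IDs present in path_a but absent from path_b."""
    ids_a = read_transcript_ids(path_a, strip_version)
    ids_b = read_transcript_ids(path_b, strip_version)
    return sorted(ids_a - ids_b)
```

For example, `only_in_first("transcripts_to_genes.txt", "t2g.txt")` would give the IDs unique to the kallisto file, which can then be looked up in Ensembl to see what kind of transcripts they are.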

In any case, I guess making these tables from the GTF is safest, since it contains all annotated transcripts.
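
A rough Python sketch of building the table from a GTF, in case it helps. It assumes an Ensembl-style GTF with `transcript` feature rows and `transcript_id`, `gene_id` and (optionally) `transcript_version`, `gene_version` and `gene_name` attributes. The function name `gtf_to_t2g` is made up, and I have not checked that this reproduces the kallisto-provided file exactly.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gtf_to_t2g(gtf_path, out_path):
    """Write transcript ID, gene ID and gene name for every transcript row."""
    with open(gtf_path) as gtf, open(out_path, "w") as out:
        for line in gtf:
            if line.startswith("#"):
                continue  # header comments
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 9 or columns[2] != "transcript":
                continue  # keep one row per transcript only
            attrs = dict(ATTRIBUTE.findall(columns[8]))
            transcript_id = attrs["transcript_id"]
            gene_id = attrs["gene_id"]
            # Append versions when present so IDs look like ENST00000434970.2
            if "transcript_version" in attrs:
                transcript_id += "." + attrs["transcript_version"]
            if "gene_version" in attrs:
                gene_id += "." + attrs["gene_version"]
            # Assumption: fall back to the gene ID when no gene_name is given
            gene_name = attrs.get("gene_name", gene_id)
            out.write(f"{transcript_id}\t{gene_id}\t{gene_name}\n")
```

The GTF needs to be decompressed first (or the `open` call swapped for `gzip.open(..., "rt")`).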

Thank you for the clarification :)

I am not sure this is 100% correct, so I asked Emily_Ensembl to confirm it (at least regarding what is in the GTF and the cDNA/ncRNA FASTA files). Let's see if she agrees.

This is correct. GTF is all transcripts, cDNA FASTA is coding only. You could combine the cDNA FASTA with the ncRNA FASTA.
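
Combining the two FASTA files could look roughly like this in Python. It is only a sketch: the function name `concatenate_fastas` is made up, the files are assumed to be decompressed, and it does not check for duplicate transcript IDs across the inputs.

```python
def concatenate_fastas(input_paths, out_path):
    """Append several FASTA files into one and return the number of records written."""
    n_records = 0
    with open(out_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        n_records += 1
                    # Guard against a missing final newline so that the next
                    # file's first header does not end up glued to a sequence
                    out.write(line if line.endswith("\n") else line + "\n")
    return n_records
```

The combined file would then be the input for both the index and the tx2gene step, so that the two stay in sync.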

Thank you ATpoint and Emily.

I am still a bit confused as to why 'kallisto assets' would provide a tx2gene file based on an all-encompassing GTF file when the reads are aligned to the cDNA reference. Or am I missing something obvious?

Are they aligned to cDNA? I always quantify against the entire transcriptome (with salmon/alevin, though). There is no strict rule that only coding transcripts must be used. Differential regulation of ncRNAs can be biologically meaningful as well. I would exclude small RNAs before DEG analysis, though, as these are typically not well captured during RNA isolation and library prep in standard (sc)RNA-seq (they are too small and fall outside the range of fragments that are properly captured).

I just had a quick look at the salmon "getting started" page and they use Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz as the reference for the tutorial. I think we mean the same thing when we say transcriptome/cDNA :)

I've tried comparing the two files as you suggested, using H. sapiens, but the difference between the files doesn't make sense to me. I'll just stick to the kallisto-provided files because I can't justify doing otherwise.

There is a kallisto user group https://groups.google.com/forum/#!forum/kallisto-and-applications where you could ask for expert clarification, just fyi.
