Hi,
I want to ask is it appropriate to generate a tx2gene
file from an ENSEMBL cDNA reference file for bustools input?
I want to insert this code into a larger nextflow kallisto | bustools
scRNA script so it will generate a tx2gene
file for the analysis, flexible to different reference genomes.
#!/usr/bin/env nextflow
params.file = "Homo_sapiens.GRCh38.cdna.all.fa"
Channel
.fromPath(params.file)
.into{ tx2gene; kallisto_index }
process bash{
publishDir "./", mode:'copy'
input:
file ' Homo_sapiens.GRCh38.cdna.all.fa' from tx2gene
output:
file "t2g.txt" into tx2gene_ch
shell:
$/
cat Homo_sapiens.GRCh38.cdna.all.fa | awk '{if($1~/>/)print $1"\t"$4"\t"$7}' > t2g.txt;
sed -i 's/>//g' t2g.txt; sed -i 's/gene://g' t2g.txt; sed -i 's/gene_symbol://g' t2g.txt
/$
'''
}
Perhaps not the cleanest code. The ouput looks like this:
ENST00000434970.2 ENSG00000237235.2 TRDD2
ENST00000415118.1 ENSG00000223997.1 TRDD1
ENST00000448914.1 ENSG00000228985.1 TRDD3
ENST00000631435.1 ENSG00000282253.1 TRBD1
ENST00000632684.1 ENSG00000282431.1 TRBD1
ENST00000390583.1 ENSG00000211923.1 IGHD3-10
ENST00000431440.2 ENSG00000232543.2 IGHD4-11
ENST00000632524.1 ENSG00000282455.1 IGHD7-27
ENST00000633009.1 ENSG00000282323.1 IGHD1-26
ENST00000634070.1 ENSG00000282724.1 IGHD6-25
Comparing this tx2gene
file i generated vs. the one provided by kallisto, there are 23,373 more lines in the kallisto file. Any ideas why this might be the case?
edits: improved code, reformatted post
Thank you for the clarification :)
I am not sure this is 100% correct, I asked Emily_Ensembl to confirm this (at least towards what is in the GTF and cDNA/ncRNA fasta files), lets see if she agrees.
This is correct. GTF is all transcripts, cDNA FASTA is coding only. You could combine the cDNA FASTA with the ncRNA FASTA.
Thank you ATpoint and Emily.
I am still a bit confused as to why 'kallisto assets' would provide a tx2gene file based on an all encompassing GTF file, when the reads are aligned to the cDNA reference? Or am I missing something obvious.
Are they aligned to cDNA? I always quantify against the entire transcriptome (with salmon/alevin though). THere is no strict rule that only coding transcripts must be used. Differential regulation of ncRNAs might/is/can-be biologically meaningful as well. I would exclude small RNAs though before DEG as these are typically not well captured during RNA isolation and library prep in standard (sc)RNA-seq (fall outside of the range of fragments being properly captured = too small).
I just had a quick look at the salmon "getting started" page and they use
Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz
as the reference for the tutorial. I think we mean the same thing when we say transcriptome/cDNA :)I've tried comparing the two files as you suggested using h.sapiens but the difference between the files doesn't make sense. I'll just stick to the kallisto provided files because I can't justify otherwise.
There is a kallisto user group https://groups.google.com/forum/#!forum/kallisto-and-applications where you could ask for expert clarification, just fyi.