I typically download cDNAs directly from Ensembl (e.g. with
wget ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz),
build a kallisto index, and run kallisto quant to estimate isoform abundance.
However, Ensembl tends to provide very detailed transcript models. Furthermore, the cDNA files from Ensembl also contain many transcripts with non-coding biotypes, from NMD to retained intron.
So I was wondering whether it would be better practice to filter the provided cDNA FASTA down to just those transcripts with a CCDS ID, or to filter by transcript biotype (such as "protein_coding").
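
To make the biotype option concrete, here is a minimal sketch of the kind of filter I have in mind. It assumes the Ensembl cDNA headers carry a `transcript_biotype:` field (they do in the release linked above). The function names and the output file name are hypothetical, and I have not benchmarked how this affects quantification.

```python
import gzip
import re

# Ensembl cDNA headers include a field such as "transcript_biotype:protein_coding".
BIOTYPE_RE = re.compile(r"transcript_biotype:(\S+)")


def filter_fasta_by_biotype(lines, keep=("protein_coding",)):
    """Yield the FASTA lines of records whose transcript biotype is in `keep`."""
    wanted = set(keep)
    writing = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts: decide once whether to keep it.
            match = BIOTYPE_RE.search(line)
            writing = match is not None and match.group(1) in wanted
        if writing:
            yield line


def filter_fasta_file(src, dst, keep=("protein_coding",)):
    """Stream a gzipped FASTA from `src` to `dst`, keeping only wanted biotypes."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        fout.writelines(filter_fasta_by_biotype(fin, keep))


if __name__ == "__main__":
    # Input is the file from the wget command above; the output name is made up.
    filter_fasta_file(
        "Homo_sapiens.GRCh38.cdna.all.fa.gz",
        "Homo_sapiens.GRCh38.cdna.protein_coding.fa.gz",
    )
```

The filtered file would then go to kallisto index in place of the full cDNA set. A CCDS-based filter would need an external transcript-to-CCDS mapping (for example from BioMart), since as far as I can tell the FASTA headers do not include CCDS IDs.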
As an example, a CCDS filter would cut the number of cDNAs for https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000077782 from 41 to 9.
How sensitive is kallisto with respect to overly complex/redundant gene architectures?