Should cDNAs provided by Ensembl be filtered by CCDS or biotype prior to running kallisto?


7.1 years ago

holgerbrandl ▴ 30

I typically download cDNAs directly from Ensembl (e.g. with wget ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz), build a kallisto index, and run kallisto quant to estimate isoform abundance.

However, Ensembl tends to provide very detailed transcript models. Furthermore, the cDNA files from Ensembl also contain many transcripts whose biotype is not protein coding, from nonsense-mediated decay (NMD) to retained intron.

So I was wondering whether it would be better practice to filter the provided cDNA FASTA down to just those transcripts with a CCDS ID, or to filter by transcript biotype (such as "protein_coding").
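
To make the question concrete, here is a rough sketch of the kind of filter I have in mind. It assumes the Ensembl cDNA headers carry a transcript_biotype:<value> field and start with a versioned transcript ID (which is what I see in the release-93 file). The function names and the demo records are made up for illustration. As far as I can tell the CCDS ID is not part of the FASTA header, so a CCDS filter would need a separate list of transcript IDs (e.g. exported from BioMart), passed in here as the hypothetical allowed_ids argument.

```python
import gzip
import re
from typing import FrozenSet, Iterable, Iterator, Optional, Set, Tuple

# Assumption: Ensembl cDNA headers contain "transcript_biotype:<value>".
BIOTYPE_RE = re.compile(r"\btranscript_biotype:(\S+)")
DEFAULT_BIOTYPES: FrozenSet[str] = frozenset({"protein_coding"})


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header_without_gt, sequence) pairs from FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def keep_record(header: str,
                biotypes: Iterable[str] = DEFAULT_BIOTYPES,
                allowed_ids: Optional[Set[str]] = None) -> bool:
    """Decide whether a record passes the biotype and optional ID filters."""
    tokens = header.split()
    # Strip the version suffix, e.g. ENST00000000001.1 -> ENST00000000001.
    tx_id = tokens[0].split(".")[0] if tokens else ""
    if allowed_ids is not None and tx_id not in allowed_ids:
        return False
    match = BIOTYPE_RE.search(header)
    return match is not None and match.group(1) in set(biotypes)


def filter_fasta(lines: Iterable[str],
                 biotypes: Iterable[str] = DEFAULT_BIOTYPES,
                 allowed_ids: Optional[Set[str]] = None
                 ) -> Iterator[Tuple[str, str]]:
    """Yield only the records that pass keep_record()."""
    for header, seq in read_fasta(lines):
        if keep_record(header, biotypes, allowed_ids):
            yield header, seq


def write_filtered(in_path: str, out_path: str,
                   biotypes: Iterable[str] = DEFAULT_BIOTYPES,
                   allowed_ids: Optional[Set[str]] = None) -> int:
    """Filter a gzipped FASTA into a plain FASTA; return the number kept."""
    kept = 0
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for header, seq in filter_fasta(src, biotypes, allowed_ids):
            dst.write(f">{header}\n{seq}\n")
            kept += 1
    return kept


# Made-up records that mimic the Ensembl header layout (not real transcripts).
DEMO_FASTA = [
    ">ENST00000000001.1 cdna gene:ENSG00000000001.1 gene_biotype:protein_coding transcript_biotype:protein_coding",
    "ACGT",
    "ACGT",
    ">ENST00000000002.2 cdna gene:ENSG00000000001.1 gene_biotype:protein_coding transcript_biotype:retained_intron",
    "GGCC",
    ">ENST00000000003.1 cdna gene:ENSG00000000001.1 gene_biotype:protein_coding transcript_biotype:nonsense_mediated_decay",
    "TTAA",
    ">ENST00000000004.3 cdna gene:ENSG00000000001.1 gene_biotype:protein_coding transcript_biotype:protein_coding",
    "CCGG",
]

if __name__ == "__main__":
    for header, seq in filter_fasta(DEMO_FASTA):
        print(header.split()[0], len(seq))
```

The filtered file written by write_filtered() would then go into kallisto index instead of the full cdna.all file. Whether dropping these transcripts helps or hurts quantification is exactly what I am unsure about.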

As an example, a CCDS filter would cut down the number of cDNAs for https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000077782 from 41 to 9.

How sensitive is kallisto with respect to overly complex/redundant gene architectures?

kallisto isoforms • 1.4k views
