How to filter Ensembl cDNA and ncRNA FASTA files by primary assembly?
6.0 years ago

I'm currently performing differential expression analysis with alignment-free quantification in Kallisto. To do this, I need to build a Kallisto index from Ensembl's cDNA and ncRNA FASTA files, available at the following two links:

ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/ncrna/Mus_musculus.GRCm38.ncrna.fa.gz

However, later in my analysis I noticed that multiple Ensembl gene IDs pointed to the same gene. For example, I was getting results for both ENSMUSG00000026012 and ENSMUSG00000102412, which both map to Cd28. I realized this happens because the annotations I was using contain haplotypes.
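To see how widespread this is, the FASTA headers themselves can be scanned. The following is only a rough sketch: it assumes each header carries space-separated `gene:` and `gene_symbol:` fields, and the function name `symbols_with_multiple_genes` is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed header fields (not confirmed in this thread): " gene:<ID>" and " gene_symbol:<NAME>".
GENE = re.compile(r"\sgene:(\S+)")
SYMBOL = re.compile(r"\sgene_symbol:(\S+)")

def symbols_with_multiple_genes(headers):
    """Map each gene symbol to its gene IDs, keeping only symbols with more than one ID."""
    genes_by_symbol = defaultdict(set)
    for header in headers:
        gene = GENE.search(header)
        symbol = SYMBOL.search(header)
        if gene and symbol:
            gene_id = gene.group(1).split(".")[0]  # drop the version suffix, if any
            genes_by_symbol[symbol.group(1)].add(gene_id)
    return {s: sorted(ids) for s, ids in genes_by_symbol.items() if len(ids) > 1}
```

Passing it every line that starts with `>` from the two files should list symbols such as Cd28 together with the gene IDs that share them.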

I would like to remove these haplotypes, and to do this I believe I need to rerun my analysis against the primary assembly. However, Ensembl doesn't provide cDNA and ncRNA FASTA files restricted to the primary assembly. How can I filter these files so that they contain only genes present in the primary assembly? Is there a better way to solve this issue?

6.0 years ago

The FASTA sequences for mouse are indeed available, and for all identified transcripts. Go Here and look under the heading Fasta files.

The equivalent and matching GTFs that you need are also on that page.
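If the Ensembl files themselves need to be filtered, as the question asks, one possible approach is to drop records by the location field in each header. This is a minimal sketch, not a tested pipeline. It assumes each header contains a field of the form `coord_system:assembly:seq_region:start:end:strand`, and that haplotype and patch regions are the ones whose seq region name starts with `CHR_`. Check both assumptions against a few real headers first. The function names are illustrative.

```python
import gzip
import re

# Assumed location field: " <coord_system>:<assembly>:<seq_region>:<start>:<end>:<strand>"
LOCATION = re.compile(r"\s\w+:[^:\s]+:([^:\s]+):\d+:\d+:-?1")

def is_primary(header: str) -> bool:
    """True when the seq region is not a CHR_-prefixed haplotype/patch (assumed naming)."""
    match = LOCATION.search(header)
    if match is None:
        return False  # unrecognised header layout: drop rather than guess
    return not match.group(1).startswith("CHR_")

def filter_fasta(lines):
    """Yield header and sequence lines of the records that pass is_primary."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = is_primary(line)
        if keep:
            yield line

def filter_file(src: str, dst: str) -> None:
    """Stream a gzipped FASTA file through filter_fasta into a new gzipped file."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        fout.writelines(filter_fasta(fin))
```

For example, `filter_file("Mus_musculus.GRCm38.cdna.all.fa.gz", "cdna.primary.fa.gz")` would write a filtered copy, where the output name is arbitrary. Records on unplaced scaffolds are kept, since under the assumption above they do not carry the `CHR_` prefix.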

Kevin
