Question

Ensembl assembly choice for RNA-Seq reference mapping

6.2 years ago

eoin ▴ 30

Hello,

This is a general question about choosing an assembly version for RNA-Seq mapping. I have a Bos taurus RNA-Seq dataset, and my pipeline uses STAR to align reads to the reference assembly and featureCounts to count reads per gene. The end goal is differential gene expression analysis, GO/pathway analysis, etc.

The STAR developers suggest using an assembly file that excludes patches and haplotypes; in Ensembl these are the "...dna.primary_assembly.fa.gz" files. For the current B. taurus assembly release (ARS-UCD1.2) there is no combined "primary_assembly.fa.gz" file for the whole reference, only separate files for each chromosome (from what I can see, all <100 MB in size).

So my question: in the absence of a combined primary assembly file for the whole reference genome, should I

a) just cat all the chromosome files into one big FASTA file and proceed with that (I'm not sure whether this would prevent me from using the associated .gtf annotation file downstream), or
b) rely on the "toplevel.fa" file (~800 MB, around the expected size)?
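
For option (a), here is a minimal Python sketch of what the concatenation and a sanity check against the GTF might look like. It assumes the per-chromosome files are gzipped FASTA and that the first word of each header is the name used in column 1 of the GTF. The function names and the file names in the commented usage are hypothetical placeholders, not anything prescribed by STAR or Ensembl.

```python
import gzip


def concat_fastas(chrom_files, out_path):
    """Concatenate gzipped per-chromosome FASTA files into one plain FASTA."""
    with open(out_path, "wb") as out:
        for path in chrom_files:
            last = b"\n"
            with gzip.open(path, "rb") as src:
                # Stream in 1 MiB chunks so whole chromosomes are never held in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            # Guard against a file without a trailing newline, which would
            # otherwise glue the next '>' header onto the last sequence line.
            if last != b"\n":
                out.write(b"\n")


def fasta_names(fasta_path):
    """Return the set of sequence names (first word of each '>' header)."""
    names = set()
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                names.add(line[1:].split()[0])
    return names


def gtf_names(gtf_path):
    """Return the set of sequence names found in column 1 of a GTF file."""
    names = set()
    with open(gtf_path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


# Hypothetical usage (placeholder file names, adjust to the actual download):
# concat_fastas(["chr1.fa.gz", "chr2.fa.gz"], "combined.fa")
# missing = gtf_names("annotation.gtf") - fasta_names("combined.fa")
# print("GTF sequences absent from the FASTA:", sorted(missing))
```

Any names reported as missing would be annotated sequences (for example unplaced scaffolds) that the combined FASTA lacks, so features on them could not receive counts.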

I realise similar questions have been asked here (e.g. "How to select a gene annotation file in ensembl that contain un-localized scaffolds, but NO patches or haplotypes?"), but I'm interested in this particular case, i.e. a bovine assembly with no complete primary assembly file. The human reference, for example, does have a primary assembly file for the whole genome. The dataset I'm working on is also of poor quality due to the wet-lab approach (no mRNA enrichment), so I just want to confirm that I'm taking the best approach.

Thanks a lot

RNA-Seq ensembl STAR alignment Assembly • 1.5k views

updated 14 months ago by AHerik ▴ 20 • written 6.2 years ago by eoin ▴ 30

Hi! Was there ever a consensus on what should be done in cases like this? Thanks!

14 months ago by AHerik ▴ 20