Just a general question regarding the choice of assembly version for RNA-seq mapping. I have a Bos taurus RNA-seq dataset, and my pipeline uses STAR to align reads to the reference assembly, followed by featureCounts to count reads per gene. The end goal is differential gene expression analysis, GO/pathway analysis etc.
The STAR developers suggest using an assembly file that excludes patches and haplotypes, and in Ensembl these are the "*.dna.primary_assembly.fa.gz" files. For the current B. taurus assembly release (ARS-UCD1.2) there is no combined "primary_assembly.fa.gz" file for the whole reference, only separate files for each chromosome (from what I can see, all <100 MB in size).
So my question: in the absence of a combined primary assembly file for the whole reference genome, should I
- a) just cat all the chromosome files into one big FASTA file and proceed with that (I'm not sure whether this will prevent me from using the associated .gtf annotation file downstream), or
- b) rely on the "toplevel.fa" file (~800 MB, around the expected size)?
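To make option (a) concrete, here is a minimal Python sketch of what I have in mind. It concatenates the per-chromosome files into one uncompressed FASTA (as far as I understand, STAR wants plain-text FASTA for genome generation) and then lists any sequence names used in the GTF that are absent from the combined FASTA, which is my main worry about the annotation. The function names are my own, and the check assumes the FASTA header's first token and the GTF's first column use the same naming scheme.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def concat_fastas(chrom_paths, out_path):
    """Write all per-chromosome FASTA files into one uncompressed FASTA.

    Returns the sequence names in the order they were written.
    """
    names = []
    with open(out_path, "wt") as out:
        for path in chrom_paths:
            with open_text(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # Sequence name = first whitespace-delimited token after ">"
                        tokens = line[1:].split()
                        names.append(tokens[0] if tokens else "")
                    # Make sure every line ends with a newline so that
                    # consecutive files do not run together
                    out.write(line if line.endswith("\n") else line + "\n")
    return names


def gtf_seqnames(gtf_path):
    """Collect the distinct sequence names (column 1) used in a GTF file."""
    seqnames = set()
    with open_text(gtf_path) as handle:
        for line in handle:
            # Skip header/comment lines and blank lines
            if line.startswith("#") or not line.strip():
                continue
            seqnames.add(line.split("\t", 1)[0])
    return seqnames


def missing_from_fasta(fasta_names, gtf_path):
    """Return GTF sequence names that have no matching FASTA record."""
    return sorted(gtf_seqnames(gtf_path) - set(fasta_names))
```

I would call `concat_fastas` on the list of downloaded chromosome files and then `missing_from_fasta` with the returned names and the GTF path. A non-empty result would mean some annotated features sit on sequences (e.g. unplaced scaffolds) that the concatenated file does not contain.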
I realise similar questions have been asked here (e.g. "How to select a gene annotation file in ensembl that contain un-localized scaffolds, but NO patches or haplotypes?") but I'm interested in this particular case (i.e. a bovine assembly with no complete primary assembly file). For example, the human reference assembly does have a primary assembly file for the whole genome. The dataset I'm currently working on is also of poor quality due to the wet lab approach (no mRNA enrichment), so I just want to check that I'm taking the best approach.
Thanks a lot