Which genome files to use for STAR?
3.9 years ago
Nico80 ▴ 60

I am trying to build a genome index for use with STAR, and I am a bit confused about which files I should use.

According to the STAR manual (§2.2.1):

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome.

wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{1..22}.fa.gz
wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{MT,X,Y}.fa.gz


Homo_sapiens.GRCh38.dna.nonchromosomal.fa.gz: are these the scaffolds the STAR manual is talking about? The README file on the Ensembl FTP seems to imply that scaffolds are in seqlevel files, but I cannot see any.

Homo_sapiens.GRCh38.dna.toplevel.fa.gz: the README states that this

contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

So, according to the STAR manual, I should not include this. Is this correct?

Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz: this contains

all toplevel sequence regions excluding haplotypes and patches.

So could I just use this instead of the chromosome files above? Or should I use it in addition?

alignment star genome • 5.2k views

Just use Homo_sapiens.GRCh38.dna.primary_assembly.fa as the reference. It doesn't make sense to concatenate all the other files just to end up with the same file.
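
For anyone following along, here is a rough sketch of what building the index from that single file could look like. The helper name `build_star_index`, the output directory, and the thread count are made up for illustration. The STAR options shown are the standard genome-generation ones, but check them against the manual for your STAR version. This assumes the FASTA has already been decompressed, since STAR expects plain-text FASTA here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STAR): build a STAR genome index
# from one uncompressed FASTA file.
#   $1 = path to the uncompressed genome FASTA
#   $2 = output directory for the index
#   $3 = number of threads (optional; defaults to 4, an arbitrary choice)
build_star_index() {
    local fasta="$1"
    local genome_dir="$2"
    local threads="${3:-4}"

    # Make sure the output directory exists before STAR writes into it.
    mkdir -p "$genome_dir"

    STAR --runMode genomeGenerate \
         --genomeDir "$genome_dir" \
         --genomeFastaFiles "$fasta" \
         --runThreadN "$threads"
}

# Example usage (commented out because it is a large download and a
# memory-hungry job for the human genome):
# wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# build_star_index Homo_sapiens.GRCh38.dna.primary_assembly.fa star_index 8
```

If you also have a matching annotation, STAR can take it at this step through `--sjdbGTFfile` and `--sjdbOverhang`. That part is left out here because the thread is only about the FASTA choice.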


Thank you, Benn. Just out of curiosity, could you confirm whether my understanding of what the different files contain is correct?


I don't know the answers to all your questions about what is or isn't in the different files. If you are interested, you can download them and see what's in them. The STAR manual tells us that Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz is an acceptable file to use, which is why I recommended it. Good luck with the mapping.
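
One quick way to see what's in a file is to list its FASTA header lines without unpacking the whole thing. This is only a sketch: `list_fasta_headers` is a made-up helper name, and it assumes the second whitespace-separated field of each Ensembl header describes the sequence type (something like `dna:chromosome` or `dna:scaffold`), so verify that against the actual headers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<sequence name><TAB><second header field>"
# for every record in a gzip-compressed FASTA file.
#   $1 = path to a .fa.gz file
list_fasta_headers() {
    # Stream-decompress, keep only header lines (those starting with ">"),
    # strip the leading ">" from the name, and print the first two fields.
    gzip -dc "$1" | awk '/^>/ { print substr($1, 2) "\t" $2 }'
}

# Example usage (commented out; assumes the files have been downloaded):
# list_fasta_headers Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# list_fasta_headers Homo_sapiens.GRCh38.dna.nonchromosomal.fa.gz | cut -f 2 | sort | uniq -c
```

Comparing the name lists from the per-chromosome files plus the nonchromosomal file against the primary_assembly file would show whether they really cover the same sequences.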


You can also get the reference genome here: https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/