Question: Which genome files to use for STAR?
Asked 3 months ago by Nico80 (University of Edinburgh, UK):

I am trying to build a genome index for use with STAR, and I am a bit confused about which files I should use.

According to the STAR manual (§2.2.1):

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome.

I have downloaded the following:

wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{1..22}.fa.gz
wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{MT,X,Y}.fa.gz

I have not downloaded the masked genomes (_rm and _sm), but what about the following files?

Homo_sapiens.GRCh38.dna.nonchromosomal.fa.gz: are these the scaffolds the STAR manual is talking about? The README file on the Ensembl FTP seems to imply that scaffolds are in seqlevel files, but I cannot see any.

Homo_sapiens.GRCh38.dna.toplevel.fa.gz: the README states that this

contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromsomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

So, according to the STAR manual, I should not include this. Is this correct?

Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz: this contains

all toplevel sequence regions excluding haplotypes and patches.

So could I just use this instead of the chromosome files above? Or should I use it in addition?


Just use Homo_sapiens.GRCh38.dna.primary_assembly.fa as the reference. It doesn't make sense to concatenate all the other files only to end up with the same file.
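
For illustration, a minimal sketch of building the index from that single file. It assumes STAR, wget and gunzip are on the PATH. The download URL is simply the FTP directory from the question combined with the primary_assembly file name. The function names, the thread count (8), the optional GTF argument and the --sjdbOverhang value (100) are placeholders chosen for this example, not values from this thread, so adjust them to your data.

```sh
#!/bin/sh
set -eu

# Assumption: same release-96 FTP directory as the chromosome files in the question.
FASTA_URL="ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"

# Download and decompress the primary assembly; the index step below is
# given the uncompressed FASTA.
fetch_primary_assembly() {
    wget "$FASTA_URL"
    gunzip "$(basename "$FASTA_URL")"
}

# build_star_index FASTA OUTDIR [GTF]
# Hypothetical helper: runs STAR in genomeGenerate mode on one FASTA file.
build_star_index() {
    fasta=$1
    outdir=$2
    gtf=${3:-}
    mkdir -p "$outdir"
    # Collect the STAR arguments in the function's positional parameters.
    set -- --runMode genomeGenerate \
           --runThreadN 8 \
           --genomeDir "$outdir" \
           --genomeFastaFiles "$fasta"
    # Only add the annotation options when a GTF file was supplied.
    if [ -n "$gtf" ]; then
        set -- "$@" --sjdbGTFfile "$gtf" --sjdbOverhang 100
    fi
    STAR "$@"
}

# Typical use (left commented out so nothing is downloaded by accident):
# fetch_primary_assembly
# build_star_index Homo_sapiens.GRCh38.dna.primary_assembly.fa star_index
```

The STAR manual suggests setting --sjdbOverhang to the read length minus one when an annotation is supplied, so 100 is only a stand-in here.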

Written 3 months ago by Benn

Thank you, Benn. Just out of curiosity, could you confirm whether my understanding of the different files is correct?

Written 3 months ago by Nico80

I don't know the answers to all your questions about what is in the different files. If you are interested, you can download them and look at their contents. The STAR manual tells us that Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz is an acceptable file to use, which is why I recommended it. Good luck with the mapping.
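
One way to look inside the files is to list the sequence names from the FASTA header lines, which shows whether a file holds chromosomes, scaffolds, or both. This is only a sketch: list_seq_names is a hypothetical helper, and it assumes the sequence name is the first word after ">" on each header line.

```sh
#!/bin/sh
set -eu

# list_seq_names FILE
# Print the first word of every FASTA header line (the sequence name).
# Works on plain or gzip-compressed FASTA, chosen by file extension.
list_seq_names() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '/^>/ { sub(/^>/, "", $1); print $1 }'
}

# Typical use (commented out; the files must be downloaded first):
# list_seq_names Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | sort > primary.txt
# list_seq_names Homo_sapiens.GRCh38.dna.nonchromosomal.fa.gz   | sort > nonchrom.txt
# comm -12 primary.txt nonchrom.txt   # names present in both files
```

Comparing the sorted name lists with comm, as in the commented usage, shows directly how much the files overlap.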

Written 3 months ago by Benn

You can get the reference genome here: https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/

Source: https://github.com/STAR-Fusion/STAR-Fusion/wiki (see the "Data Resources Required" section)

Written and modified 3 months ago by Ranan Jyoti Sarma