Fasta file and GTF file for STAR alignment
3
0
2.9 years ago
snp87 ▴ 50

Hello, this is a very basic question, but I was wondering if someone could help me understand whether I've used the correct GTF and FASTA files for mouse genome indexing (I'm using STAR). I got both files from Ensembl: Mus_musculus.GRCm38.92.gtf.gz from ftp://ftp.ensembl.org/pub/release-92/gtf/mus_musculus/ and Mus_musculus.GRCm38.dna.toplevel.fa.gz from ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna_index/

Thank you so much!

STAR ensembl • 6.4k views
0

tagging: Emily_Ensembl

6
2.8 years ago
Erin_Ensembl ▴ 410

Hello there,

The top-level FASTA file includes chromosomes, regions not assembled into chromosomes, and N-padded haplotype/patch regions. See more here: ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna/README. If you are only looking for the chromosome-level sequences of the reference genome assembly, use the primary_assembly.fa file.
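To see what a given FASTA actually contains, one option is to tally its header lines. The sketch below is only illustrative: `tally_sequence_kinds` is a hypothetical helper, and it assumes that haplotype/patch regions carry a `CHR_` name prefix and that the second header field looks like `dna:chromosome` or `dna:scaffold`. Check both assumptions against your own file before relying on the counts.

```python
from collections import Counter
from typing import Iterable


def tally_sequence_kinds(fasta_lines: Iterable[str]) -> Counter:
    """Count FASTA records by a rough kind inferred from each header line."""
    kinds: Counter = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence lines are ignored, only headers are inspected
        fields = line[1:].split()
        if not fields:
            continue  # skip a malformed header with no sequence name
        name = fields[0]
        if name.startswith("CHR_"):
            # Assumption: haplotype/patch regions are named with a CHR_ prefix.
            kinds["haplotype_or_patch"] += 1
        elif len(fields) > 1 and fields[1] == "dna:chromosome":
            # Assumption: the second field gives the sequence type.
            kinds["chromosome"] += 1
        else:
            kinds["unplaced_or_other"] += 1
    return kinds
```

For a real download you could pass it an open handle, for example `gzip.open(path, "rt")`, since iterating over the handle yields lines. A non-zero `haplotype_or_patch` count would suggest you have the toplevel file rather than the primary assembly.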

The files in the dna_index directory are genomic sequence files which are bgzipped and tabix indexed (for more details on what this means see: http://www.htslib.org/doc/tabix.html). These are downloaded by the Variant Effect Predictor (VEP) installer to allow quicker VEP'ing. The fasta file without the .fai or .gzi suffix, although stated to be a different size, is identical to the fasta file in the fasta/mus_musculus/dna/ folder so you can download either and you'd get the same data.
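If you want to confirm for yourself that two differently sized downloads hold the same data, you can compare checksums of the decompressed content instead of the compressed files. This is a minimal sketch using only the Python standard library. It relies on bgzipped files being readable as ordinary gzip streams, and the function names are made up for illustration.

```python
import gzip
import hashlib


def decompressed_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a gzip/bgzip file's decompressed bytes."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as handle:
        # Read in chunks so a large genome does not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_sequence_data(path_a: str, path_b: str) -> bool:
    """True when both compressed files decompress to identical bytes."""
    return decompressed_sha256(path_a) == decompressed_sha256(path_b)
```

The compressed sizes can differ simply because the two files were compressed differently, so comparing file sizes or checksums of the `.gz` files themselves would not tell you much.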

We'll update the README files, or 'hide' the dna_index folder to avoid confusion between these files in the two folders. Thanks for bringing it to our attention!

1
2.8 years ago
swbarnes2 9.7k

From ensembl (emphasis mine)

## TOPLEVEL

These files contain all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

From the STAR manual (emphasis mine)

2.2.1 Which chromosomes/scaffolds/patches to include?

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome.

Examples of acceptable genome sequence files:

• ENSEMBL: files marked with .dna.primary.assembly, such as:

ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

So I'd say no, you don't have the right reference. Use "primary assembly" as recommended.
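For reference, here is a rough sketch of how the indexing call could be assembled once you have the primary assembly. The STAR options are the standard genome generation ones. The file names, output directory, read length, and thread count are placeholders I made up, and the helper `star_index_command` is hypothetical. As far as I know, STAR wants the FASTA and GTF decompressed first, so the paths below assume you have already gunzipped them.

```python
import shlex
from typing import List


def star_index_command(
    genome_dir: str,
    fasta: str,
    gtf: str,
    read_length: int = 101,
    threads: int = 4,
) -> List[str]:
    """Build (but do not run) a STAR genomeGenerate command line."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # The STAR manual suggests read length minus 1 for this option.
        "--sjdbOverhang", str(read_length - 1),
    ]


if __name__ == "__main__":
    # Placeholder paths: substitute your own decompressed files.
    cmd = star_index_command(
        genome_dir="star_index_GRCm38",
        fasta="Mus_musculus.GRCm38.dna.primary_assembly.fa",
        gtf="Mus_musculus.GRCm38.92.gtf",
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argument list separately makes it easy to print and inspect the command before handing it to `subprocess.run(cmd, check=True)`.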

0

Open question: if the OP wants to call variants, isn't it a bit dangerous to use only the "primary assembly"?

Let's say that chr6_FIXED is a fix for part of chr6 that will be incorporated in the next major release, and that this fix changes an A to a T. The primary assembly does not contain chr6_FIXED, but toplevel does.

Now suppose a read matches chr6_FIXED perfectly and matches chr6 with one mismatch. If you keep only the primary assembly, you could call a variant that you would never have called with toplevel, leading to a false positive.
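A toy illustration of that scenario, with made-up 8-base sequences and a naive mismatch count standing in for a real aligner (which scores alignments very differently):

```python
from typing import Dict, Tuple


def mismatches(read: str, ref: str) -> int:
    """Count differing positions between two equal-length sequences."""
    if len(read) != len(ref):
        raise ValueError("toy example only compares equal-length sequences")
    return sum(a != b for a, b in zip(read, ref))


def best_hit(read: str, references: Dict[str, str]) -> Tuple[str, int]:
    """Return (name, mismatch count) of the reference closest to the read."""
    hits = ((name, mismatches(read, seq)) for name, seq in references.items())
    return min(hits, key=lambda hit: hit[1])


# Invented sequences: the fix changes the fifth base from A to T.
primary = {"chr6": "ACGTACGA"}
toplevel = {"chr6": "ACGTACGA", "chr6_FIXED": "ACGTTCGA"}
read = "ACGTTCGA"

print(best_hit(read, toplevel))  # the read matches chr6_FIXED exactly
print(best_hit(read, primary))   # the read lands on chr6 with one mismatch
```

With only `primary` available, the single mismatch on chr6 is what a variant caller would see as a candidate A-to-T change.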

It's just a thought; I started a discussion here

0
2.9 years ago

If you want to analyse haplotypes, you have the right FASTA file.

The GTF is the right one.

Be careful: the chromosome names are not "standard" and some aligners could struggle with them. In your file chr1 is named 1, so you may have to rename each chromosome to chr1, chr2, etc.
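Before renaming anything, it may be worth checking whether the sequence names in the FASTA and in the first column of the GTF already agree with each other, since a mismatch between the two is what usually causes trouble. A small sketch with hypothetical helper names, taking any iterable of lines (for example an open file handle):

```python
from typing import Iterable, List, Set


def fasta_names(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(gtf_lines: Iterable[str]) -> Set[str]:
    """Collect the first tab-separated column, skipping comments and blanks."""
    names = set()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(
    fasta_lines: Iterable[str], gtf_lines: Iterable[str]
) -> List[str]:
    """Names used in the GTF that have no matching FASTA record."""
    return sorted(gtf_names(gtf_lines) - fasta_names(fasta_lines))
```

An empty result means every annotated sequence has a counterpart in the genome file. If both files come from the same Ensembl release I would expect them to agree, but that is an expectation, not something this thread verified.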

0

Thanks for your reply. I'm not sure I understand what you mean. I'm hoping to do a differential expression analysis after the alignments. For the FASTA files there were different options: cdna, cds, dna, dna_index, ncrna and pep (ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/). I went with the FASTA file in dna_index, but I'm not really sure if this is what should be done. Does anyone know how to decide this?

0

Sorry for my late reply,

The file you need is Mus_musculus.GRCm38.dna.toplevel.fa.gz

Unfortunately, this file exists with two different file sizes: once in ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna_index/ and once in ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna/

Maybe try contacting Emily_Ensembl, who is the person to contact for Ensembl questions.

Try adding the ensembl tag to your post's tags. I bet she follows this tag.