Question: Fasta file and GTF file for STAR alignment
snp87 wrote, 9 months ago:

Hello, this is a very basic question but I was wondering if someone could help me understand if I've used the correct GTF file and Fasta file for the mouse genome indexing (I'm using STAR). I got the relevant Fasta file and GTF file from ensembl: Mus_musculus.GRCm38.92.gtf.gz from ftp://ftp.ensembl.org/pub/release-92/gtf/mus_musculus/ and Mus_musculus.GRCm38.dna.toplevel.fa.gz from ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna_index/

Thank you so much!


tagging: Emily_Ensembl

written 9 months ago by genomax
Erin_Ensembl (EMBL-EBI) wrote, 9 months ago:

Hello there,

The top-level FASTA file includes chromosomes, regions not assembled into chromosomes, and N-padded haplotype/patch regions. See more here: ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna/README. If you are only looking for the chromosome-level sequences of the reference genome assembly, use the primary_assembly.fa file.
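To see what a given FASTA file actually contains, one option is to tally its header lines. The following is a minimal sketch, assuming Ensembl-style headers of the form `>NAME dna:TYPE ...` (for example `dna:chromosome` or `dna:scaffold`); the function names are illustrative and not part of any Ensembl tool.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a plain-text or gzip-compressed file for reading as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def tally_fasta_headers(path):
    """Count FASTA records by the second header field (e.g. 'dna:chromosome')."""
    counts = Counter()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                # Assumption: Ensembl-style header, '>NAME dna:TYPE ...'.
                kind = fields[1] if len(fields) > 1 else "unknown"
                counts[kind] += 1
    return counts


# Example call (the file name is whatever you downloaded):
# print(tally_fasta_headers("Mus_musculus.GRCm38.dna.toplevel.fa.gz"))
```

Running it on both the toplevel and the primary_assembly file should make the difference in sequence counts visible without opening the files by hand.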

The files in the dna_index directory are genomic sequence files which are bgzipped and tabix indexed (for more details on what this means see: http://www.htslib.org/doc/tabix.html). These are downloaded by the Variant Effect Predictor (VEP) installer to allow quicker VEP'ing. The fasta file without the .fai or .gzi suffix, although stated to be a different size, is identical to the fasta file in the fasta/mus_musculus/dna/ folder so you can download either and you'd get the same data.
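If you want to confirm for yourself that the two downloads hold the same sequence data despite their different sizes, you can compare checksums of the decompressed content. This is a rough sketch, relying on the fact that bgzip output can be read by ordinary gzip readers; the function names are made up for illustration.

```python
import gzip
import hashlib


def md5_of_decompressed(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the decompressed content of a .gz file."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte genome does not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both compressed files decompress to identical bytes."""
    return md5_of_decompressed(path_a) == md5_of_decompressed(path_b)


# Example call (local paths are placeholders for the two downloaded copies):
# print(same_content("dna/toplevel.fa.gz", "dna_index/toplevel.fa.gz"))
```

Comparing the compressed files directly would not work here, because the same data compressed with different tools or settings gives different bytes on disk.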

We'll update the README files, or 'hide' the dna_index folder to avoid confusion between these files in the two folders. Thanks for bringing it to our attention!

Bastien Hervé (Limoges, CBRS, France) wrote, 9 months ago:

If you want to analyse haplotypes, you have the right FASTA file.

The GTF is the right one.

Be careful: the chromosome names are not "standard" and could cause problems for some aligners. In your file chr1 is named 1, so you may have to rename each chromosome to chr1, chr2, etc.
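Before renaming anything, it may be worth checking whether the sequence names in the FASTA and the GTF already agree with each other, since that agreement is what the aligner depends on. A minimal sketch, with illustrative function names, that reports GTF sequence names absent from the FASTA:

```python
import gzip


def open_text(path):
    """Open a plain-text or gzip-compressed file for reading as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_names(path):
    """Collect sequence names (first word after '>') from a FASTA file."""
    names = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_names(path):
    """Collect sequence names (first tab-separated column) from a GTF file."""
    names = set()
    with open_text(path) as handle:
        for line in handle:
            # Skip header/comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def gtf_names_missing_from_fasta(fasta_path, gtf_path):
    """Sorted list of sequence names used in the GTF but not found in the FASTA."""
    return sorted(gtf_names(gtf_path) - fasta_names(fasta_path))
```

An empty list means every annotated sequence exists in the genome file. A list full of names like `chr1` against a FASTA that uses `1` would point to the naming mismatch described above.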


Thanks for your reply. I'm not sure I understand what you mean. I'm hoping to do a differential expression analysis after the alignments. There were several options for the FASTA files: cdna, cds, dna, dna_index, ncrna and pep (ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/). I went with the FASTA file in dna_index, but I'm not really sure if this is what should be done. Does anyone know how to decide between them?

written 9 months ago by snp87

Sorry for my late reply.

The file you need is Mus_musculus.GRCm38.dna.toplevel.fa.gz

Unfortunately, this file exists with two different file sizes: once in ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna_index/ and once in ftp://ftp.ensembl.org/pub/release-92/fasta/mus_musculus/dna/

Maybe try to contact Emily_Ensembl, who is the person to contact for Ensembl questions.

Try adding the tag ensembl to your post's tags. I bet that she is following this tag.

modified 9 months ago • written 9 months ago by Bastien Hervé
swbarnes2 (United States) wrote, 9 months ago:

From ensembl (emphasis mine)

TOPLEVEL

These files contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromsomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

From the STAR manual (emphasis mine)

2.2.1 Which chromosomes/scaffolds/patches to include?

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome. Examples of acceptable genome sequence files:

• ENSEMBL: files marked with .dna.primary.assembly, such as: ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

So I'd say no, you don't have the right reference. Use "primary assembly" as recommended.
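For reference, here is a hedged sketch of how the STAR index command could be assembled once the primary assembly FASTA and the GTF are downloaded and decompressed. The flags are the genome-generation options from the STAR manual; the paths, read length, thread count, and the helper function itself are placeholders, and the manual's guidance is to set --sjdbOverhang to read length minus 1.

```python
import shlex


def star_index_command(genome_dir, fasta, gtf, read_length=101, threads=4):
    """Build the argument list for a STAR genomeGenerate run.

    All arguments are placeholders chosen for illustration; fasta and gtf
    are expected to be uncompressed files.
    """
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", str(genome_dir),
        "--genomeFastaFiles", str(fasta),
        "--sjdbGTFfile", str(gtf),
        # STAR manual: ideally read length minus 1.
        "--sjdbOverhang", str(read_length - 1),
        "--runThreadN", str(threads),
    ]


# Example: print a copy-pasteable command line (file names are assumptions).
# cmd = star_index_command(
#     "star_index",
#     "Mus_musculus.GRCm38.dna.primary_assembly.fa",
#     "Mus_musculus.GRCm38.92.gtf",
# )
# print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` if you prefer to launch STAR from Python rather than from the shell.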


Open discussion: if the OP wants to call variants, isn't it a bit dangerous to use only the "primary assembly"?

Let's say that chr6_FIXED is a fixed part of chr6 that will be added in the next major release, and that this fix changes an A to a T. The primary assembly does not contain chr6_FIXED, but toplevel does.

Suppose a read has a perfect match on chr6_FIXED and a match with one mismatch on chr6. If you keep only the primary assembly, you could call a variant that you would never have called with toplevel, leading to a false positive result.

It's just a thought; I started a discussion here

modified 9 months ago • written 9 months ago by Bastien Hervé