Question

Which Fasta file and GTF file to use in STAR alignment

4

3.0 years ago

peru ▴ 40

Hello, this is a very basic question, but could someone help me confirm that I've used the correct GTF and FASTA files for indexing the mouse genome? I got both files from Ensembl:

GTF: ftp.ensembl.org/pub/release-103/gtf/mus_musculus/Mus_musculus.GRCm39.103.gtf.gz

FASTA: ftp.ensembl.org/pub/release-103/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz

Or should I use Mus_musculus.GRCm39.dna.toplevel.fa.gz as the FASTA file when generating the genome index in STAR? The command I used:

STAR --runMode genomeGenerate --runThreadN 8 --genomeDir index_reference --genomeFastaFiles Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm39.103.gtf

Thank you for your help!

GTF fasta ENSEMBL STAR • 5.5k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 3.0 years ago by peru ▴ 40

3

Should I use Mus_musculus.GRCm39.dna.toplevel.fa.gz as the FASTA file when generating the genome index in STAR?

The toplevel file normally also contains haplotypes, each with the genome padded out to full length. For mouse GRCm39, the toplevel file appears to be the same as the primary assembly, so either could be used. The primary assembly is the safe bet.
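One rough way to check whether two FASTA files really hold the same content is to compare their sequence counts and total base counts. The sketch below is only an illustration: `fasta_summary` is a hypothetical helper name, and it assumes a standard FASTA layout (header lines starting with `>`) in either plain or gzip-compressed form.

```shell
# Print "<number of sequences><TAB><total bases>" for a FASTA file.
# Accepts plain or gzip-compressed input (decided by the .gz suffix).
fasta_summary() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '
        /^>/ { n++; next }              # header line: count one sequence
        { bases += length($0) }         # sequence line: add its length
        END { printf "%d\t%.0f\n", n + 0, bases + 0 }'
}

# Example usage (file names from the question):
# fasta_summary Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
# fasta_summary Mus_musculus.GRCm39.dna.toplevel.fa.gz
```

Identical counts for both files suggest, but do not prove, that the files are equivalent; a checksum of the decompressed content would be a stricter test.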

ADD REPLY • link 3.0 years ago by GenoMax 141k

2

Hi @peru,

Yes, it looks good. You can always refer to the STAR manual, section 2.2, subsection 2.2.1. In general, you have to make sure the chromosome names in your genome FASTA and in your GTF are identical (e.g. chr1 vs 1). Since you got both the FASTA and the GTF from the same source (Ensembl) and the same genome release (GRCm39), you should be fine.
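If you want to verify the naming yourself, a minimal sketch along these lines lists any sequence names that appear in the GTF but not in the FASTA. The function names are hypothetical, and the sketch assumes both files are already uncompressed.

```shell
# Sequence names in a FASTA file (first word of each ">" header line).
fasta_seqnames() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | LC_ALL=C sort -u
}

# Sequence names in column 1 of a GTF file (comment and empty lines skipped).
gtf_seqnames() {
    awk -F '\t' '!/^#/ && NF { print $1 }' "$1" | LC_ALL=C sort -u
}

# Print GTF sequence names that are absent from the FASTA.
# No output means every annotated sequence exists in the genome file.
gtf_names_missing_from_fasta() {
    fa_list=$(mktemp)
    gtf_list=$(mktemp)
    fasta_seqnames "$1" > "$fa_list"
    gtf_seqnames "$2" > "$gtf_list"
    LC_ALL=C comm -13 "$fa_list" "$gtf_list"   # lines unique to the GTF list
    rm -f "$fa_list" "$gtf_list"
}

# Example usage (file names from the question):
# gtf_names_missing_from_fasta Mus_musculus.GRCm39.dna.primary_assembly.fa Mus_musculus.GRCm39.103.gtf
```

A mismatch such as `chr1` in one file and `1` in the other would show up here as a list of unmatched names.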

ADD REPLY • link 3.0 years ago by opplatek ▴ 290

score 4 · Accepted Answer · 2021-04-15

4

3.0 years ago

Iván ▴ 60

The genome generation command looks good. Also consider including the --sjdbOverhang parameter. The default value of 100 is usually fine, but as stated in STAR's manual, the ideal value is the maximum read length in your dataset minus 1. So, if you have 100 nt reads, --sjdbOverhang 99 is the ideal value.

If you performed trimming and have variable read lengths, choose this value as max(ReadLength) - 1. So if your reads vary from 50 to 120 nt, --sjdbOverhang 119 would be the optimal value.
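To find that value without guessing, something like the following sketch could scan the FASTQ files and report max(ReadLength) - 1. The helper name `sjdb_overhang` is made up for this example, and it assumes standard four-line FASTQ records in plain or gzip-compressed files.

```shell
# Print max(read length) - 1 across one or more FASTQ files.
# Accepts plain or gzip-compressed files (decided by the .gz suffix).
# Assumes four lines per record, with the sequence on the second line.
sjdb_overhang() {
    for fq in "$@"; do
        case "$fq" in
            *.gz) gzip -dc "$fq" ;;
            *)    cat "$fq" ;;
        esac
    done | awk '
        NR % 4 == 2 { if (length($0) > max) max = length($0) }
        END { if (max > 0) print max - 1 }'
}

# Example usage (hypothetical file names):
# sjdb_overhang sample_R1.fastq.gz sample_R2.fastq.gz
```

The printed number can then be passed to --sjdbOverhang when generating the index.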

ADD COMMENT • link 3.0 years ago by Iván ▴ 60

1

You don't have to change the default --sjdbOverhang value unless you have very short reads (less than 50 bp). If your reads are longer than that, don't waste your time and just use the default value (100).

ADD REPLY • link 3.0 years ago by opplatek ▴ 290

0

I understand now that the FASTA and GTF files were correct. In addition, I will be aligning RNA-seq reads with a maximum length of 101 bp, so my command will look like this:

STAR --runMode genomeGenerate --runThreadN 8 --genomeDir index_reference --genomeFastaFiles Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm39.103.gtf --sjdbOverhang 100

Thanks for all the replies!

ADD REPLY • link 3.0 years ago by peru ▴ 40