Which Fasta file and GTF file to use in STAR alignment
20 days ago
peru • 0

Hello, this is a very basic question, but could someone help me check whether I've used the correct GTF and FASTA files for indexing the mouse genome? I got both files from Ensembl:
GTF: ftp.ensembl.org/pub/release-103/gtf/mus_musculus/Mus_musculus.GRCm39.103.gtf.gz
FASTA: ftp.ensembl.org/pub/release-103/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz

Or should I use Mus_musculus.GRCm39.dna.toplevel.fa.gz as the FASTA when generating the genome index in STAR? The command is:
STAR --runMode genomeGenerate --runThreadN 8 --genomeDir index_reference --genomeFastaFiles Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm39.103.gtf

STAR ENSEMBL GTF fa • 297 views

Hi @peru,

Yes, it looks good. You can always refer to the STAR manual, section 2.2, subsection 2.2.1. In general, you have to make sure the chromosome names in your genome FASTA and in your GTF are identical (for example, chr1 in one file and 1 in the other will not match). Since you got both the FASTA and the GTF from the same source (Ensembl) and the same genome assembly (GRCm39), you should be fine.
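If you want to verify the naming yourself, here is a minimal sketch of one way to do it. The function names are made up for illustration, and it assumes both files are already decompressed and that sequence names are the FASTA header text up to the first space.

```shell
# Hypothetical helpers, not part of STAR. Both inputs must be uncompressed.

# Print the unique sequence names from FASTA headers (text up to the first space).
fasta_seqnames() {
    grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1 | LC_ALL=C sort -u
}

# Print the unique values of GTF column 1, skipping '#' comment lines.
gtf_seqnames() {
    grep -v '^#' "$1" | cut -f1 | LC_ALL=C sort -u
}

# Print GTF sequence names that are absent from the FASTA.
# Empty output means every annotated sequence exists in the genome file.
gtf_names_missing_from_fasta() {
    fa_names=$(mktemp)
    gtf_names=$(mktemp)
    fasta_seqnames "$1" > "$fa_names"
    gtf_seqnames "$2" > "$gtf_names"
    # comm -13 keeps only the lines unique to the second file (the GTF names)
    LC_ALL=C comm -13 "$fa_names" "$gtf_names"
    rm -f "$fa_names" "$gtf_names"
}
```

For the files in the question you would call it as: gtf_names_missing_from_fasta Mus_musculus.GRCm39.dna.primary_assembly.fa Mus_musculus.GRCm39.103.gtf. If it prints nothing, the names agree.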

"Or should I use Mus_musculus.GRCm39.dna.toplevel.fa.gz as the FASTA when generating the genome index in STAR?"

The toplevel file normally contains haplotypes, with the genome padded out to full length for each one. In the case of GRCm39 mouse, the toplevel file appears to be the same as the primary assembly, so either could be used. The primary assembly is the safe bet.

20 days ago
Iván ▴ 50

The genome generation command looks good. Also consider including the --sjdbOverhang parameter. While the default value of 100 is usually fine, STAR's manual states that the ideal value is your maximum read length minus 1. So, if you have 100 nt reads, --sjdbOverhang 99 is ideal.

If you performed trimming and have variable read lengths, choose this value as max(readLength) - 1. So if your reads vary from 50 to 120 nt, --sjdbOverhang 119 would be the optimal setting.
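To get that number from your own data, something like the following sketch should work. The function name is hypothetical, and it assumes an uncompressed FASTQ with standard 4-line records (sequence on the second line of each record).

```shell
# Hypothetical helper, not part of STAR.
# Print (longest read length - 1) for an uncompressed 4-line-per-record FASTQ.
suggest_sjdb_overhang() {
    # Sequence lines are lines 2, 6, 10, ... so NR % 4 == 2 selects them
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { print max - 1 }' "$1"
}
```

You could then pass the printed value to --sjdbOverhang when generating the index. For paired-end data, run it on both mate files and take the larger value.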

I understand now that the FASTA and GTF files were correct. In addition, I will align RNA-seq reads with a maximum length of 101 bp, so my command will be as follows.

STAR --runMode genomeGenerate --runThreadN 8 --genomeDir index_reference --genomeFastaFiles Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm39.103.gtf --sjdbOverhang 100

Thanks for all the replies!

You don't have to change the default --sjdbOverhang value unless you have very short reads (less than 50 bp; see here). If your reads are longer than that, don't waste your time and just use the default value (100).