Drosophila genome indexing using STAR for mapping reads
2.9 years ago
ishento • 0

Could anyone let me know which FASTA files can be used to build a genome index for analysis with STAR? 1- In Ensembl there are 10 unmasked FASTA files (chromosomes 2L, 2R, 3L, 3R, 4, X, Y, nonchromosomal, mitochondrion genome, and toplevel): ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/dna/. Which files should be included? Or is the dna_index the one that should be used: ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/dna_index/

2- In FlyBase there are also several FASTA files: ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.21_FB2018_02/fasta/ Which files should be used to build the genome index for mapping reads with STAR?

This has been discussed several times; have a look at this post.

I opened the old post, but I am still confused. I have 7 files for chromosomes, plus nonchromosomal, mitochondrion, and toplevel. Is it right to use all of them?

Okay, I will do that. Thanks for your response.

In Ensembl, I think the toplevel file is the one that should be used. Am I right?

I am still confused. I have 7 files for chromosomes, plus nonchromosomal, mitochondrion, and toplevel. Is it right to use all of them?

2.9 years ago

If you take a look at this link: ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/dna/

At the bottom there is a README file, which says:

TOPLEVEL

These files contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromsomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

If you want all the information about the current genome, take Drosophila_melanogaster.BDGP6.dna.toplevel.fa.gz, and only this file, to create your index.
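For anyone who wants to script this step, here is a minimal sketch in Python, not a definitive recipe. It assumes STAR is installed and on your PATH, and it relies on the fact that STAR expects an uncompressed FASTA, so the .fa.gz is decompressed first. The GTF file name and the output directory are assumptions for illustration; use whatever annotation matches your FASTA release. The value 100 for --sjdbOverhang is a placeholder (the usual guidance is read length minus 1).

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(gz_path, out_path):
    """Write an uncompressed copy of a gzipped FASTA (STAR needs plain text)."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return Path(out_path)


def star_index_command(fasta, gtf, genome_dir, threads=4, overhang=100):
    """Build the argument list for a STAR genomeGenerate run (does not run it)."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", str(genome_dir),
        "--genomeFastaFiles", str(fasta),
        "--sjdbGTFfile", str(gtf),
        "--sjdbOverhang", str(overhang),
    ]


# Example: print the command so it can be inspected before running it.
cmd = star_index_command(
    "Drosophila_melanogaster.BDGP6.dna.toplevel.fa",
    "Drosophila_melanogaster.BDGP6.92.gtf",  # assumed annotation file name
    "star_index_BDGP6",                      # hypothetical output directory
)
print(" ".join(cmd))
```

Once the printed command looks right, the list can be passed to subprocess.run(cmd, check=True) after creating the output directory.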

NOTE: I still don't know why the size of Drosophila_melanogaster.BDGP6.dna.toplevel.fa.gz differs between ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/dna_index/ and ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/dna/ (see also this post: Fasta file and GTF file for STAR alignment).
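One way to check whether two such downloads hold the same sequences, regardless of how each was compressed, is to compare checksums of the decompressed content. The sketch below is only an illustration and does not explain the size difference by itself; the function names are made up for this example, and it assumes both files are gzip-compatible.

```python
import gzip
import hashlib


def decompressed_md5(gz_path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the decompressed bytes of a gzip file."""
    digest = hashlib.md5()
    with gzip.open(gz_path, "rb") as fh:
        # Read in chunks so a whole genome never has to sit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_sequences(path_a, path_b):
    """True if both gzip files decompress to byte-identical content."""
    return decompressed_md5(path_a) == decompressed_md5(path_b)
```

If same_sequences returns True for the copies from dna/ and dna_index/, the two files differ only in compression and either can be used once decompressed.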

The STAR manual states that "Generally, patches and alternative haplotypes should not be included in the genome", and I think toplevel includes haplotypes.

It depends on your downstream analysis. If you don't care about haplotypes and you want to do differential expression, you can go with the primary assembly. Otherwise, if you want to do variant calling, for example, you will have to take the toplevel to avoid false positives. See @Vijay's comment above (Filtering out chromosomes from reference genome).
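To see what a given toplevel file actually contains before deciding, you can list its sequence headers. This is a rough sketch that assumes Ensembl-style header lines, where the first field is the sequence name and the second looks like "dna:chromosome" or "dna:scaffold". That layout is an assumption here, so check a few headers by eye first.

```python
import gzip
from collections import Counter


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def header_summary(path):
    """Return (sequence names, Counter of sequence types) from FASTA headers."""
    names = []
    types = Counter()
    with open_fasta(path) as fh:
        for line in fh:
            if not line.startswith(">"):
                continue
            fields = line[1:].split()
            if not fields:
                continue  # skip an empty header line
            names.append(fields[0])
            # Assumed Ensembl layout: second field is e.g. "dna:chromosome".
            types[fields[1] if len(fields) > 1 else "unknown"] += 1
    return names, types
```

Calling header_summary on the toplevel FASTA and printing both results shows whether anything beyond chromosomes and unplaced scaffolds is present.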