I am processing Illumina reads from many lanes. We are mainly interested in studying SNPs, recombination, etc. on chromosomes 2L, 2R, 3L, 3R, 4 and X. I have a basic question about mapping reads to the Drosophila genome: should I include the Het, U and Extra sequences in the reference, or exclude them and map only to the rest of the genome? How does this choice affect the results? I would appreciate your thoughts for or against.
The U (for unmapped) and Extra sequences are a mixture of unmapped heterochromatic scaffolds from the D. melanogaster whole genome shotgun assemblies of the y; cn, bw, sp strain. The Het sequences are heterochromatic scaffolds from the WGS whose sequences have been mapped to the Y chromosome or to extend the euchromatic chromosome arms (2L, 2R, 3L, 3R, 4 and X) and in some cases their sequence has been improved by BAC/plasmid sequencing. For more information see: http://www.fruitfly.org/sequence/README.RELEASE5
I suggest including the Het scaffolds in addition to the euchromatic arms in your mapping, since these reference sequences have been mapped/finished/annotated and contain known genes. However, because these scaffolds are highly repetitive, mapping to them may be tricky. See the following articles for more information: http://www.sciencemag.org/cgi/content/full/316/5831/1625 and http://www.sciencemag.org/cgi/content/full/316/5831/1586
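If you want to build a reference with only a chosen subset of sequences, here is a minimal Python sketch of the idea. The sequence names (`chr2L`, `chr2LHet`, and so on) are my assumption about how your FASTA headers are written, and `select_sequences` is just an illustrative helper, not part of any tool; check the headers in your own file before relying on it.

```python
# Sketch: keep only the FASTA records whose name is in a chosen set.
# The names below are ASSUMED header names; adjust them to match your file.
EUCHROMATIC = {"chr2L", "chr2R", "chr3L", "chr3R", "chr4", "chrX"}
HET = {"chr2LHet", "chr2RHet", "chr3LHet", "chr3RHet", "chrXHet", "chrYHet"}


def select_sequences(fasta_lines, wanted):
    """Yield the lines of every FASTA record whose name is in `wanted`.

    The record name is taken to be the first whitespace-delimited token
    after '>' on the header line.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            keep = name in wanted
        if keep:
            yield line


if __name__ == "__main__":
    # Tiny made-up FASTA, purely for illustration.
    demo = [
        ">chr2L\n", "ACGT\n",
        ">chr2LHet\n", "GGCC\n",
        ">chrU\n", "TTAA\n",
    ]
    for out_line in select_sequences(demo, EUCHROMATIC | HET):
        print(out_line, end="")
```

With a real file you would pass an open file handle as `fasta_lines` and write the yielded lines to a new FASTA before building the aligner index.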
As an aside, the U sequences also include a near-complete version of the y; cn, bw, sp mitochondrial genome. The mitochondrial genome served by UCSC (chrM) is from a different strain, see: http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=9554
Use all of them. If you want to know where a read came from, the only way to do that is to give the aligner every possible reference sequence (within the limits of how completely that reference has been sequenced). Many aligners will allow you to discard reads that map to more than 1 (or N) genomic locations (bowtie's -m 1 or gsnap's --npaths 1 for example) or to "probabilistically" (or uniformly) assign reads that map to multiple locations in the reference. Both of those features will give more accurate results with the full available reference.
As with most things, there's no substitute for trying it out. Run the alignment once with the sequences you mention and once without, then look at the differences. The reads whose placement changes might mark interesting sites to look at more closely.
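One way to do that comparison, sketched in Python under some assumptions: both runs produced SAM output, the reads are single-end (so each read name has one primary record), and you only care about the reference name and position of the primary alignment. The function names here are made up for illustration.

```python
# Sketch: compare primary read placements between two SAM outputs.
# ASSUMES single-end reads; paired-end mates share a name and would collide.

def primary_placements(sam_lines):
    """Map read name -> (reference name, position), or None if unmapped."""
    placements = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # not a complete alignment record
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        placements[qname] = None if flag & 0x4 else (rname, pos)
    return placements


def changed_reads(sam_a, sam_b):
    """Return {read name: (placement in A, placement in B)} where they differ.

    A read that is absent from one file is reported with None on that side.
    """
    a = primary_placements(sam_a)
    b = primary_placements(sam_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    arms_only = ["r1\t0\tchr2L\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
                 "r2\t0\tchr3R\t900\t255\t4M\t*\t0\t0\tGGCC\tIIII\n"]
    full_ref = ["r1\t0\tchr2L\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
                "r2\t0\tchrU\t500\t255\t4M\t*\t0\t0\tGGCC\tIIII\n"]
    for name, (before, after) in changed_reads(arms_only, full_ref).items():
        print(name, before, after)
```

Reads that land on a euchromatic arm only when the Het/U/Extra sequences are absent are the ones most likely to produce spurious SNP calls, so they are a reasonable place to start looking.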