How To Decide Whether To Keep Het, U And Extra When Mapping To The Drosophila Genome
13.5 years ago
Rm 8.3k

I am processing Illumina reads from many lanes. We are mainly interested in studying SNPs, recombination, etc. on the chromosome arms 2L, 2R, 3L, 3R, 4 and X. I have a basic question about mapping reads to the Drosophila genome: should I include the Het, U and Extra sequences in the reference when mapping, or exclude them and map only to the rest of the genome? How does this choice affect the results? I would appreciate your thoughts for or against.

mapping next-gen sequencing genome chromosome illumina • 5.9k views
13.5 years ago

The U (for unmapped) and Extra sequences are a mixture of unmapped heterochromatic scaffolds from the D. melanogaster whole genome shotgun assemblies of the y; cn, bw, sp strain. The Het sequences are heterochromatic scaffolds from the WGS whose sequences have been mapped to the Y chromosome or to extend the euchromatic chromosome arms (2L, 2R, 3L, 3R, 4 and X) and in some cases their sequence has been improved by BAC/plasmid sequencing. For more information see: http://www.fruitfly.org/sequence/README.RELEASE5

I suggest including the Het scaffolds in addition to the euchromatic arms in your mapping, since these reference sequences have been mapped/finished/annotated and contain known genes. However, as these scaffolds contain a high repeat abundance, mapping to these scaffolds may be tricky. See the following articles for more information: http://www.sciencemag.org/cgi/content/full/316/5831/1625 & http://www.sciencemag.org/cgi/content/full/316/5831/1586

As an aside, the U sequences also include a near-complete version of the y; cn, bw, sp mitochondrial genome. The mitochondrial genome served by UCSC (chrM) is from a different strain, see: http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=9554
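To build the reference suggested above (euchromatic arms plus Het scaffolds, without U and Uextra), one option is to filter the genome FASTA by sequence name before indexing. The following is only a minimal sketch: the names `chrU` and `chrUextra` assume UCSC-style headers, and the function name `filter_fasta` is made up for illustration, so check the headers in your own file first.

```python
import io

# ASSUMPTION: UCSC-style sequence names; adjust to the headers in your FASTA.
EXCLUDE = {"chrU", "chrUextra"}

def filter_fasta(lines, exclude=EXCLUDE):
    """Yield FASTA lines, skipping every record whose name is in `exclude`."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = name not in exclude
        if keep:
            yield line

# Toy input standing in for a real genome file.
demo = io.StringIO(
    ">chr2L\nACGT\n>chr2LHet\nGGCC\n>chrU\nTTTT\n>chrUextra\nAAAA\n>chrX\nCCGG\n"
)
kept = "".join(filter_fasta(demo))
kept_names = [l[1:].strip() for l in kept.splitlines() if l.startswith(">")]
print(kept_names)
```

For a real file, pass an open file handle instead of the `io.StringIO` object and write the yielded lines to a new FASTA.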

13.5 years ago
brentp 24k

Use all of them. If you want to know where a read came from, the only way to do that is to give the aligner all possible references (within the limits of how completely the reference has been sequenced). Many aligners will allow you to discard reads that map to more than 1 (or N) genomic locations (bowtie's -m 1 or gsnap's --npaths 1, for example) or to "probabilistically" (or uniformly) assign reads that map to multiple locations in the reference. Both of those features will be more accurate given the full available reference.
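A toy illustration of why the full reference matters for uniqueness filtering: the sketch below keeps only reads with exactly one reported location, which mimics the effect of an option like `-m 1` but is not how any aligner implements it. The hit list is invented data, and `unique_hits` is a hypothetical helper name.

```python
from collections import defaultdict

def unique_hits(hits):
    """Return {read_id: (chrom, pos)} for reads that have exactly one location.

    `hits` is an iterable of (read_id, chrom, pos) tuples.
    """
    by_read = defaultdict(set)
    for read_id, chrom, pos in hits:
        by_read[read_id].add((chrom, pos))
    return {r: next(iter(locs)) for r, locs in by_read.items() if len(locs) == 1}

# Invented hits against the full reference: r2 also matches a U scaffold.
full_hits = [
    ("r1", "chr2L", 100),
    ("r2", "chr3R", 500),
    ("r2", "chrU", 42),
]
# The same hits as they would appear if chrU were left out of the reference.
arms_only_hits = [h for h in full_hits if h[1] != "chrU"]

unique_full = unique_hits(full_hits)
unique_arms = unique_hits(arms_only_hits)
print(sorted(unique_full))
print(sorted(unique_arms))
```

With the reduced reference, r2 looks uniquely placed on chr3R; with the full reference it is flagged as ambiguous and dropped.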

As with most things, there's no substitute for trying it out. Try the alignment once including the sequences you mention, and once without, and find the differences--those might actually be interesting sites to look at more closely.
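One way to find those differences is to compare where each read lands in the two runs. This is a rough sketch that assumes single-end reads and plain-text SAM output from both runs. The helper names and the example records are made up for illustration.

```python
def primary_positions(sam_lines):
    """Map read name -> (reference name, position) for mapped primary alignments."""
    placements = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        placements[fields[0]] = (fields[2], int(fields[3]))
    return placements

def changed_reads(with_extra, without_extra):
    """Return {read: (placement_with, placement_without)} where the two runs differ."""
    a = primary_positions(with_extra)
    b = primary_positions(without_extra)
    return {r: (a.get(r), b.get(r)) for r in sorted(set(a) | set(b)) if a.get(r) != b.get(r)}

def sam(qname, flag, rname, pos):
    """Build a minimal 11-column SAM record for the toy example."""
    return "\t".join([qname, str(flag), rname, str(pos), "37", "4M", "*", "0", "0", "ACGT", "IIII"])

# Invented records: run 1 includes U/Extra in the reference, run 2 does not.
run_with = [sam("r1", 0, "chr2L", 100), sam("r2", 0, "chrU", 42), sam("r3", 4, "*", 0)]
run_without = [sam("r1", 0, "chr2L", 100), sam("r2", 16, "chr3R", 500), sam("r3", 0, "chrX", 7)]

diff = changed_reads(run_with, run_without)
for read, (p_with, p_without) in diff.items():
    print(read, p_with, p_without)
```

Reads that move between runs (r2 here) or that are only placed in one run (r3) are the ones worth a closer look.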


Thanks @casey and @brentp: I am trying both on a test set and will see how the choice is reflected in the results.


I am trying to map using "bwa". Is there a way to get behaviour similar to "bowtie -m 1"?


I don't use bwa much, but it looks like the samse and sampe commands have a -n parameter that does something close to that.
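As far as I understand, -n limits how many alternative hits are reported rather than discarding reads, so a post-filter on the SAM output may be closer to "bowtie -m 1". The sketch below assumes output from bwa samse/sampe, which tags uniquely placed reads with the optional field `XT:A:U`. The function names and example records are made up for illustration.

```python
def is_unique_bwa(sam_line):
    """True if a SAM record is mapped and carries bwa's XT:A:U (unique) tag."""
    fields = sam_line.rstrip("\n").split("\t")
    if int(fields[1]) & 0x4:
        return False  # unmapped read
    # Optional tags start at column 12 (index 11) in SAM.
    return "XT:A:U" in fields[11:]

def filter_unique(sam_lines):
    """Yield header lines and uniquely mapped records only."""
    for line in sam_lines:
        if line.startswith("@") or is_unique_bwa(line):
            yield line

def record(qname, flag, rname, pos, *tags):
    """Build a minimal SAM record with optional tags for the toy example."""
    core = [qname, str(flag), rname, str(pos), "37", "4M", "*", "0", "0", "ACGT", "IIII"]
    return "\t".join(core + list(tags))

# Invented records: r1 unique, r2 repeat, r3 unmapped.
demo_sam = [
    "@HD\tVN:1.0",
    record("r1", 0, "chr2L", 100, "XT:A:U", "X0:i:1"),
    record("r2", 0, "chr3R", 500, "XT:A:R", "X0:i:3"),
    record("r3", 4, "*", 0),
]

kept_lines = list(filter_unique(demo_sam))
kept_reads = [l.split("\t")[0] for l in kept_lines if not l.startswith("@")]
print(kept_reads)
```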


thanks - @brentp


One of the things to watch out for when analyzing D. mel U sequences is that they contain non-fly bacterial DNA from sequencing plasmids. See the following post for thoughts on this latent problem.


Uextra is especially problematic. The release notes read:

"we have not excluded scaffolds which may be redundant with euchromatic or other heterochromatic regions. Nor can we exclude the possibility of contaminations from other organisms.

We are making this data available as a resource for analysis of region which cannot be assembled well, such as satelites or simple repeats.

Since some of this data is low quality, researchers are encouraged to contact either BDGP or DHGP for further details on this resource."
