Question: How To Decide Whether To Keep: Het, U And Extra For Mapping Drosophila Genome
5
9.6 years ago by
Rm8.0k
Danville, PA
Rm8.0k wrote:

I am processing Illumina reads from many lanes. We are mainly interested in studying SNPs, recombination, etc. on chromosomes 2L, 2R, 3L, 3R, 4 and X. I have a basic question about mapping reads to the Drosophila genome: do I need to include the Het, U and Extra sequences when mapping, or should I exclude them and map to the rest of the genome? How does this choice affect the results? I would like your thoughts for or against.

modified 6.3 years ago by Biostar ♦♦ 20 • written 9.6 years ago by Rm8.0k
9
9.6 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

The U (for unmapped) and Extra sequences are a mixture of unmapped heterochromatic scaffolds from the D. melanogaster whole genome shotgun assemblies of the y; cn, bw, sp strain. The Het sequences are heterochromatic scaffolds from the WGS whose sequences have been mapped to the Y chromosome or to extend the euchromatic chromosome arms (2L, 2R, 3L, 3R, 4 and X) and in some cases their sequence has been improved by BAC/plasmid sequencing. For more information see: http://www.fruitfly.org/sequence/README.RELEASE5

I suggest including the Het scaffolds in addition to the euchromatic arms in your mapping, since these reference sequences have been mapped/finished/annotated and contain known genes. However, as these scaffolds contain a high repeat abundance, mapping to these scaffolds may be tricky. See the following articles for more information: http://www.sciencemag.org/cgi/content/full/316/5831/1625 & http://www.sciencemag.org/cgi/content/full/316/5831/1586
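One way to act on this suggestion is to build a reduced reference that keeps only the euchromatic arms and the Het scaffolds before indexing. The sketch below is a minimal illustration, not a tested pipeline: the sequence names in `ARMS` and `HET` are assumptions and must be checked against the headers of your own FASTA file (UCSC-style files, for instance, prefix names with `chr`), and `select_records` is a hypothetical helper.

```python
from typing import Iterable, Iterator, Set

# Assumed sequence names; verify against the headers of your own FASTA file.
ARMS = {"2L", "2R", "3L", "3R", "4", "X"}
HET = {"2LHet", "2RHet", "3LHet", "3RHet", "XHet", "YHet"}


def select_records(lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield the FASTA lines of records whose name is in `keep`.

    The record name is taken to be the first whitespace-delimited word
    after '>' on the header line.
    """
    keeping = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            keeping = bool(parts) and parts[0] in keep
        if keeping:
            yield line


if __name__ == "__main__":
    demo = [">2L\n", "ACGT\n", ">U\n", "TTTT\n", ">2LHet desc\n", "GGCC\n"]
    # Keeps 2L and 2LHet, drops U.
    print("".join(select_records(demo, ARMS | HET)), end="")
```

With a real file, the same function can be fed an open file handle and its output written to a new FASTA, which is then indexed with the aligner of your choice.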

As an aside, the U sequences also include a near-complete version of the y; cn, bw, sp mitochondrial genome. The mitochondrial genome served by UCSC (chrM) is from a different strain, see: http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=9554

modified 9.6 years ago • written 9.6 years ago by Casey Bergman18k
4
9.6 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

Use all of them. If you want to know where a read came from, the only way to do that is to give the aligner all possible references (given constraints on how fully sequenced that reference is). Many aligners will allow you to discard reads that map to more than 1 (or N) genomic locations (bowtie's -m 1 or gsnap's --npaths 1, for example) or to "probabilistically" (or uniformly) align reads that map to multiple locations in the reference. Both of those features will be more correct given the full available reference.

As with most things, there's no substitute for trying it out. Try the alignment once including the sequences you mention, and once without, and find the differences--those might actually be interesting sites to look at more closely.
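The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both runs are plain-text SAM files from single-end reads (so each read name appears once), and `placements` and `changed_reads` are hypothetical helper names, not part of any aligner.

```python
from typing import Dict, Iterable, Optional, Tuple

Placement = Optional[Tuple[str, int]]  # (reference name, 1-based position) or None if unmapped


def placements(sam_lines: Iterable[str]) -> Dict[str, Placement]:
    """Map each read name to where it was placed in one SAM file."""
    out: Dict[str, Placement] = {}
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        name, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        # Flag bit 0x4 marks an unmapped read in the SAM format.
        out[name] = None if flag & 0x4 else (rname, pos)
    return out


def changed_reads(
    full: Dict[str, Placement], reduced: Dict[str, Placement]
) -> Dict[str, Tuple[Placement, Placement]]:
    """Return reads whose placement differs between the two runs."""
    shared = full.keys() & reduced.keys()
    return {n: (full[n], reduced[n]) for n in shared if full[n] != reduced[n]}
```

Reads that land on a Het, U or Extra sequence in the full run but on a euchromatic arm in the reduced run are the ones most likely to create spurious SNP calls, so they are a reasonable place to start looking.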

written 9.6 years ago by brentp23k

Thanks @casey and @brentp. I am trying both on a test set and will see how the choice is reflected in the results.

written 9.6 years ago by Rm8.0k

I am trying to map using "bwa". Is there a way to get behaviour similar to "bowtie -m 1"?

written 9.6 years ago by Rm8.0k

I don't use bwa much, but it looks like the samse and sampe commands have a -n parameter that does something close to that.
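An alternative that is sometimes used is to filter the SAM output after mapping. The sketch below assumes output from `bwa aln` followed by `samse`, which annotates alignments with an `X0:i:` tag giving the number of best hits; other aligners, and other bwa modes, may not write this tag. Keeping only reads with `X0:i:1` approximates `bowtie -m 1` but is not identical to it, and `unique_best_hits` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def unique_best_hits(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield SAM header lines plus alignments that have exactly one best hit.

    Assumes the aligner writes an X0:i:<count> optional tag (as bwa aln/samse
    does); alignments without the tag are dropped.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep the header untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & 0x4:
            continue  # unmapped read
        # Optional tags start at the 12th column.
        if "X0:i:1" in fields[11:]:
            yield line
```

Because the filter runs after mapping, the aligner still sees the full reference, which is what makes the uniqueness call meaningful.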

written 9.6 years ago by brentp23k

Thanks, @brentp.

written 9.6 years ago by Rm8.0k

One of the things to watch out for when analyzing D. mel U sequences is that they contain non-fly bacterial DNA from sequencing plasmids. See the following post for thoughts on this latent problem.

modified 8 months ago by RamRS27k • written 9.6 years ago by Casey Bergman18k

Uextra is especially problematic. The release notes read:

"we have not excluded scaffolds which may be redundant with euchromatic or other heterochromatic regions. Nor can we exclude the possibility of contaminations from other organisms.

We are making this data available as a resource for analysis of region which cannot be assembled well, such as satelites or simple repeats.

Since some of this data is low quality, researchers are encouraged to contact either BDGP or DHGP for further details on this resource."

written 6.3 years ago by pawelsm10