Took these two screenshots recently, thought you all might enjoy them. Shows our evolving understanding of the complexity of the human genome at the population-scale, as reflected in the changes in the reference genome between hg19 and hg38.
(ignore the temp folder, not part of hg19)
Note: I make no claims as to actual accuracy or completeness of these two directory listings, I haven't taken a close look at them in a while. But they're interesting from a big-picture perspective.
hg19 / GRCh37: 2009
hg38 / GRCh38: 2013
By doing this you're creating false mapping positives. A read that could have been 100% mapped to chrUn_xxx_random could be mapped 90% to 'chr2' -> false positive called.
I think you might be misunderstanding the point of the filter. It's to remove human reads from plant and fungal sequencing projects. Ignoring or masking parts of the human reference has absolutely no chance of creating false positives, only false negatives. In this case, a false positive is a fungal read that gets removed because it maps to the human reference - which would yield an incomplete assembly - and a false negative is a human read that passes through the filter. False negatives can be cleaned up later by mapping the assembly to nt, but the earlier you can remove contamination, the better.
You need to show proof to state such strong words. I have collaborated with GRC a few times. They are extremely careful about sequences added to the human reference genome. They routinely blasted against nt to check contaminations and are very familiar with GenBank. They told me which individual fosmid/bac are likely to be mislabeled as human. Given this, I can hardly believe contamination is a severe problem with the human reference genome. Actually if you check read mapping, they are mostly covered by human reads. Are you saying most human samples are contaminated with fungi? If your evidence is just fungi hits, they could be low-complexity sequences. Unplaced/unlocalized/alt sequences tend to be enriched with segmental duplications, repeats, satellites and low-complexity sequences. Because of this, many summary statistics on these sequences are very different from chromosomal sequences. Have you checked your fungi hits are not low-complexity? Are you sure they are truly not human contaminations?
If you have found a single piece of sequence that is susceptible to contamination, let GRC know. They will take it very seriously.
Yes, I verified quite thoroughly that the contaminant sequences were long (several hundred bp), high-complexity, non-ribosomal, similar only to other fungal sequences but no vertebrate sequences, and exactly matched fungi that are known to infect humans.
It's not like the non-primary sequences are all entirely contaminant, or mostly contaminant; I would not say that contamination is a severe problem with the human reference. "A lot of the stuff in the unplaced contigs is junk" was probably an overstatement; I'll change that to "I found multiple unambiguously fungal sequences in the non-primary files and lost confidence in being able to use them for a filter that couldn't have any false positives." Each one took a long time to figure out whether it was a coincidental match, or human contamination in a fungal reference, or fungal contamination in the human reference, or both human and fungal references contaminated with a synthetic sequence - all three scenarios occurred.
Anyway, you're welcome to replicate this, it's just a huge amount of work. Mask the human reference to remove low complexity and ribosomal elements, download all fungal assemblies from Mycocosm, shred them into 100bp pieces, map them to human requiring ~98% identity, and look for where multiple shreds from the same fungus hit consecutively. Then blast those shreds, and either they will hit human, chimp, orangutan, (etc) and possibly the fungus in question - in which case it's human contamination of the fungal assembly - or they'll hit only human and a bunch of funguses, but no other vertebrates - in which case it's fungal contamination in the human genome.
I am not sure your procedure is conclusive. There are only a-few-genome-worth of primate sequences in nt. It is not hard to find some megabases of true human sequences having no hits to other primates. If your Mycocosm samples are contaminated by these human-specific sequences, you will find high identity to human genome but poor alignment to primates. As I said, non-chromosomal sequences tend to be duplications and repeats. Probably they also tend to be different from other primates. Then you will see non-chromosomal sequences enriched with "contaminations". A better approach to confirm these potential contaminations is to check whether they are present in other human resequencing data. If they are present, they are probably true human sequences, on the assumption that Mycocosm are not common contaminations in human resequencing.
Could you paste a couple of such sequences here? I can help to check if they are present in other human data. I am also pretty sure that GRC are eager to chase and fix all known contaminations in the human reference genome.
Oh, interesting. There's still a lot of alt-ref-loci that are actually placed on chromosomes, even if you exclude all the ChrUn's -- have you found those to be suspect? For instance, my understanding was that for places like the KIR regions, hg19 smooshed them all into one-ish region, and hg38 instead let them exist as
alt_ref_loci
, which helps explain what the two screenshots are showing. See http://www.ncbi.nlm.nih.gov/gene/3803 (section title: RefSeqs of Annotated Genomes: Homo sapiens Annotation Release 107)I didn't really spend a lot of time figuring out exactly which ones might be reliable, but the unplaced ones were definitely the worst.