Question: Should I keep up with random contigs in human genome
4 months ago by A (3.7k) wrote:


I have a large ANNOVAR annotation of my SNVs and indels from whole-genome sequencing.

In the chromosome column I have chr1 to chr22 plus other contigs, such as:

> unique(anno_maf$Chromosome)
 [1] "chr1"                  "chr6"                 
 [3] "chr16"                 "chr17"                
 [5] "chr20"                 "chr2"                 
 [7] "chr3"                  "chr4"                 
 [9] "chr14"                 "chr19"                
[11] "chr5"                  "chr10"                
[13] "chr9"                  "chr12"                
[15] "chr13"                 "chr11"                
[17] "chr22"                 "chr7"                 
[19] "chr15"                 "chr8"                 
[21] "chr18"                 "chr21"                
[23] "chrX"                  "chrY"                 
[25] "chr4_gl000194_random"  "chr17_gl000205_random"
[27] "chrUn_gl000241"        "hs37d5"               
[29] "chrUn_gl000219"        "chrUn_gl000234"       
[31] "chr1_gl000191_random"  "chrUn_gl000211"       
[33] "chrUn_gl000224"        "chrUn_gl000225"       
[35] "chr17_gl000203_random" "chrUn_gl000212"       
[37] "chrUn_gl000243"        "chrUn_gl000214"       
[39] "chrM"                  "chr1_gl000192_random" 
[41] "chr7_gl000195_random"  "chrUn_gl000232"       
[43] "chr4_gl000193_random"  "chr19_gl000208_random"
[45] "chrUn_gl000226"        "chrUn_gl000218"       
[47] "chr9_gl000199_random"  "chrUn_gl000217"       
[49] "chrUn_gl000229"        "chrUn_gl000216"       
[51] "chrUn_gl000231"        "chr9_gl000198_random" 
[53] "chr17_gl000204_random" "chrUn_gl000220"       
[55] "chrUn_gl000235"        "chr11_gl000202_random"
[57] "chrUn_gl000222"        "chrUn_gl000240"       
[59] "chrUn_gl000233"        "chrUn_gl000230"       
[61] "chrUn_gl000213"        "chrUn_gl000238"       
[63] "chr19_gl000209_random" "chrUn_gl000237"       
[65] "#CHROM"

In your experience, should I ignore everything other than chr1 to chr22 in my analysis? For instance, if my goal is to compare somatic variants and copy number changes between two conditions, does it make sense to analyze only chr1 to chr22 and ignore the random or poorly annotated parts of the genome? Does that hurt at all?

somatic variation wgs vcf • 248 views
modified 4 months ago by WouterDeCoster (43k) • written 4 months ago by A (3.7k)

Based on my understanding, the sex chromosomes (X and Y) and chr1 to chr22 are the preferred chromosomes for alignment. Please refer to the following links:

modified 4 months ago • written 4 months ago by cpad0112 (13k)
4 months ago by WouterDeCoster (43k) wrote:

It is best to use as complete a genome as possible for alignment, which means you should include the unplaced contigs. That way, reads that belong to those contigs are not forced onto the "real" chromosomes, so you avoid false-positive alignments there and the false-positive variants that come with them. See also this blog post from Heng Li to learn what matters and which reference genome you should use:

But after alignment and variant calling it is safe to ignore those contigs. Of course, variants on them can be real and have real consequences, but you probably shouldn't trust them too much and may want to focus on the better-understood and better-annotated "real" chromosomes.
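As a rough illustration of that post-calling filter, here is a minimal Python sketch (standard library only) that sorts the contig names from the question into coarse categories and keeps only the primary chromosomes. The helper names `classify_contig` and `keep_canonical` and the toy rows are made up for this example; only the `Chromosome` column name and the contig names come from the question. The naming patterns assume the UCSC-style hg19 names plus the `hs37d5` decoy shown above, so they would need adjusting for another reference.

```python
import re
from collections import Counter

# Primary chromosomes in the "chr"-prefixed naming used in the question.
CANONICAL = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}


def classify_contig(name):
    """Return a coarse category for a contig name (hypothetical helper)."""
    if name in CANONICAL:
        return "canonical"
    if name == "chrM":
        return "mitochondrial"
    if name == "hs37d5":
        return "decoy"
    # Unlocalized: assigned to a chromosome, position unknown, e.g. chr4_gl000194_random
    if re.fullmatch(r"chr[0-9XY]+_gl\d+_random", name):
        return "unlocalized"
    # Unplaced: not assigned to any chromosome, e.g. chrUn_gl000241
    if re.fullmatch(r"chrUn_gl\d+", name):
        return "unplaced"
    # Anything else, e.g. a stray "#CHROM" header line read in as data
    return "other"


def keep_canonical(rows, column="Chromosome"):
    """Keep only rows whose contig is one of chr1..chr22, chrX, chrY."""
    return [row for row in rows if row[column] in CANONICAL]


# Toy rows standing in for the annotated table (illustrative data only).
rows = [
    {"Chromosome": "chr1", "id": "v1"},
    {"Chromosome": "chr17_gl000205_random", "id": "v2"},
    {"Chromosome": "chrUn_gl000241", "id": "v3"},
    {"Chromosome": "hs37d5", "id": "v4"},
    {"Chromosome": "chrX", "id": "v5"},
    {"Chromosome": "#CHROM", "id": "v6"},
]

# How many rows fall in each category, i.e. how much the filter would drop.
counts = Counter(classify_contig(row["Chromosome"]) for row in rows)
kept = keep_canonical(rows)

print(counts)
print([row["id"] for row in kept])
```

Counting per category before filtering shows how much would be discarded. Whether chrM or the sex chromosomes belong in the kept set depends on the analysis, so treat `CANONICAL` as a starting point rather than a fixed rule.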

written 4 months ago by WouterDeCoster (43k)

Hi Wouter. Do you know of any paper discussing or benchmarking the impact on variant calling of using hg38 (or hg19) with or without the unplaced and unlocalized sequences?

written 4 months ago by Nicolas Rosewick (8.8k)

I'll think about it. Not exactly what you are asking for, but this paper comes to mind: Ameur et al. 2018. Short summary: they de novo assembled two genomes and found contigs missing from the reference genome. Including those contigs had a large impact on short-read alignment and variant calling in the rest of their cohort, suggesting that an incomplete reference creates both false-positive and false-negative variants.

written 4 months ago by WouterDeCoster (43k)

I think the paper's summary is in these two lines: "Here, we performed a de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual." The conclusion of the paper is that the GRCh38 reference is not yet complete, and it demonstrates that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data. (Conclusions are copied from the manuscript abstract.)

The authors discovered over 10 Mb of sequence in the Swedish genomes that is not part of GRCh38, part of which is shared with sequence found by Chinese genome efforts.

modified 4 months ago • written 4 months ago by cpad0112 (13k)