Question

Should I keep the random contigs in the human genome?

1


4.3 years ago

zizigolu ★ 4.3k

Hi

I have a large ANNOVAR annotation of my SNVs and indels from whole-genome sequencing.

In the chromosome column I have chr1 to chr22 and other contigs, such as:

> unique(anno_maf$Chromosome)
 [1] "chr1"                  "chr6"                 
 [3] "chr16"                 "chr17"                
 [5] "chr20"                 "chr2"                 
 [7] "chr3"                  "chr4"                 
 [9] "chr14"                 "chr19"                
[11] "chr5"                  "chr10"                
[13] "chr9"                  "chr12"                
[15] "chr13"                 "chr11"                
[17] "chr22"                 "chr7"                 
[19] "chr15"                 "chr8"                 
[21] "chr18"                 "chr21"                
[23] "chrX"                  "chrY"                 
[25] "chr4_gl000194_random"  "chr17_gl000205_random"
[27] "chrUn_gl000241"        "hs37d5"               
[29] "chrUn_gl000219"        "chrUn_gl000234"       
[31] "chr1_gl000191_random"  "chrUn_gl000211"       
[33] "chrUn_gl000224"        "chrUn_gl000225"       
[35] "chr17_gl000203_random" "chrUn_gl000212"       
[37] "chrUn_gl000243"        "chrUn_gl000214"       
[39] "chrM"                  "chr1_gl000192_random" 
[41] "chr7_gl000195_random"  "chrUn_gl000232"       
[43] "chr4_gl000193_random"  "chr19_gl000208_random"
[45] "chrUn_gl000226"        "chrUn_gl000218"       
[47] "chr9_gl000199_random"  "chrUn_gl000217"       
[49] "chrUn_gl000229"        "chrUn_gl000216"       
[51] "chrUn_gl000231"        "chr9_gl000198_random" 
[53] "chr17_gl000204_random" "chrUn_gl000220"       
[55] "chrUn_gl000235"        "chr11_gl000202_random"
[57] "chrUn_gl000222"        "chrUn_gl000240"       
[59] "chrUn_gl000233"        "chrUn_gl000230"       
[61] "chrUn_gl000213"        "chrUn_gl000238"       
[63] "chr19_gl000209_random" "chrUn_gl000237"       
[65] "#CHROM"

In your experience, should I ignore everything other than chromosomes 1 to 22 in my analysis? For instance, if my goal is to compare somatic variants and copy number changes between two conditions, does it make sense to analyze only chr1 to chr22 and ignore the random or poorly annotated parts of the genome? Does that hurt at all?

WGS vcf somatic variation • 2.7k views

ADD COMMENT • link updated 4.3 years ago by WouterDeCoster 47k • written 4.3 years ago by zizigolu ★ 4.3k

1


As far as I understand, the sex chromosomes (X and Y) and chr1 to chr22 are the preferred chromosomes for alignment. Please refer to the following links:

ADD REPLY • link 4.3 years ago by cpad0112 21k

score 4 · Accepted Answer · 2020-01-19

4


4.3 years ago

WouterDeCoster 47k

It is best to align against as complete a genome as possible, which means you should include the unplaced contigs. That way, reads that originate from those contigs are not forced onto the "real" chromosomes, which avoids false-positive alignments there and, in turn, false-positive variants. See also this blog post from Heng Li to learn what matters and which reference genome you should use: https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

After alignment and variant calling, however, it is safe to ignore those contigs. Variants on them can of course be real and have real consequences, but you probably shouldn't trust them too much, and you may want to focus on the better-understood and better-annotated "real" chromosomes.
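As a minimal sketch of that post-calling filter, assuming the annotation is a data frame with a `Chromosome` column as in the question: the toy `anno_maf` below and its `Start_Position` column are hypothetical stand-ins, not the poster's real data. The code keeps chr1 to chr22 plus chrX and chrY, and tabulates what was dropped so the filter is not silent.

```r
# Hypothetical stand-in for the `anno_maf` table from the question.
anno_maf <- data.frame(
  Chromosome = c("chr1", "chr22", "chrX", "chrM", "hs37d5",
                 "chrUn_gl000241", "chr4_gl000194_random", "#CHROM"),
  Start_Position = c("100", "200", "300", "400", "500", "600", "700", "POS"),
  stringsAsFactors = FALSE
)

# Primary chromosomes to keep: autosomes plus the sex chromosomes.
# Assumption: chrM is excluded here; append "chrM" if mitochondrial
# variants matter for the analysis.
primary <- paste0("chr", c(1:22, "X", "Y"))

keep <- anno_maf$Chromosome %in% primary
anno_primary <- anno_maf[keep, , drop = FALSE]

# Count the rows that were dropped, per contig name.
dropped <- table(anno_maf$Chromosome[!keep])
print(dropped)

print(unique(anno_primary$Chromosome))
```

Filtering on an explicit whitelist rather than a pattern such as "does not contain `random`" also removes entries like `hs37d5` and `#CHROM`. The latter looks like a VCF header line that was read in as a data row, which may be worth checking at import time.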

ADD COMMENT • link 4.3 years ago by WouterDeCoster 47k

0


Hi Wouter. Do you know of any paper discussing or benchmarking the impact on variant calling of using hg38 (or hg19) with versus without the unplaced and unlocalized sequences?

ADD REPLY • link 4.3 years ago by Nicolas Rosewick 11k

1


I'll think about it. It is not exactly what you are asking for, but this paper comes to mind: Ameur et al. 2018. Short summary: they de novo assembled two genomes and found contigs missing from the reference genome. Including those contigs has a large impact on short-read alignment and variant calling in the rest of their cohort, suggesting that an incomplete reference creates both false-positive and false-negative variants.

ADD REPLY • link 4.3 years ago by WouterDeCoster 47k

1


I think the paper's summary is in these two lines: "Here, we performed a de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual." The paper concludes that the GRCh38 reference is not yet complete and demonstrates that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data. (Both are taken nearly verbatim from the manuscript abstract.)

The authors found over 10 Mb of sequence in the Swedish genomes that is not part of GRCh38, part of which is shared with sequence found by Chinese genome efforts.

ADD REPLY • link 4.3 years ago by cpad0112 21k