Question: Karyotypically Ordered Hg19
gravatar for Edoardo Giacopuzzi
9.2 years ago by
Brescia, Italy
Edoardo Giacopuzzi60 wrote:


I have a .bam file generated using the hg19 release of human genome downloaded from UCSC database. Now I'm trying to indentificate genomic variants using GATK UnifiedGenotyper but it returns an error cause my bam file is Lexicographically sorted. I've tried to use Picard ReorderSam function but I've noted that my reference genome file (hg19) is in the same lexicographic order.

Is there any way to convert my hg19 fasta file in the karyotypic order? Or is there any place where I can download a version of the reference human genome sorted in this way?


gatk reference hg • 9.5k views
ADD COMMENTlink modified 5.9 years ago by Biostar ♦♦ 20 • written 9.2 years ago by Edoardo Giacopuzzi60
gravatar for Cyriac Kandoth
6.9 years ago by
Cyriac Kandoth5.5k
Memorial Sloan Kettering, New York, USA
Cyriac Kandoth5.5k wrote:

The reference FASTAs downloadable from Ensembl are not karyotypically sorted, but that can be fixed with a little bit of Perl...

Download and unzip the Build37 (hg19) reference FASTA file from Ensembl release 72:

curl -LO
gunzip Homo_sapiens.GRCh37.72.dna.primary_assembly.fa.gz

Use this Perl one-liner that splits the FASTA file per chrom/contig into a temporary folder, and then concatenates them in karyotypic order:

perl -e 'use File::Temp qw/tempdir/; use IO::File; $d=tempdir; $fh; map{if(m/^\>(\S+)\s/){$fh=IO::File->new(">$d/$1.fa");} print $fh $_;}`cat Homo_sapiens.GRCh37.72.dna.primary_assembly.fa`; foreach $c(1..22,X,Y,MT){print `cat $d/$c.fa`}; print `cat $d/GL*`' > Homo_sapiens.GRCh37.72.dna.primary_assembly.reordered.fa
ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by Cyriac Kandoth5.5k

Hi, thanks for your code. I have numerous scaffolds that begin with KI as well as GL. eg 26 KI270728.1 1872759 3091464192 27 KI270727.1 448248 3093561344 28 KI270442.1 392061 3094085632 29 KI270729.1 280839 3094609920 30 GL000225.1 211173 3095134208 31 KI270743.1 210658 3095396352 32 GL000008.2 209709 3095658496 33 GL000009.2 201709 3095920640 34 KI270747.1 198735 3096182784 35 KI270722.1 194050 3096444928 36 GL000194.1 191469 3096707072 37 KI270742.1 186739 3096969216 38 GL000205.2 185591 3097231360 39 GL000195.1 182896 3097493504 40 KI270736.1 181920 3097755648 41 KI270733.1 179772 3098017792 42 GL000224.1 179693 3098279936 43 GL000219.1 179198 3098542080

How do I concatenate these in the correct order? I tried print cat $d/KI*&GL*' > GRCh38.reordered.fa but got an error -

GL*: not found

ADD REPLYlink written 3.6 years ago by zoegward100

Anything within backticks are simply shell commands that run and return with their standard output for Perl to parse. So in this case, you just add one more argument to the Unix cat command. Like this - cat $d/GL* $d/KI*

Btw, I don't recommend using in-line Perl in a production workflow. Besides, good bioinformatics tools should be able to handle FASTAs without karyotypically ordered chromosomes.

ADD REPLYlink written 3.6 years ago by Cyriac Kandoth5.5k
gravatar for Brad Chapman
9.2 years ago by
Brad Chapman9.5k
Boston, MA
Brad Chapman9.5k wrote:

Broad makes one available from their FTP site:

(Just hit enter if asked for a password)

They also have hg18 plus lots of other supporting data:

ADD COMMENTlink written 9.2 years ago by Brad Chapman9.5k
gravatar for Pierre Lindenbaum
9.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

you can create your human genome by running the following shell:

for C in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
   curl "${C}.fa.gz" |\
   gunzip -c >> hg19.fa
ADD COMMENTlink written 9.2 years ago by Pierre Lindenbaum130k

Please, please include unlocalized and unplaced contigs. They are part of the primary human assembly.

ADD REPLYlink written 9.2 years ago by lh332k

How does GATK define 'karyotypic' sorting? Where should unlocalised contigs etc be placed? The question is also relevant when using other naming conventions e.g. Ensembl.

ADD REPLYlink written 9.2 years ago by Travis2.8k

Can I please reask Travis's question? For some reason Ensembl releases their 'toplevel' genome file in a format that sorts the chromosomes, patches and unlocalised contigs in order of ascending length which, depending on the definition of 'karyotypic' sorting, may (or may not) be the opposite of what GATK requires.

ADD REPLYlink written 8.9 years ago by blackgore60

open a new question, citing that one.

ADD REPLYlink written 8.9 years ago by Pierre Lindenbaum130k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 798 users visited in the last hour