Question: Request for karyotypically sorted Ensembl reference fasta and its dbSNP vcf for GATK workflow
3.6 years ago, umn_bist wrote:

I have RNA-seq BAM files on which I need to call somatic variants. The problem is that GATK is very strict about how the BAM is formatted (karyotypically sorted, no 'chr' notation, read group).

Because my BAM file was aligned against the Ensembl reference, I keep running into validation errors. For example, I have to change the chromosome notation in the header, which I am hesitant to do after many failures (samtools view --> sed --> reheader), and I am stuck on this error as well:

"Discordant contig lengths: read MT LN=16571, ref MT LN=16569" (note that I was referencing against GATK's homo sapiens hg19 reference)
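A quick way to see every contig where the BAM header and the reference disagree, rather than hitting them one validation error at a time, is to compare the `@SQ` lines from `samtools view -H` with the `.fai` index written by `samtools faidx`. Below is a minimal sketch under those assumptions; the function names and the toy header and index text are made up for illustration. A 16571 vs 16569 difference on the mitochondrial contig is what one would expect if one side used the older UCSC hg19 chrM sequence and the other the rCRS sequence, but that is a guess about these particular files.

```python
def sq_lengths(header_text):
    """Map contig name -> length from the @SQ lines of a SAM/BAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:VALUE.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def fai_lengths(fai_text):
    """Map contig name -> length from the first two columns of a .fai index."""
    lengths = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def discordant_contigs(bam_lengths, ref_lengths):
    """List (name, bam_len, ref_len) for shared contigs whose lengths differ."""
    return [
        (name, length, ref_lengths[name])
        for name, length in bam_lengths.items()
        if name in ref_lengths and ref_lengths[name] != length
    ]


# Toy inputs (hypothetical): in practice, read these from
# `samtools view -H sample.bam` and from `reference.fa.fai`.
header = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:MT\tLN:16571\n"
fai = "1\t249250621\t52\t60\t61\nMT\t16569\t253404903\t60\t61\n"

print(discordant_contigs(sq_lengths(header), fai_lengths(fai)))
# prints [('MT', 16571, 16569)]
```

If a contig shows up here, renaming it in the header will not help, because the underlying sequences differ; the reads on that contig would need to be realigned against the reference GATK is given.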

Does anyone have an Ensembl reference and its corresponding dbSNP VCF usable with GATK? I can access the Ensembl FTP site, but I am quite lost as to which files are the right ones. Thank you very much for your help.

ensembl rna-seq dbsnp grch37 gatk • 1.5k views
modified 3.6 years ago by Emily_Ensembl • written 3.6 years ago by umn_bist

See  You may want the "toplevel" version.  

modified 3.6 years ago • written 3.6 years ago by Sean Davis

I downloaded Homo_sapiens.GRCh37.75.dna.toplevel.fa but it is lexicographically sorted.
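For anyone in the same spot, the lexicographic order (1, 10, 11, ..., 2, ...) can be turned into a karyotypic one (1 to 22, X, Y, MT, then everything else) with a small reordering step. This is only a sketch: the helper names are hypothetical, the order of the non-primary contigs is simply left as found, and it holds every record in memory, so for a full genome it may be more practical to compute the name order this way and then extract the contigs in that order with `samtools faidx`.

```python
# Primary chromosomes in the order assumed here: 1..22, X, Y, MT.
PRIMARY = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
RANK = {name: i for i, name in enumerate(PRIMARY)}


def karyotypic_order(names):
    """Primary chromosomes first, in karyotypic order; all other contigs
    afterwards in their original relative order (sorted() is stable)."""
    return sorted(names, key=lambda name: RANK.get(name, len(PRIMARY)))


def read_fasta(text):
    """Map contig name (first word of the header) -> list of record lines."""
    records = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = [line]
        elif name is not None:
            records[name].append(line)
    return records


def reorder_fasta(text):
    """Return the FASTA text with records rearranged karyotypically."""
    records = read_fasta(text)
    ordered = karyotypic_order(list(records))
    return "\n".join("\n".join(records[name]) for name in ordered) + "\n"


# Toy FASTA (hypothetical sequences) in lexicographic order.
toy = ">1 dna\nAC\n>10 dna\nGT\n>2 dna\nTT\n>GL000191.1 dna\nNN\n>MT dna\nCC\n>X dna\nGG\n"

print(karyotypic_order(list(read_fasta(toy))))
# prints ['1', '2', '10', 'X', 'MT', 'GL000191.1']
```

After reordering, the `.fai` index and the sequence dictionary have to be regenerated (for example with `samtools faidx` and Picard `CreateSequenceDictionary`), and the BAM and dbSNP VCF need to follow the same contig order.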

written 3.6 years ago by umn_bist

If you really want something that requires no work to get working with GATK, you can download the GATK resource bundle.

Your choice of reference(s) will be limited, though.

written 3.6 years ago by Sean Davis