Question: Human DNA reference file without the 'chr' prefix
7
4.8 years ago by
mangfu100710
Korea, Republic Of
mangfu100710 wrote:

Hi.

I usually use the hg19 reference file from UCSC.

However, the UCSC files use a 'chr' prefix in their sequence names.

Of course, I could strip the 'chr' prefix with a script, but I would rather not.

Instead, I need a human reference file that does not have the 'chr' prefix in the first place.

Are there any reference files other than the UCSC ones?

 

sequencing next-gen • 12k views
ADD COMMENTlink modified 3.8 years ago by Biostar ♦♦ 20 • written 4.8 years ago by mangfu100710

The GATK bundle has what you need 

 

https://www.broadinstitute.org/gatk/download

ADD REPLYlink written 4.8 years ago by Zev.Kronenberg11k

Is there a reason why you want to bypass something as simple as sed 's/chr//g' ucsc.hg19.fa > ucsc.hg19.nochr.fa ?

ADD REPLYlink written 4.8 years ago by RamRS23k
1

I think that simply removing 'chr' could cause problems in later processing steps.

I also thought that the reference file used by the mapping program (such as bwa) has to match the one used by downstream programs such as variant callers.

ADD REPLYlink written 4.8 years ago by mangfu100710
14
4.8 years ago by
Cyriac Kandoth5.3k
Memorial Sloan Kettering, New York, USA
Cyriac Kandoth5.3k wrote:

It's never as simple as "remove chr-prefix with a script". hg19 is UCSC's variant of the official GRCh37 assembly. Early releases of GRCh37, like GRCh37-lite, did not use chr-prefixes, but newer releases like GRCh37.p13 adopted the chr-prefix and use a newer mitochondrial (MT) sequence than hg19 does. Note also how chrM in hg19 is named MT in GRCh37. And all the unplaced contigs have very different names. So simply removing the chr-prefix in hg19 does not make it GRCh37. It produces a wholly different chromosome naming convention, which is the last thing we need right now.

Update (Nov 4, 2016): Here is a UCSC Chain mapping UCSC's hg19 to Ensembl's GRCh37.p13 (no chr-prefix), compatible with tools like CrossMap, Remap, or liftOver. Notice how all chromosomes/contigs except chrM only require renaming. Users of vcf2maf with hg19 VCFs as input can pass this into the --remap-chain argument.
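To make the "only require renaming" point concrete, here is a rough sketch of renaming the first column of a BED file through an explicit lookup table instead of a blanket prefix strip. It does not reproduce the chain file mentioned above: the mapping rows, the function names (make_demo_map, rename_bed), and the demo records are all made up for illustration. chrM is left out of the table on purpose, since its sequence differs between hg19 and GRCh37 and a rename alone would be wrong.

```shell
#!/bin/sh
# Hypothetical two-column, tab-separated mapping: hg19 name -> GRCh37 name.
# Only a few illustrative rows; a real table would list every contig.
# chrM is deliberately absent because its sequence differs from GRCh37's MT.
make_demo_map() {
    printf 'chr1\t1\nchr2\t2\nchrX\tX\n'
}

# rename_bed MAP < in.bed > out.bed
# Rewrites column 1 using MAP. Records whose name is not in MAP are
# dropped and reported on stderr rather than passed through silently.
rename_bed() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { map[$1] = $2; next }          # first file: load the mapping
        ($1 in map) { $1 = map[$1]; print; next } # known name: rename and emit
        { print "skipped (no mapping): " $1 > "/dev/stderr" }
    ' "$1" -
}

# Demo on a tiny in-memory BED (made-up coordinates).
map_file=$(mktemp)
make_demo_map > "$map_file"
printf 'chr1\t100\t200\tgeneA\nchrM\t5\t50\tgeneB\nchrX\t7\t9\tgeneC\n' |
    rename_bed "$map_file"
rm -f "$map_file"
```

Dropping unmapped records loudly is a design choice for this sketch. Depending on the pipeline, failing outright on the first unknown name may be the safer behaviour. Note that BED track or browser header lines would also be reported as unmapped here.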

ADD COMMENTlink modified 18 months ago • written 4.8 years ago by Cyriac Kandoth5.3k

"It's never as simple as 'remove chr-prefix with a script'." That doesn't make sense, because sometimes it really is that simple. There have been plenty of occasions where a simple find/replace is all I've needed to do to get a BED file working in tandem with a BAM file in a downstream-analysis step (after ensuring I was indeed working from the same reference).

My pedantry aside, the info you provided was very useful.
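For the "ensuring I was working from the same reference" step, a quick sanity check could look something like the sketch below. It lists BED chromosome names that are absent from a list of reference sequence names, for example column 1 of a samtools faidx .fai index. The function name missing_contigs and the demo data are invented for this example. Matching names alone does not prove the underlying sequences are identical, so treat this as a first check only.

```shell
#!/bin/sh
# missing_contigs NAMES_FILE < in.bed
# NAMES_FILE: tab-separated, reference sequence name in column 1
# (the layout of a .fai index). Prints each BED chromosome name that the
# reference does not contain, once; empty output means the names agree.
missing_contigs() {
    awk -F '\t' '
        NR == FNR { ref[$1]; next }                  # first file: reference names
        !($1 in ref) && !seen[$1]++ { print $1 }     # report each unknown name once
    ' "$1" -
}

# Demo: a GRCh37-style name list against a BED that mixes both conventions.
ref_names=$(mktemp)
printf '1\t249250621\n2\t243199373\nMT\t16569\n' > "$ref_names"
printf 'chr1\t100\t200\n2\t5\t50\nchrM\t7\t9\nchr1\t300\t400\n' |
    missing_contigs "$ref_names"
rm -f "$ref_names"
```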

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Dan D6.8k

Agreed... it's often simple to just "get something working" the night before the lab meeting. ;) But it's a whole other ballgame when you're trying to develop robust bioinformatics pipelines that can handle every pedantic detail.

ADD REPLYlink written 4.8 years ago by Cyriac Kandoth5.3k
14
4.8 years ago by
Dan D6.8k
Tennessee
Dan D6.8k wrote:

Why would you go through all the trouble of finding/downloading/cat'ing another reference? Getting rid of that chr designation is dead simple. 

For FASTA files:

cat hg19.fa | sed 's/>chr/>/g' > hg19_new.fa

For BED files:

cat myhg19Genes.bed | sed 's/^chr//' > myhg19Genes_new.bed

For BAM files:

samtools view -h hg19Alignments.bam | sed 's/chr//g' | samtools view -Shb - -o hg19Alignments_new.bam

 

ADD COMMENTlink written 4.8 years ago by Dan D6.8k
1

For BAM files, it might be faster to modify the header with sed, write that to a file, and then use samtools reheader (there's no need to then parse everything to/from sam). Of course, that doesn't fit as nicely into a single command as your solution :)

ADD REPLYlink written 4.8 years ago by Devon Ryan91k
1

Modifying only the header is also safer, so that you don't risk truncating base quality values that happen to form the word chr (very unlikely, but still possible). But even in the header, you risk changing comment lines in the @CO tag, or file names or commands in the @PG tag, containing words like "chrom", "chromosome", or "jesus christ! why is this so complicated?!"
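Putting the two comments above together, a header-only rename that touches nothing but the SN: field of @SQ lines might be sketched like this. The function names (strip_chr_sq, reheader_nochr) and the temporary header file name are placeholders, and the wrapper assumes samtools is on the PATH. Like the plain sed approach, it turns chrM into M rather than MT, so it carries the same naming caveats discussed in this thread.

```shell
#!/bin/sh
# strip_chr_sq: reads a SAM header on stdin and removes a leading "chr"
# only from the SN: field of @SQ lines. @PG, @CO, and all other lines
# pass through unchanged.
strip_chr_sq() {
    awk -F '\t' -v OFS='\t' '
        /^@SQ/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:chr/) sub(/^SN:chr/, "SN:", $i)
        }
        { print }
    '
}

# Hypothetical wrapper: reheader_nochr in.bam out.bam
# Extracts the header, rewrites it, and swaps it in without converting
# the alignments to SAM and back. Defined here but not called in the demo.
reheader_nochr() {
    samtools view -H "$1" | strip_chr_sq > "$1.header.sam" &&
        samtools reheader "$1.header.sam" "$1" > "$2"
}

# Demo on a hand-written header; this part needs no samtools.
printf '@SQ\tSN:chr1\tLN:249250621\n@PG\tID:bwa\tCL:bwa mem chr_ref.fa\n@CO\tchromosome notes\n' |
    strip_chr_sq
```

Only the header changes in this approach. Text fields inside the alignment records that spell out reference names are left as they were, which may or may not matter downstream.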

ADD REPLYlink written 4.8 years ago by Cyriac Kandoth5.3k