Question: human genome toplevel files
4.2 years ago by
biolab1.1k wrote:

Hi, everyone,

I am new to working with human genome files. I found that there are three versions of the whole human genome files on Ensembl (toplevel, hard-masked toplevel, and soft-masked toplevel). The file sizes are quite different. Could anyone please briefly describe the differences?

In addition, where can I download the gff file? The above site has a gff file with many regulatory features, e.g., histone methylation. I only need exon, intron, and UTR information. Thank you very much!

4.2 years ago by
Devon Ryan89k (Freiburg, Germany) wrote:

The sizes differ only because of file compression: uncompressed, all three files contain the same number of bases. "Masking" refers to marking a region of a sequence in some way. Typically, this is done for repeats and low-complexity regions so that some aligners (e.g., blast) can avoid them. There are two ways to mask a region in a fasta file. First, one can write its bases in lower case (e.g., "acgt") rather than upper case (e.g., "ACGT"). This is called soft-masking. Second, one can instead replace every base in a repetitive/low-complexity region with an N, which is termed hard-masking.

For most cases you'll want to use either the soft-masked or the unmasked reference files. If you're using tools like BWA, tophat or bowtie2 (i.e., almost anything meant to handle NGS data), then the results from a soft-masked and an unmasked reference will be identical, since most of these tools simply ignore a base's case. However, should you ever need a tool that accounts for masking, having a soft-masked genome already downloaded is convenient. For that reason, I personally tend to download the soft-masked versions so that I never have to bother downloading them later.
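
To make the two masking styles concrete, here is a minimal Python sketch that derives an unmasked and a hard-masked sequence from a soft-masked one. The function names `unmask` and `hard_mask` and the toy sequence are made up for illustration; this is not how Ensembl produces its files, only a picture of what the three versions look like relative to each other.

```python
def unmask(seq: str) -> str:
    """Drop soft-masking by converting every base to upper case."""
    return seq.upper()


def hard_mask(seq: str) -> str:
    """Turn soft-masking into hard-masking: each lower-case base becomes N."""
    return "".join("N" if base.islower() else base for base in seq)


# Toy soft-masked sequence (hypothetical): the lower-case stretch is the "repeat".
soft = "ACGTacgtacgtGGCC"

print(unmask(soft))     # ACGTACGTACGTGGCC
print(hard_mask(soft))  # ACGTNNNNNNNNGGCC
```

Note that all three strings have the same length. The soft-masked version is the only one from which the other two can be recovered, which is another reason to prefer downloading it.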


Hi Devon,

Your explanation is really helpful. I have just one more question: you mentioned that using a soft-masked or an unmasked genome file should have no effect on mapping (with either tophat2 or bwa), so what about a hard-masked reference? What are the side effects of using the hard-masked file? Thanks a lot!

written 4.2 years ago by biolab1.1k

It's generally a bad idea to use hard-masked files. You're not going to get alignments to stretches of N, so any read that arose from such a region may incorrectly align elsewhere. Using a hard-masked genome is therefore expected to decrease overall mapping quality. The only benefit is that you can map things a bit faster, but that's often a bad trade-off.
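
As a toy illustration of why this happens, the sketch below uses a plain exact-match search as a stand-in for an aligner (real aligners such as BWA or bowtie2 work very differently). The helper `find_exact`, the reference string and the read are all hypothetical.

```python
def find_exact(read: str, reference: str) -> list[int]:
    """Return 0-based start positions of exact, case-insensitive matches."""
    read, reference = read.upper(), reference.upper()
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Hypothetical soft-masked reference; the lower-case stretch is the masked repeat.
soft_masked = "TTGACCAGTAcgcgcgcgatATTGACCAGCAT"
hard_masked = "".join("N" if b.islower() else b for b in soft_masked)

read = "CGCGCGCGAT"  # a read that really came from the masked region

print(find_exact(read, soft_masked))  # [10]
print(find_exact(read, hard_masked))  # []
```

With the soft-masked reference the read is placed at its true origin. With the hard-masked reference that location no longer exists, so a real aligner would either leave the read unmapped or place it at the next-best (wrong) location.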

written 4.2 years ago by Devon Ryan89k

Hi Devon,

Yes, it's bad to use the hard-masked file. Thanks a lot for your detailed answer.

written 4.2 years ago by biolab1.1k