human genome toplevel files
9.2 years ago
biolab ★ 1.4k

Hi, everyone,

I am new to working with human genome files. I found that there are three versions of the whole human genome on Ensembl (ftp://ftp.ensembl.org/pub/release-78/fasta/homo_sapiens/dna/): toplevel, hard-masked toplevel, and soft-masked toplevel. The file sizes are quite different. Could anyone briefly describe the differences?

In addition, where can I download the GFF file? The above site has a GFF file with many regulatory features, e.g., histone methylation. I only need exon, intron, and UTR information. Thank you very much!

genome human • 5.5k views
9.2 years ago

The files contain the same sequence, so the sizes differ only because of how well each version compresses: masking changes individual characters, not the sequence length, and long runs of N compress far better than mixed-case sequence. "Masking" refers to marking a region of a sequence in some way. Typically, this is done for repeats and low-complexity regions so that some aligners (e.g., BLAST) can avoid them. There are two ways to mask a region in a FASTA file. First, one can write its bases in lower case (e.g., "acgt") rather than upper case (e.g., "ACGT"). This is called soft-masking. Second, one can instead replace every base in a repetitive or low-complexity region with N, which is called hard-masking.

In most cases you'll want either the soft-masked or the unmasked reference. If you're using tools like BWA, tophat or bowtie2 (i.e., almost anything meant to handle NGS data), the results from a soft-masked and an unmasked reference will be identical, because most of these tools simply ignore a base's case. However, should you ever need a tool that accounts for masking, already having a soft-masked genome downloaded is convenient. For that reason, I personally tend to download the soft-masked version so that I don't have to download it later.
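To make the relationship between the three versions concrete, here is a minimal Python sketch showing how an unmasked and a hard-masked sequence can be derived from a soft-masked one. The function names and the toy sequence are made up for illustration; this is not how Ensembl produces its files, only a demonstration that all three versions have the same length.

```python
def to_unmasked(seq: str) -> str:
    """Drop soft-masking by converting every base to upper case."""
    return seq.upper()


def to_hard_masked(seq: str) -> str:
    """Turn soft-masking into hard-masking: each lower-case base becomes N."""
    return "".join("N" if base.islower() else base for base in seq)


# Toy soft-masked sequence (hypothetical): the lower-case stretch is the "repeat".
soft = "ACGTacgtacgtTTGA"

print(to_unmasked(soft))     # ACGTACGTACGTTTGA
print(to_hard_masked(soft))  # ACGTNNNNNNNNTTGA
```

Note that the conversion only works in this direction: a hard-masked file cannot be turned back into the other two, because the original bases are gone.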

Hi Devon,

Your explanation is really helpful. I have just one more question: you mentioned that using a soft-masked or unmasked genome file should have no effect on mapping (with either tophat2 or bwa), so what about a hard-masked reference? What are the side effects of using the hard-masked file? Thanks a lot!

It's generally a bad idea to use hard-masked files. Nothing will align to stretches of N, so any read that arose from such a region may incorrectly align elsewhere. Using a hard-masked genome is therefore expected to decrease overall mapping quality. The only benefit is that mapping is a bit faster, but that's often a bad trade-off.
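To get a feel for how much of a reference would become unalignable, one can count the lower-case bases in the soft-masked file. The sketch below is only an illustration: `masked_fraction` and the tiny in-memory FASTA are hypothetical, and it assumes the hard-masked file was built from the same repeat annotation as the soft-masked one.

```python
import io


def masked_fraction(fasta_lines) -> float:
    """Return the fraction of bases written in lower case (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Hypothetical two-line record: 8 of the 20 bases are soft-masked.
example = io.StringIO(">chrDemo\nACGTacgt\nacgtACGTACGT\n")
print(masked_fraction(example))  # 0.4
```

For a real genome, the same function could be given an open file handle (for example from `gzip.open(path, "rt")`) instead of the in-memory example.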

Hi Devon,

Yes, it's bad to use the hard-masked file. Thanks a lot for your detailed answer.

I am very satisfied with the explanation. Thank you for the easy explanation.
