The sizes are only different because of file compression; once decompressed, the files contain the same number of bases.

"Masking" refers to marking a region of a sequence in some way. Typically, this is done for repeats and low-complexity regions so that some aligners (e.g., BLAST) can avoid them. There are two ways to mask a region in a FASTA file. First, one can write its bases in lower case (e.g., "acgt") rather than upper case (e.g., "ACGT"). This is called soft-masking. Second, one can instead replace every base in the repetitive/low-complexity region with an N, which is termed hard-masking. Soft-masking keeps the underlying sequence, whereas hard-masking discards it.

For most cases you'll want to use either the soft-masked or unmasked reference files. If you're using tools like BWA, TopHat, or Bowtie2 (i.e., almost anything meant to handle NGS data), then the results from using a soft-masked and an unmasked reference will be identical, since most of these tools simply ignore a base's case. However, should you ever need to use a tool that accounts for masking, then already having a soft-masked genome downloaded can be convenient. For that reason, I personally tend to download the soft-masked versions just so I don't have to bother downloading them later.
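To make the distinction concrete, here is a minimal Python sketch of how the three versions relate to each other. The function names and the toy sequence are made up for illustration and are not taken from any particular tool; the point is only that an unmasked or hard-masked sequence can be derived from a soft-masked one, but not the other way around.

```python
import re


def unmask(seq: str) -> str:
    """Return the unmasked sequence: soft-masking is just letter case."""
    return seq.upper()


def hard_mask(seq: str) -> str:
    """Turn a soft-masked sequence into a hard-masked one.

    Every lower-case base is replaced by 'N', so the masked sequence
    content is lost while the sequence length stays the same.
    """
    return re.sub(r"[a-z]", "N", seq)


def masked_intervals(seq: str) -> list[tuple[int, int]]:
    """List soft-masked regions as 0-based, half-open (start, end) pairs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


# Toy example (hypothetical sequence): the lower-case stretch is the masked repeat.
soft = "ACGTacgtacgtTTGA"

print(unmask(soft))            # ACGTACGTACGTTTGA
print(hard_mask(soft))         # ACGTNNNNNNNNTTGA
print(masked_intervals(soft))  # [(4, 12)]
```

This is also why keeping only the soft-masked file around is enough in practice: the other two versions can be regenerated from it if a tool ever needs them.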