On conventions for human chromosome numbering in file names (1 vs 01, etc.)
2.9 years ago
kynnjo ▴ 40

I am relatively new to bioinformatics, though I have been doing scientific programming for a few years now.

The following example is illustrative of a recurring situation:

% ls -1
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr1.txt
chr20.txt
chr21.txt
chr22.txt
chr2.txt
chr3.txt
chr4.txt
chr5.txt
chr6.txt
chr7.txt
chr8.txt
chr9.txt
chrx.txt
chry.txt


Note how these file names, by default, don't get listed in the normal numeric ordering. To put it differently, their lexicographic and numeric orderings do not coincide.
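The mismatch is easy to reproduce outside the shell. The sketch below rebuilds the file names from the listing and sorts them as plain strings; the variable names are illustrative only. Python compares code points, so `chr1.txt` sorts first here, whereas the `ls` listing above places it after `chr19.txt`, presumably because of locale-aware collation. Either way the result is not the numeric order.

```python
# File names as in the listing above: chr1..chr22 plus chrx and chry.
labels = [str(n) for n in range(1, 23)] + ["x", "y"]
numeric_order = [f"chr{label}.txt" for label in labels]

# Plain string sorting compares character by character,
# so "chr10.txt" lands before "chr2.txt".
lexicographic_order = sorted(numeric_order)

print(lexicographic_order[:4])
print(lexicographic_order == numeric_order)
```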

My temptation (which may be a sign of "professional deformation") is to rename those files to something like this:

% ls -1
chr01.txt
chr02.txt
chr03.txt
chr04.txt
chr05.txt
chr06.txt
chr07.txt
chr08.txt
chr09.txt
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr20.txt
chr21.txt
chr22.txt
chrx.txt
chry.txt


...or maybe even this:

% ls -1
chr01.txt
chr02.txt
chr03.txt
chr04.txt
chr05.txt
chr06.txt
chr07.txt
chr08.txt
chr09.txt
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr20.txt
chr21.txt
chr22.txt
chr_x.txt
chr_y.txt


...so that the names sort naturally in numeric order and (much less importantly) line up when printed in a column.
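For reference, the zero-padded names in the first renamed listing can be derived mechanically. This is a minimal sketch, assuming every file is named `chr<label>.<extension>`; the helper `zero_pad_chromosome` is hypothetical, and it only computes the new name rather than renaming anything on disk.

```python
import re


def zero_pad_chromosome(filename: str, width: int = 2) -> str:
    """Return filename with a numeric chromosome label left-padded with zeros."""
    match = re.fullmatch(r"chr(\d+)(\..+)", filename)
    if match is None:
        # chrx.txt, chry.txt and anything unexpected are returned unchanged.
        return filename
    number, extension = match.groups()
    return f"chr{int(number):0{width}d}{extension}"


original = ["chr10.txt", "chr1.txt", "chr22.txt", "chr2.txt", "chrx.txt"]
padded = sorted(zero_pad_chromosome(name) for name in original)
print(padded)
```

Applying the new names with something like `pathlib.Path.rename` would be a separate step, and printing the old and new names first is a cheap way to check the mapping before touching any files.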

Putting aside the fact that much in-house bioinformatics code out there is already dependent on the 1, 2, 3-type numbering, would it be an abomination in the eyes of most bioinformaticians to use 01, 02, 03, etc. instead of 1, 2, 3, etc. to number human chromosomes?


It may be tempting, but don't rely on the file system to order things for you. Use version sort in GNU tools instead, for instance ls -v or sort -V (note the capital V for sort), to order file names that share a prefix "naturally". See: https://www.gnu.org/software/coreutils/manual/html_node/Details-about-version-sort.html
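The same idea carries over to scripts that cannot call GNU tools. The following is a rough analogue of a natural sort, not a reimplementation of the GNU version-sort rules; `natural_key` is a hypothetical helper that splits a name into text and digit runs and compares the digit runs as integers.

```python
import re


def natural_key(name: str) -> list:
    """Split name into alternating text and digit runs; digit runs become ints."""
    parts = re.split(r"(\d+)", name)
    # With one capture group, odd positions always hold the digit runs.
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]


names = ["chr10.txt", "chr1.txt", "chr2.txt", "chrx.txt", "chr22.txt"]
print(sorted(names, key=natural_key))
```

Because text and digit runs always alternate in the key, two keys never compare a string against an integer, so the unpadded names sort in numeric order without any renaming.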

2.9 years ago

Are you going to work with data from Ensembl? Use the Ensembl nomenclature (1, 2, 3, ...).

Are you going to work with data from UCSC? Use the UCSC nomenclature (chr1, chr2, chr3, ...).

Are you going to work with a specific reference? Use the order and the names from that reference.

Are you going to work with more than one nomenclature? Use whatever you want. Some tools can handle more than one nomenclature, e.g. via the AN tag of the @SQ header line in the SAM specification (https://samtools.github.io/hts-specs/SAMv1.pdf):

Alternative reference sequence names. A comma-separated list of alternative names that tools may use when referring to this reference sequence.
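To illustrate how such aliases could be used, here is a minimal sketch that reads @SQ header lines and maps every alternative name back to the primary SN name. The header lines are made up for the example, and `alias_map` is a hypothetical helper, not part of samtools or any other library.

```python
def alias_map(header_lines):
    """Map each SN name and each AN alias to the primary SN name."""
    aliases = {}
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue  # only @SQ lines describe reference sequences
        # Every remaining field has the form TAG:VALUE.
        tags = dict(field.split(":", 1) for field in fields[1:])
        primary = tags["SN"]
        aliases[primary] = primary
        for alt in tags.get("AN", "").split(","):
            if alt:
                aliases[alt] = primary
    return aliases


# Illustrative header only; real files list one @SQ line per sequence.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422\tAN:1",
    "@SQ\tSN:chrM\tLN:16569\tAN:MT,M",
]
lookup = alias_map(example_header)
print(lookup["1"], lookup["MT"])
```

With a lookup like this, code can accept either naming style on input and normalise to whatever the reference in use declares.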