Question: On conventions for human chromosome numbering in file names (1 vs 01, etc.)
3 months ago by
kynnjo20
United States
kynnjo20 wrote:

I am relatively new to bioinformatics, though I have been doing scientific programming for a few years now.

The following example is illustrative of a recurring situation:

% ls -1
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr1.txt
chr20.txt
chr21.txt
chr22.txt
chr2.txt
chr3.txt
chr4.txt
chr5.txt
chr6.txt
chr7.txt
chr8.txt
chr9.txt
chrx.txt
chry.txt

Note how these file names, by default, don't get listed in the normal numeric ordering. To put it differently, their lexicographic and numeric orderings do not coincide.

My temptation (which may be a sign of "professional deformation") is to rename those files to something like this:

% ls -1
chr01.txt
chr02.txt
chr03.txt
chr04.txt
chr05.txt
chr06.txt
chr07.txt
chr08.txt
chr09.txt
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr20.txt
chr21.txt
chr22.txt
chrx.txt
chry.txt

...or maybe even this:

% ls -1
chr01.txt
chr02.txt
chr03.txt
chr04.txt
chr05.txt
chr06.txt
chr07.txt
chr08.txt
chr09.txt
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr20.txt
chr21.txt
chr22.txt
chr_x.txt
chr_y.txt

...so that lexicographic order matches numeric order and (much less importantly) the names line up when printed in a column.
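To make the renaming idea concrete, here is a minimal Python sketch of the zero-padding step. The function name `pad_chromosome_name` and the `width` parameter are made up for illustration, and the sketch assumes file names of the form `chr<label>.<ext>` as in the listings above. It only computes the new names; it does not touch any files.

```python
import re

def pad_chromosome_name(filename: str, width: int = 2) -> str:
    """Zero-pad the numeric chromosome label in names like 'chr1.txt'."""
    # Only purely numeric labels are padded; names such as 'chrx.txt'
    # do not match the pattern and are returned unchanged.
    return re.sub(
        r"^chr(\d+)(?=\.)",
        lambda m: "chr" + m.group(1).zfill(width),
        filename,
    )

names = ["chr10.txt", "chr1.txt", "chr2.txt", "chr22.txt", "chrx.txt"]
# After padding, plain lexicographic sorting agrees with numeric order.
padded = sorted(pad_chromosome_name(n) for n in names)
print(padded)
```

Whether renaming is a good idea is a separate question, since the padded names no longer match the sequence names used inside most reference files.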

Putting aside the fact that much in-house bioinformatics code out there is already dependent on the 1, 2, 3-type numbering, would it be an abomination in the eyes of most bioinformaticians to use 01, 02, 03, etc. instead of 1, 2, 3, etc. to number human chromosomes?

genomics bioinformatics • 250 views
modified 3 months ago by Pierre Lindenbaum • written 3 months ago by kynnjo

It may be tempting, but don't rely on the file system to order things for you. Use version sort in GNU tools instead, for instance ls -v or sort -V, to order file names that share a prefix "naturally". See: https://www.gnu.org/software/coreutils/manual/html_node/Details-about-version-sort.html

written 3 months ago by Alex Reynolds
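The same "natural" ordering can be had inside a script without renaming anything. The following is a rough Python sketch of a natural sort key; `natural_key` is a hypothetical helper, and it is a simplified analogue of GNU version sort, not a reimplementation of its exact rules.

```python
import re

def natural_key(name: str):
    """Split a name into text and integer chunks so digit runs compare numerically."""
    # re.split with a capturing group keeps the digit runs, which land at
    # odd indexes; even indexes always hold (possibly empty) text.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

names = ["chr10.txt", "chr1.txt", "chr2.txt", "chrx.txt", "chr22.txt", "chry.txt"]
ordered = sorted(names, key=natural_key)
print(ordered)
```

Because text and integer chunks always alternate in the same positions, the keys never compare a string against an integer, so the sort does not raise a TypeError for names like these.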
3 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum117k wrote:

Are you going to work with data from Ensembl? Use the Ensembl nomenclature (1, 2, 3, ...).

Are you going to work with data from UCSC? Use the UCSC nomenclature (chr1, chr2, chr3, ...).

Are you going to work with a specific reference? Use the order and the names from that reference.

Are you going to work with more than one nomenclature? Use whatever you want. Some tools can handle more than one nomenclature; for example, the SAM specification (https://samtools.github.io/hts-specs/SAMv1.pdf) defines the AN tag on @SQ header lines:

Alternative reference sequence names. A comma-separated list of alternative names that tools may use when referring to this reference sequence.
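When two nomenclatures do have to coexist, a small translation layer is one option. The sketch below is only an illustration: `to_ucsc` and `to_ensembl` are hypothetical helpers, and they assume the primary chromosomes only (1 to 22, X, Y, and the mitochondrial genome, which is MT in Ensembl and chrM in UCSC). Unplaced scaffolds and alternate contigs are named differently in each scheme and would need an explicit mapping table.

```python
def to_ucsc(name: str) -> str:
    """Convert an Ensembl-style name ('1', 'X', 'MT') to UCSC style ('chr1', 'chrX', 'chrM')."""
    if name.startswith("chr"):
        return name  # already UCSC style
    return "chrM" if name == "MT" else "chr" + name

def to_ensembl(name: str) -> str:
    """Convert a UCSC-style name ('chr1', 'chrX', 'chrM') to Ensembl style ('1', 'X', 'MT')."""
    if not name.startswith("chr"):
        return name  # already Ensembl style
    return "MT" if name == "chrM" else name[len("chr"):]

for ensembl_name in ["1", "22", "X", "MT"]:
    print(ensembl_name, "->", to_ucsc(ensembl_name))
```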

written 3 months ago by Pierre Lindenbaum