Tool that detects data types
9.6 years ago
fitziano ▴ 40

Hi guys

I am looking for a tool that can automatically detect and validate the type of a file.

More specifically, I am interested in distinguishing between SAM, BAM, FASTA, FASTQ (if possible discriminating between these types), BED, BED12, BED15, GFF, GFF2, GFF3.

The detection should not rely on just the file extension.

Is there an easy way to accomplish something like this?

Thank you in advance

datatype RNA-Seq • 3.7k views

The best way would be to build a magic database for `file`: http://linux.die.net/man/5/magic http://linux.die.net/man/1/file . But IMHO, I've always found 'magic' too complicated.

9.6 years ago

OK, from what I gathered on SE, you can create and compile a magic file as follows.

Create a file 'bioinfo' containing the patterns for your formats. For example, according to the specification, a BAM file starts at the first byte (offset 0) with the 4 bytes BAM\1:

0	string	@HD\tVN:1.0\ SO:coordinate	SAM file v1.0 sorted on coordinates
0	string	BAM\1	BAM file v1.0
0	string	CRAM\2\1	CRAM 2.1 file
0	string	BAI\1	BAM index file v1.0
0	regex	[\r\n]^>.*\n[ATGCatgc]*\n	Fasta DNA sequence
0	string	\#\#fileformat=VCFv4\.1	VCF format 4.1
0	string	BCF\4	BCF file v1.0
0	string	BCF\2\1	BCF file v2.0
0	string	TBI\1	Tabix index file v1.0

Compile it into the magic file 'bioinfo.mgc':

file -C -m bioinfo

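Use this magic file (-z makes `file` look inside compressed files, which matters here because BAM files are BGZF-compressed):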

file -z -m bioinfo.mgc ex1.sam.gz
ex1.sam.gz: SAM file v1.0 sorted on coordinates (data)
file -z -m bioinfo.mgc file.bam
file.bam: BAM file v1.0 (data)
file -z -m bioinfo.mgc file.fasta
file.fasta: Fasta DNA sequence
file -z -m bioinfo.mgc file.vcf
file.vcf: VCF format 4.1
file -z -m bioinfo.mgc file.gz.tbi
file.gz.tbi: Tabix index file v1.0 (data) 

UPDATE: I started a git repo to store the bioinfo formats: https://github.com/lindenb/magic
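
For cases where `file` is not available, the same idea can be sketched in a few lines of Python. This is only an illustration, not part of the repo above: the names `read_head` and `sniff` are hypothetical, only some of the signatures listed above are included, and SAM is recognised only when the file starts with an @HD header line.

```python
import gzip

# A few of the signatures from the magic entries above, matched against the
# start of the file (after decompression when the file is gzip/BGZF).
SIGNATURES = [
    (b"BAM\x01", "BAM"),
    (b"BAI\x01", "BAM index"),
    (b"TBI\x01", "Tabix index"),
    (b"CRAM", "CRAM"),
    (b"##fileformat=VCF", "VCF"),
    (b"@HD\t", "SAM"),
    (b">", "FASTA"),
]


def read_head(path, size=64):
    """Return the first bytes of a file, looking inside gzip/BGZF if needed."""
    with open(path, "rb") as handle:
        head = handle.read(size)
    # 0x1f 0x8b is the gzip magic number; BGZF (used by BAM) is gzip-compatible.
    if head[:2] == b"\x1f\x8b":
        with gzip.open(path, "rb") as handle:
            head = handle.read(size)
    return head


def sniff(path):
    """Guess the format from the leading bytes, ignoring the file extension."""
    head = read_head(path)
    for signature, name in SIGNATURES:
        if head.startswith(signature):
            return name
    return "unknown"
```

Calling `sniff("file.bam")` plays the same role as `file -z -m bioinfo.mgc file.bam`. Like the magic file, it only detects the format from the first bytes and does not validate the rest of the file. FASTQ is left out on purpose: its records start with '@', just like SAM header lines, so a leading-byte check alone cannot separate the two reliably.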


That is neat. So that more people can benefit from this, could you consider submitting the magic strings directly to http://www.darwinsys.com/file/ and http://freedesktop.org/wiki/Software/shared-mime-info/? This would make these patterns recognised by default on many Linux distributions. One could even consider submitting a media type for SAM, BAM, etc. files to the IANA (Internet Assigned Numbers Authority). In that case, it would be best to do it through the current maintainers of the SAM specification.
