Question: Tool that detects data types
fitziano (Switzerland), 3.0 years ago, wrote:

Hi guys

I am looking for a tool that can automatically detect and validate the type of a file.

More specifically, I am interested in distinguishing between SAM, BAM, FASTA, FASTQ (if possible discriminating between these types), BED, BED12, BED15, GFF, GFF2 and GFF3.

The detection should not rely on just the file extension.

Is there an easy way to accomplish something like this?

Thank you in advance

rna-seq tool data type • 1.4k views

The best way would be to build a magic database for `file`: http://linux.die.net/man/5/magic http://linux.die.net/man/1/file . But IMHO, I've always found 'magic' too complicated.

written 3.0 years ago by Pierre Lindenbaum

asked on SE: http://unix.stackexchange.com/questions/154001

written 3.0 years ago by Pierre Lindenbaum

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087), 3.0 years ago, wrote:

OK, from what I gathered on SE, you can create and compile a magic file as follows.

Create a file 'bioinfo' containing the patterns for your formats. For example, according to the specification, a BAM file starts at the first byte (offset 0) with the 4 bytes BAM\1:

0	string	@HD\tVN:1.0\ SO:coordinate	SAM file v1.0 sorted on coordinates
0	string	BAM\1	BAM file v1.0
0	string	CRAM\2\1	CRAM 2.1 file
0	string	BAI\1	BAM index file v1.0
0	regex	[\r\n]^>.*\n[ATGCatgc]*\n	Fasta DNA sequence
0	string	\#\#fileformat=VCFv4\.1 VCF format 4.1
0	string	BCF\4	BCF file v1.0
0	string	BCF\2\1	BCF file v2.0
0	string	TBI\1	Tabix index file v1.0

Compile the magic file; this produces 'bioinfo.mgc':

file -C -m bioinfo

Use this magic file:

file -z -m bioinfo.mgc ex1.sam.gz
ex1.sam.gz: SAM file v1.0 sorted on coordinates (data)
file -z -m bioinfo.mgc file.bam
file.bam: BAM file v1.0 (data)
file -z -m bioinfo.mgc file.fasta
file.fasta: Fasta DNA sequence
file -z -m bioinfo.mgc file.vcf
file.vcf: VCF format 4.1
file -z -m bioinfo.mgc file.gz.tbi
file.gz.tbi: Tabix index file v1.0 (data) 
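
For readers who would rather not compile a magic database, the offset-0 `string` rules above can be approximated in a few lines of Python. This is only a sketch and not part of the original answer: `SIGNATURES`, `read_head` and `sniff` are hypothetical names, the byte signatures are copied from the magic rules above (the octal escapes `\1`, `\2`, `\4` become `\x01`, `\x02`, `\x04`), and gzip decompression is assumed to be a reasonable stand-in for `file -z`, since BAM and tabix files are BGZF, which is gzip-compatible.

```python
import gzip

# Signatures copied from the magic rules above: (leading bytes, description).
SIGNATURES = [
    (b"BAM\x01", "BAM file v1.0"),
    (b"BAI\x01", "BAM index file v1.0"),
    (b"TBI\x01", "Tabix index file v1.0"),
    (b"CRAM\x02\x01", "CRAM 2.1 file"),
    (b"BCF\x02\x01", "BCF file v2.0"),
    (b"BCF\x04", "BCF file v1.0"),
    (b"##fileformat=VCFv4.1", "VCF format 4.1"),
]

GZIP_MAGIC = b"\x1f\x8b"


def read_head(path, size=64):
    """Return the first `size` bytes, looking inside gzip/BGZF data if present."""
    with open(path, "rb") as handle:
        head = handle.read(size)
    if head.startswith(GZIP_MAGIC):
        # Re-open through gzip so the signature is read from the payload.
        with gzip.open(path, "rb") as handle:
            head = handle.read(size)
    return head


def sniff(path):
    """Return the description of the first matching signature, else None."""
    head = read_head(path)
    for magic, description in SIGNATURES:
        if head.startswith(magic):
            return description
    return None
```

Calling `sniff("file.bam")` would then play the role of `file -z -m bioinfo.mgc file.bam`. Like the magic rules, it only identifies a file by its first bytes and does not validate the rest of the content.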

UPDATE: I started a git repo to store the bioinfo formats: https://github.com/lindenb/magic
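
The question also lists line-oriented formats (FASTQ, BED, GFF) that have no binary signature, so a fixed byte string at offset 0 is not enough to tell them apart. A minimal sketch, assuming simple first-line heuristics: `guess_text_format` is a hypothetical helper, not something taken from the repository above, and the rules it applies (a `##gff-version` directive, a `>` or `@` header, tab-separated columns with integer start and end) are deliberately loose approximations of the respective formats.

```python
def guess_text_format(lines):
    """Guess a format from the first few lines of a text file (heuristic only)."""
    lines = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    if not lines:
        return None
    first = lines[0]

    # GFF files usually announce their version in the first directive line.
    if first.startswith("##gff-version"):
        parts = first.split()
        version = parts[1] if len(parts) > 1 else ""
        return "GFF3" if version.startswith("3") else "GFF"

    if first.startswith(">"):
        return "FASTA"

    # FASTQ record: '@' header, sequence, '+' separator, qualities of equal length.
    if (len(lines) >= 4 and first.startswith("@")
            and lines[2].startswith("+")
            and len(lines[1]) == len(lines[3])):
        return "FASTQ"

    # SAM header lines start with '@' followed by a two-letter record type.
    if first[:3] in ("@HD", "@SQ", "@RG", "@PG", "@CO"):
        return "SAM"

    # BED: tab-separated, with integer start and end in columns 2 and 3.
    fields = first.split("\t")
    if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
        return "BED%d" % len(fields) if len(fields) in (12, 15) else "BED"

    return None
```

For example, passing the first four lines of a file (`guess_text_format(lines[:4])`) is enough for these checks. This only guesses from the top of the file; a headerless SAM file or an unusual BED variant can be misclassified, and real validation would require parsing every record.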


That is neat. To let more people benefit from this, could you consider submitting the magic strings directly to http://www.darwinsys.com/file/ and http://freedesktop.org/wiki/Software/shared-mime-info/ ? This would make these patterns recognised by default on many Linux distributions. One could even consider submitting a media type for SAM, BAM, etc. to the IANA (Internet Assigned Numbers Authority). In that case, it would be best to go through the current maintainers of the SAM specification.

written 2.9 years ago by Charles Plessy