Question: Is it possible to guess the sequencing platform used from a FASTQ/BAM file?
5.4 years ago, Andrew (50) wrote:

Purely based on required header information and data/stats pulled from the BAM file, is there any way to guess the sequencing platform (454, Illumina, Ion Torrent, etc.) used to generate data for a BAM file? Does a tool already exist that does this?

So far, all I can come up with are average read length, number of reads produced, and error rate, which vary from platform to platform. I also thought the quality encoding might be useful for this guess. I found the thread "How To Determine The Version Used To Generate Solexa/Illumina Fastq Files?" useful, though that too only yields a guess at what the encoding could be.
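A minimal sketch of the quality-encoding guess, assuming the quality strings have already been pulled out of the FASTQ file. The function name and the returned labels are made up for illustration. The cutoffs follow the commonly cited ASCII ranges (Phred+33 starts at 33, Solexa+64 at 59, Phred+64 at 64), so the result is only a guess, and a small sample of high-quality reads can be ambiguous:

```python
def guess_quality_encoding(quality_strings):
    """Guess the FASTQ quality offset from the observed ASCII range.

    Heuristic only: the labels are illustrative, not an official naming.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        return "unknown"
    lo, hi = min(codes), max(codes)
    if lo < 59:
        # Characters below ';' only occur with an offset of 33.
        return "phred33"
    if hi <= 74:
        # Everything in 59..74 fits high-quality Phred+33 as well as
        # low-quality +64 data, so no call is made.
        return "ambiguous"
    if lo < 64:
        # Solexa scores can be negative, giving characters 59..63.
        return "solexa64"
    return "phred64"
```

Feeding it more reads narrows the observed range, so it is worth sampling a few thousand quality lines rather than the first handful.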

Any ideas for other stats that may be useful would be greatly appreciated.


Tags: bam, fastq

To a certain extent this is often possible. Read name formatting is often machine dependent.

written 5.4 years ago by Devon Ryan (97k)

Keep in mind SAM and BAM files may include reads from different runs, samples, technologies, etc.

written 5.4 years ago by h.mon (31k)

This should be interesting - finding discrete patterns (or sets of patterns) to predict data sources.

written 5.4 years ago by _r_am (31k)

The SAM spec defines tags on the read group (@RG) header lines that could help: platform/technology (PL) and platform model (PM). These are usually filled in by the aligner or by whoever made the file, so you're completely at their mercy.
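As a rough sketch, the PL and PM tags can be pulled out of the header text with plain string handling. This assumes the header has already been dumped to a string (for example with `samtools view -H file.bam`); the function name is hypothetical:

```python
def read_group_platforms(header_text):
    """Collect ID, PL and PM from every @RG line of a SAM header.

    Missing tags come back as None, since PL and PM are optional.
    """
    groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs.
        fields = dict(
            f.split(":", 1) for f in line.split("\t")[1:] if ":" in f
        )
        groups.append(
            {"ID": fields.get("ID"), "PL": fields.get("PL"), "PM": fields.get("PM")}
        )
    return groups
```

An empty list means there are no read groups at all, and a file with several read groups can legitimately report different platforms.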

As suggested, read naming schemes are usually machine-dependent.  This depends on having the raw data, though.  In some cases the reads may be relabeled with uninformative names, and then you're out of luck.  For example, the SRA does this, and I've seen published datasets that have been aggressively filtered with renamed reads.
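A hedged sketch of the read-name approach: match each name against a few regular expressions. The patterns below are heuristics based on commonly seen naming layouts (colon-separated Illumina names, 5-character run ID plus row and column for Ion Torrent, 14-character 454 accessions, SRA-style relabeling). They are assumptions rather than an authoritative list, and the function name and labels are invented for illustration:

```python
import re

# Checked in order; the first matching pattern wins.
NAME_PATTERNS = [
    # instrument:run:flowcell:lane:tile:x:y (Illumina, Casava 1.8+ style)
    ("illumina", re.compile(r"^[A-Za-z0-9_-]+:\d+:[A-Za-z0-9-]+:\d+:\d+:\d+:\d+$")),
    # instrument:lane:tile:x:y with optional #index and /1 or /2 (older Illumina)
    ("illumina_old", re.compile(r"^[A-Za-z0-9_-]+:\d+:\d+:\d+:\d+(#[^/\s]+)?(/[12])?$")),
    # 5-character run ID, then row and column (Ion Torrent style)
    ("iontorrent", re.compile(r"^[A-Z0-9]{5}:\d+:\d+$")),
    # 14-character alphanumeric accession (454 style)
    ("454", re.compile(r"^[A-Z0-9]{14}$")),
    # Names rewritten by the SRA carry no platform information
    ("sra_relabeled", re.compile(r"^[SED]RR\d+\.\d+")),
]


def guess_platform_from_name(read_name):
    """Return a platform label for one read name, or 'unknown'."""
    tokens = read_name.lstrip("@").split()
    if not tokens:
        return "unknown"
    name = tokens[0]  # drop the FASTQ comment after the first space
    for label, pattern in NAME_PATTERNS:
        if pattern.match(name):
            return label
    return "unknown"
```

Tallying the labels over a few thousand names (for instance with `collections.Counter`) is more robust than trusting a single read, especially for files merged from several runs.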

written 5.4 years ago by matted (7.3k)

Interesting question. It is very likely possible to detect the platform from the data itself. For example, adapter contamination (see whether some of your reads contain GATCGGAA, part of the common Illumina adapter sequence AGATCGGAAGAGC), the error distribution, read lengths and orientations (454 produces variable read lengths), and many other pieces of information could be combined to identify the platform. But there is probably no tool that does this, since it is just not what scientists use the data for.
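Two of these signals, adapter contamination and read-length variability, are easy to sketch. The example below assumes the read sequences are already in memory and uses AGATCGGAAGAGC (the common Illumina adapter prefix, which contains the GATCGGAA motif) as the default search string. The function name and the returned keys are hypothetical, and a plain substring search misses adapters that are truncated at the read end or contain sequencing errors:

```python
from statistics import mean, pstdev

# Assumed default: the widely used Illumina adapter prefix.
ILLUMINA_ADAPTER_PREFIX = "AGATCGGAAGAGC"


def summarize_reads(sequences, adapter=ILLUMINA_ADAPTER_PREFIX):
    """Summarize adapter hits and read-length spread for a sample of reads."""
    seqs = [s.upper() for s in sequences if s]
    if not seqs:
        raise ValueError("no reads given")
    lengths = [len(s) for s in seqs]
    # Count reads containing the full adapter string anywhere.
    with_adapter = sum(adapter in s for s in seqs)
    return {
        "n_reads": len(seqs),
        "adapter_fraction": with_adapter / len(seqs),
        "mean_length": mean(lengths),
        "length_sd": pstdev(lengths),
        # Untrimmed Illumina reads tend to share one length;
        # 454 and Ion Torrent reads vary.
        "uniform_length": min(lengths) == max(lengths),
    }
```

Note that trimming makes Illumina reads variable in length too, so these numbers are hints to combine with other evidence rather than a verdict.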

modified 5.4 years ago • written 5.4 years ago by Istvan Albert ♦♦ (85k)

