Is it possible to guess the sequencing platform used based on a FASTQ/BAM file?
8.8 years ago
Andrew ▴ 60

Based purely on the required header information and on data/stats pulled from a BAM file, is there any way to guess the sequencing platform (454, Illumina, Ion Torrent, etc.) that generated the data? Does a tool that does this already exist?

So far all I can find is average read length, number of reads produced, and error rate, which vary from platform to platform. I also thought the quality encoding might be useful for this guess. I found the post "How To Determine The Version Used To Generate Solexa/Illumina Fastq Files?" useful, though that too only gives a guess at what the encoding could be.
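
As a rough illustration of the quality-encoding idea, the sketch below guesses the offset from the range of ASCII codes seen in the quality strings. The function name and the returned labels are hypothetical, and the thresholds assume the commonly described ranges (Phred+33 starting at `!`, Solexa+64 starting at `;`, Phred+64 starting at `@`). It narrows down the encoding only and does not identify a platform by itself.

```python
def guess_quality_offset(quality_strings):
    """Guess the quality offset from the ASCII codes seen (heuristic only)."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        return "unknown"
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' (59) cannot occur with an offset of 64.
        return "phred33"
    if highest > 74:
        # Codes above 'J' (74) are not expected with an offset of 33.
        # Codes 59..63 would be negative Solexa scores.
        return "phred64" if lowest >= 64 else "solexa64"
    # Everything falls in 59..74: consistent with more than one encoding.
    return "ambiguous"
```

Sampling a few thousand reads is usually enough for this kind of check; a small sample of high-quality reads can legitimately come back as "ambiguous".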

Any ideas for other stats that may be useful would be greatly appreciated.

Thanks!

BAM FASTQ • 3.6k views

To a certain extent this is often possible. Read name formatting is often machine dependent.
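
A minimal sketch of matching read names against per-platform patterns is shown below. The regular expressions and labels are assumptions based on commonly seen naming layouts (colon-separated Illumina fields, three-field Ion Torrent names, 14-character 454 accessions). They are not taken from any vendor specification and will miss renamed or unusual reads.

```python
import re

# Hypothetical patterns for illustration; real names vary by software version.
NAME_PATTERNS = [
    # instrument:run:flowcell:lane:tile:x:y, optionally followed by a comment
    ("illumina_casava_1.8+",
     re.compile(r"^[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+(\s|$)")),
    # instrument:lane:tile:x:y with optional #index and /1 or /2 suffix
    ("illumina_pre_1.8",
     re.compile(r"^[^:\s]+:\d+:\d+:\d+:\d+(#\S+)?(/[12])?$")),
    # five-character run ID, then row and column
    ("ion_torrent", re.compile(r"^[A-Z0-9]{5}:\d+:\d+$")),
    # 14-character alphanumeric accession
    ("454", re.compile(r"^[A-Z0-9]{14}$")),
]


def guess_platform_from_name(read_name):
    """Return the label of the first matching pattern, or 'unknown'."""
    name = read_name.lstrip("@")  # tolerate a leading FASTQ '@'
    for label, pattern in NAME_PATTERNS:
        if pattern.match(name):
            return label
    return "unknown"
```

Running this over a sample of reads and counting the labels gives a more honest picture than checking a single read, since a file can mix sources.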


Keep in mind SAM and BAM files may include reads from different runs, samples, technologies, etc.


This should be interesting - finding discrete patterns (or sets of patterns) to predict data sources.


The SAM spec defines tags on the read group (@RG) header line that could help: platform/technology (PL) and platform model (PM). These are usually filled in by the aligner or by the user who made the file, so you're completely at their mercy.
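
For illustration, a small sketch that pulls PL and PM out of the @RG lines, given the header as plain text (for example, the output of `samtools view -H file.bam`). The function name is hypothetical, and a missing tag is reported as `None` rather than guessed.

```python
def read_group_platforms(sam_header_text):
    """Map each read group ID to its (PL, PM) tags from SAM header text."""
    groups = {}
    for line in sam_header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        fields = {}
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        for item in line.split("\t")[1:]:
            tag, _, value = item.partition(":")
            fields[tag] = value
        groups[fields.get("ID", "")] = (fields.get("PL"), fields.get("PM"))
    return groups
```

Because each read group is reported separately, a file that merges runs from different technologies shows up as several entries rather than one.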

As suggested, read naming schemes are usually machine-dependent. This depends on having the raw data, though. In some cases the reads may be relabeled with uninformative names, and then you're out of luck. For example, the SRA does this, and I've seen published datasets that have been aggressively filtered with renamed reads.


Interesting question. It is very likely possible to detect the platform from the data itself. For example, adapter contamination (see if a few of your reads end with GATCGGAA, from the Illumina adapter), the error distribution, read lengths and orientations (the 454 produces variable read lengths), and many other pieces of information could be combined to help identify the platform. But there is probably no tool that does this, since it is just not what scientists use the data for.
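
Two of these signals, adapter contamination and read-length variability, can be sketched as follows. The function name and the returned keys are hypothetical, and the adapter fragment is simply the one quoted above. The code checks for the fragment anywhere in the read, which is a looser test than "ends with", and any cutoff for calling a platform would have to be chosen empirically.

```python
from statistics import mean, pstdev

ADAPTER_FRAGMENT = "GATCGGAA"  # fragment quoted in the thread, not a full adapter


def summarize_reads(sequences, adapter=ADAPTER_FRAGMENT):
    """Summarize read lengths and the fraction of reads containing the adapter."""
    sequences = list(sequences)  # allow generators; we iterate twice
    if not sequences:
        return None
    lengths = [len(s) for s in sequences]
    with_adapter = sum(1 for s in sequences if adapter in s.upper())
    return {
        "n_reads": len(sequences),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        # A standard deviation near zero suggests fixed-length reads;
        # a wide spread is what variable-length platforms tend to produce.
        "length_stdev": pstdev(lengths),
        "adapter_fraction": with_adapter / len(sequences),
    }
```

Trimmed Illumina data also has variable read lengths, so the length spread is only suggestive when the reads are raw.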
