MiSeq runs produce some report files in XML format. In particular, Data/Intensities/BaseCalls/Alignment/GenerateFASTQRunStatistics.xml
reports various counts, such as the number of reads passing filter, for the whole run and for each index separately. Here is an oversimplified example.
<StatisticsGenerateFASTQ>
  <RunStats>
    <NumberOfClustersPF>11433659</NumberOfClustersPF>
    <NumberOfClustersRaw>11969395</NumberOfClustersRaw>
  </RunStats>
  <OverallSamples>
    <SummarizedSampleStatistics>
      <NumberOfClustersPF>43181</NumberOfClustersPF>
      <NumberOfClustersRaw>49080</NumberOfClustersRaw>
      <SampleNumber>1</SampleNumber>
    </SummarizedSampleStatistics>
    <SummarizedSampleStatistics>
      <NumberOfClustersPF>79129</NumberOfClustersPF>
      <NumberOfClustersRaw>85016</NumberOfClustersRaw>
      <SampleNumber>2</SampleNumber>
    </SummarizedSampleStatistics>
  </OverallSamples>
  <PairedEndByGenome />
  <Samples />
</StatisticsGenerateFASTQ>
Using the XmlStarlet command-line tool, I could extract the total number of clusters passing filter with the command xmlstarlet sel -t -v //RunStats/NumberOfClustersPF GenerateFASTQRunStatistics.xml.
However, to do so for a given index, one first needs to identify which SummarizedSampleStatistics
node has a SampleNumber child containing that index number, and then extract the value of the NumberOfClustersPF
node that is its sibling.
Does anybody know an XPath expression that does this, so that it could become a one-liner with XmlStarlet? I tried things like //ancestor::SampleNumber[.="2"]/NumberOfClustersPF
without success...
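
To make the intended lookup concrete, here is a minimal sketch in Python (standard library only) rather than XmlStarlet. It puts a predicate on the parent node, SummarizedSampleStatistics[SampleNumber='2'], and then steps down to NumberOfClustersPF. The function name clusters_pf_for_sample and the SAMPLE_XML constant are made up for illustration, and the XML is just the simplified example from above, not a real report.

```python
import xml.etree.ElementTree as ET

# Simplified report copied from the example above (not a full MiSeq file).
SAMPLE_XML = """<StatisticsGenerateFASTQ>
  <RunStats>
    <NumberOfClustersPF>11433659</NumberOfClustersPF>
    <NumberOfClustersRaw>11969395</NumberOfClustersRaw>
  </RunStats>
  <OverallSamples>
    <SummarizedSampleStatistics>
      <NumberOfClustersPF>43181</NumberOfClustersPF>
      <NumberOfClustersRaw>49080</NumberOfClustersRaw>
      <SampleNumber>1</SampleNumber>
    </SummarizedSampleStatistics>
    <SummarizedSampleStatistics>
      <NumberOfClustersPF>79129</NumberOfClustersPF>
      <NumberOfClustersRaw>85016</NumberOfClustersRaw>
      <SampleNumber>2</SampleNumber>
    </SummarizedSampleStatistics>
  </OverallSamples>
  <PairedEndByGenome />
  <Samples />
</StatisticsGenerateFASTQ>"""


def clusters_pf_for_sample(xml_text, sample_number):
    """Return NumberOfClustersPF for one sample, or None if it is absent.

    Hypothetical helper: the predicate selects the SummarizedSampleStatistics
    element whose SampleNumber child equals sample_number, then the path
    continues to that element's NumberOfClustersPF child.
    """
    root = ET.fromstring(xml_text)
    path = (
        f".//SummarizedSampleStatistics[SampleNumber='{sample_number}']"
        "/NumberOfClustersPF"
    )
    node = root.find(path)
    return int(node.text) if node is not None else None


if __name__ == "__main__":
    print(clusters_pf_for_sample(SAMPLE_XML, 2))  # 79129
```

The predicate is plain XPath 1.0, so the same expression should also work as an XmlStarlet one-liner, e.g. xmlstarlet sel -t -v '//SummarizedSampleStatistics[SampleNumber="2"]/NumberOfClustersPF' GenerateFASTQRunStatistics.xml, although that command was not run against a real report here. The attempt with //ancestor::SampleNumber[.="2"]/NumberOfClustersPF finds nothing because it selects the SampleNumber element itself and then looks for a NumberOfClustersPF child inside it, whereas the two elements are siblings.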