checkstrand tool (from BBMap suite) question
6 months ago
GenoMax 141k

Brian Bushnell: I was looking at the checkstrand.sh application included in the latest version of BBMap. While running a test analysis on a human sample, I noticed that the tool's output was identical in the following two scenarios using a single-end (R1) file.

  1. Using a transcriptome file and transcriptome=t option
  2. Using just sequence file

Here is an example of the output I got:

Depth_Analysis:
Strandedness:   94.47%
AvgKmerDepth:   3.973
Kmers>Depth1:   0.5003

Stop_Codon_Analysis:
MajorStrandORF: Minus
AvgReadLen:     75.00
AvgORFLen+:     70.15
AvgORFLen-:     73.84
AvgStopCount+:  0.2492
AvgStopCount-:  0.2823
GC_Content:     0.4780

PolyA_Analysis:
MajorStrandPA:  Plus
PolyA/(PA+PT):  0.680778
PolyAFraction:  0.001290

Read_Gene_Calling_Analysis:
MajorStrandRGC: Plus
P/(P+M)_Ratio:  0.569905
AvgScorePlus:   52.389245
AvgScoreMinus:  48.959957
UsedFraction:   0.000315

For a biologist this output is going to be confusing. Which of these stats should be used for defining the strandedness of a library?

Some of the moderators have expressed a desire to know how this tool works. It is apparent that it is based on k-mer analysis but can you add a paragraph with some explanation?

bbmap checkstrand.sh bbtools

The top part should be highly accurate in all cases; the rest are more data-dependent. In this case you have roughly 94% strandedness, but it's hard to say whether that's the plus or minus strand (for read 1). Actually, the shellscript has a decent explanation inside - because this tool is a lot more confusing than most, I tried to be fairly explicit there instead of just listing the usage flags as usual - but I keep that concise, so I'll go into some more detail here.

Block 1: Depth Analysis

Using BBSketch libraries, a sketch (essentially, a subset of kmers) is made of the library. This is a normal - or, I'll refer to it as canonical - sketch, meaning that every kmer is compared to its reverse complement, then whichever is greater (in 2-bit encoding) is used as the canonical version. That is hashed, and the hashcodes are stored, sorted, and the top X hashcodes (in this case, 20000 by default) plus their counts are retained, and that's the sketch. Concurrently, a second sketch is also made, just using the forward kmers. Now we have two arrays each of 20000 hashcodes and counts.
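
To make that concrete, here is a minimal sketch of the idea (my own illustration, not BBSketch's actual code; the kmer length, hash function, and toy input are placeholders):

import java.util.*;

// Minimal illustration of building a "forward" and a "canonical" kmer sketch.
// Not BBSketch's actual implementation; k, the hash, and the input are placeholders.
public class SketchDemo {
    static final int K = 5;           // the real tool uses a longer k
    static final int SKETCH_SIZE = 20000; // default number of retained hashcodes

    // 2-bit encode a kmer (A=0, C=1, G=2, T=3); returns -1 if it contains other letters.
    static long encode(String kmer) {
        long code = 0;
        for (int i = 0; i < kmer.length(); i++) {
            int b = "ACGT".indexOf(kmer.charAt(i));
            if (b < 0) return -1;
            code = (code << 2) | b;
        }
        return code;
    }

    // 2-bit encoding of the reverse complement.
    static long revComp(long code, int k) {
        long rc = 0;
        for (int i = 0; i < k; i++) {
            rc = (rc << 2) | (3 - (code & 3)); // complement of the low 2-bit base
            code >>= 2;
        }
        return rc;
    }

    // Toy hash finalizer; the real tool uses its own hash function.
    static long hash(long x) {
        x ^= (x >>> 33); x *= 0xff51afd7ed558ccdL; x ^= (x >>> 33);
        return x;
    }

    public static void main(String[] args) {
        String read = "AACCTAGGTTAACCT"; // toy input
        Map<Long, Integer> forward = new HashMap<>(), canonical = new HashMap<>();
        for (int i = 0; i + K <= read.length(); i++) {
            long fwd = encode(read.substring(i, i + K));
            if (fwd < 0) continue;
            long canon = Math.max(fwd, revComp(fwd, K)); // whichever is greater is canonical
            forward.merge(hash(fwd), 1, Integer::sum);
            canonical.merge(hash(canon), 1, Integer::sum);
        }
        // A real sketch would sort the hashcodes and keep only the top SKETCH_SIZE of them,
        // along with their counts; here we just print what was collected.
        System.out.println("forward kmers:   " + forward.size());
        System.out.println("canonical kmers: " + canonical.size());
    }
}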

These sketches are compared to each other by walking down the arrays and seeing if the hashcodes are equal. In randomly-stranded data you would expect roughly 50% of the forward codes to be present in the canonical array - and vice-versa - but it's a different 50%, since there are twice as many forward kmer hashcodes as canonical kmer hashcodes in the code space (half of the forward kmers would instead be transformed into reverse kmers). For example, AACCT and AGGTT could both be present in the forward sketch, but only AGGTT would be present in the canonical sketch, since those are reverse-complements and AGGTT is bigger. As you walk down the arrays, you might see AACCT's hashcode in the forward array, and it would get ignored since it's not in the canonical array. If you see AGGTT in the forward array, it will also be present in the canonical array (that's mandatory); but if you see AGGTT in the canonical array, it might not be present in the forward array.

In this example, let's say AGGTT is in both arrays, with a count of 10 in the canonical array and 5 in the forward array. That's (almost) what we would expect from unstranded data - you see the kmer 5 times in the forward orientation and 5 in the reverse orientation. On the other hand, if we saw GGGCC in the canonical array but NOT in the forward array, that means we encountered it 10 times in the reverse orientation and 0 times in the forward orientation (note that we stop walking when we go off the end of either array, so some elements are ignored in one of the arrays).

In the above case, we have 2 pairs of counts: 10 and 5, and 10 and 0. In each case we select the lower of the two counts (5 and 0) and compare them to the average minor allele count of 10 coin flips, which is ~3.269368. Why is it not 5? By minor allele count, I mean: if you flip a coin 10 times, take the lesser of the number of heads or tails, so the maximum possible is 5 (at 6 it would become the major allele). Therefore the minor allele count could be 0, 1, 2, 3, 4, or 5, and since each has a non-zero chance of occurring, the average MUST be less than 5. I made a convenient table of these values in bbmap/resources/minorAlleleCount.txt so that I don't have to recalculate them. That table is based on simulation, because using combinatorics is tricky when numbers overflow doubles... if someone wants to post code for an exact answer using combinatorics, that would be great too.
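
Here is one possible exact version using combinatorics, computed in log space so that large counts don't overflow doubles. It is my own sketch, not the code behind minorAlleleCount.txt, so its values may not match that table exactly since I don't know precisely how the simulation there was set up:

// Exact expected minor-allele count for n fair coin flips, computed in log space.
// My own sketch, not the code behind bbmap/resources/minorAlleleCount.txt.
public class MinorAlleleExpectation {

    // E[min(X, n-X)] where X ~ Binomial(n, 0.5).
    static double expectedMinorCount(int n) {
        double logP = -n * Math.log(2.0); // log P(X=0) = log(0.5^n)
        double e = 0;
        for (int k = 0; k <= n; k++) {
            e += Math.min(k, n - k) * Math.exp(logP);
            // Recurrence: P(X=k+1) = P(X=k) * (n-k) / (k+1)
            if (k < n) logP += Math.log((n - k) / (double) (k + 1));
        }
        return e;
    }

    public static void main(String[] args) {
        // Print a small table in the spirit of minorAlleleCount.txt
        for (int n = 1; n <= 20; n++) {
            System.out.printf("%d\t%.6f%n", n, expectedMinorCount(n));
        }
    }
}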

Next we make a sum of the actual and expected minor allele counts. In this case we have 2 kmers, each with a total allele count of 10, and minor allele counts of 0 and 5, so the actual sum is 5+0=5 and the expected sum is 3.269368+3.269368=6.538736. At the end, we compare these two numbers, and get a result of:

strandedness = 0.5 + (1 - (minSum/expectedMinSum)) * 0.5
             = 0.5 + (1 - (5/6.538736)) * 0.5 = 0.617663

...so the data is ~62% stranded (except that's not accurate because we only used 2 kmer counts).
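
And here is a toy version of the whole calculation, reproducing the worked example above (again my own sketch - the real program walks the two sorted hashcode arrays to find the matched pairs, and looks up the expected minor counts in its precomputed table):

// Toy version of the final strandedness calculation, reproducing the worked example.
// My own sketch, not the tool's code.
public class StrandednessDemo {
    public static void main(String[] args) {
        // Each pair: {count in canonical sketch, count in forward sketch}.
        int[][] pairs = {{10, 5}, {10, 0}};
        double expectedMinorFor10 = 3.269368; // value used in the post for a total count of 10
        double minSum = 0, expectedMinSum = 0;
        for (int[] p : pairs) {
            int total = p[0], fwd = p[1], rev = total - fwd;
            minSum += Math.min(fwd, rev);          // actual minor count for this kmer
            expectedMinSum += expectedMinorFor10;  // expected minor count for this total
        }
        double strandedness = 0.5 + (1 - minSum / expectedMinSum) * 0.5;
        System.out.printf("strandedness = %.6f%n", strandedness); // prints 0.617663
    }
}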

On average we would expect the minSum to equal the expectedMinSum, so you end up with ~50% strandedness. In some rare or contrived cases the minSum can be greater than the expectedMinSum, so the value can go below 50%, but I use a different formula for that case. That corresponds to unlikely scenarios where you flip a coin 10 times and get a minor count of 5 every single time - some kind of loaded dice that strictly alternate between heads and tails. You can get that result by running "reformat.sh in=reads.fq out=rcomp.fq rcomp; cat reads.fq rcomp.fq > both.fq", which ensures exactly the same number of forward and reverse kmers and thus would be reported as 0% stranded. But a normal library will range between 50% and 100%.

So that's the program in a nutshell! But we just know the percent strandedness, not the dominant strand, which is why I added more tests. Also, I thought it would be neat to have a single tool which gleans absolutely all possible data regarding strandedness from a set of reads.

Block 2: Stop Codon Analysis

Here, you have 75bp reads, which are very short for this purpose. But theoretically, reads will tend to have a higher concentration of stop codons on the antisense strand than the sense strand, if you calculate it for all 3 frames on each strand and pick the longest. It depends on the organism, because the mRNAs might start some distance before the start codon or end after the stop codon, might even have another gene on the opposite strand, or could be a non-coding rRNA - so it's just a trend. But for metagenome reads I find it works OK with 150bp reads, and even better with merging turned on to make them longer; in fact, for all of these analyses, it's best to use R1 and R2 and add the "merge" flag. If the reads overlap, the merged read will be used; otherwise, only R1 will be used.
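
As a rough illustration of the idea (not checkstrand's actual implementation; the stop-codon scan and toy read are just for demonstration):

// Rough illustration of the stop-codon idea: for each strand, scan all 3 frames,
// count stop codons and track the longest stop-free stretch, then keep the frame
// with the longest open stretch. Not checkstrand's actual implementation.
public class StopCodonDemo {

    static String revComp(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length() - 1; i >= 0; i--) {
            char c = s.charAt(i);
            sb.append(c == 'A' ? 'T' : c == 'T' ? 'A' : c == 'C' ? 'G' : c == 'G' ? 'C' : 'N');
        }
        return sb.toString();
    }

    static boolean isStop(String codon) {
        return codon.equals("TAA") || codon.equals("TAG") || codon.equals("TGA");
    }

    // Returns {longest stop-free stretch in bases, stop count} for the best of the 3 frames.
    static int[] bestFrame(String seq) {
        int bestLen = 0, bestStops = 0;
        for (int frame = 0; frame < 3; frame++) {
            int stops = 0, run = 0, longest = 0;
            for (int i = frame; i + 3 <= seq.length(); i += 3) {
                if (isStop(seq.substring(i, i + 3))) { stops++; run = 0; }
                else { run += 3; longest = Math.max(longest, run); }
            }
            if (longest > bestLen) { bestLen = longest; bestStops = stops; }
        }
        return new int[]{bestLen, bestStops};
    }

    public static void main(String[] args) {
        String read = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"; // toy read
        int[] plus = bestFrame(read), minus = bestFrame(revComp(read));
        System.out.println("plus:  longestORF=" + plus[0] + " stops=" + plus[1]);
        System.out.println("minus: longestORF=" + minus[0] + " stops=" + minus[1]);
    }
}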

In this case, the average ORF lengths of the longest frames on the plus and minus strands are similar, and so are the average stop counts. Also, the average stop count is greater on minus while the ORF length is also greater on minus, which is contradictory, so this is not very informative (though filtering out rRNA might help in this analysis; I might add that capability). So even though the result is "minus", I'd ignore it, since the numbers are so close and contradictory, plus the reads are so short. GC is mentioned here because it affects stop codon density; specifically, high-GC organisms won't have as many random stop codons in noncoding areas, which reduces the accuracy.

Block 3: Poly-A Tail Analysis

IF you did some kind of poly-A pulldown, AND you didn't poly-A trim the reads, this MIGHT be relevant depending on the organism (and maybe in other cases too). The PolyAFraction is mentioned because when that's low, this block is likely irrelevant (although in the metagenomic data I've looked at, it was accurate even when the number was low; I don't remember how low it was or how our metagenomic data was prepared, but it did match the other metrics for major strand prediction). PolyAFraction is the fraction of reads that have a poly-A tail or a poly-T start. I'm using minPolyA=6 as the default cutoff for this expression:

if(trailingPolyA>leadingPolyT && trailingPolyA>minPolyA) {polyACount++;}
else if(trailingPolyA<leadingPolyT && leadingPolyT>minPolyA) {polyTCount++;}
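
The run-length counting that feeds that expression is simple; a possible version of trailingPolyA and leadingPolyT might look like this (my own sketch, not the tool's code):

// Sketch of the poly-A / poly-T run counting feeding the expression above.
// My own illustration of the idea, not checkstrand's code.
public class PolyADemo {
    static int trailingPolyA(String read) {
        int n = 0;
        for (int i = read.length() - 1; i >= 0 && read.charAt(i) == 'A'; i--) n++;
        return n;
    }
    static int leadingPolyT(String read) {
        int n = 0;
        for (int i = 0; i < read.length() && read.charAt(i) == 'T'; i++) n++;
        return n;
    }
    public static void main(String[] args) {
        int minPolyA = 6, polyACount = 0, polyTCount = 0;
        String[] reads = {"ACGTACGTAAAAAAAA", "TTTTTTTTACGTACGT", "ACGTACGTACGTACGT"};
        for (String read : reads) {
            int a = trailingPolyA(read), t = leadingPolyT(read);
            if (a > t && a > minPolyA) polyACount++;
            else if (a < t && t > minPolyA) polyTCount++;
        }
        System.out.println("PolyA/(PA+PT) = " + (double) polyACount / (polyACount + polyTCount));
    }
}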

Anyway, as I said, it did work for the metagenomic data I tested it on. This only uses R1 (or the merged read) currently, so it might work better if I used R2 to look for the trailing poly-A instead (which would be a poly-T in that case, of course).

Block 4: Read Gene-Calling Analysis

This should probably be ignored for eukaryotic data. It uses BBTools' prokaryotic gene-caller (callgenes.sh) to call genes on the plus and minus sides of the read and see which scores better. How well this works depends on how well the organism matches the model... and on read length, since gene calling is really hard on 75bp reads but works a lot better on merged 2x150bp reads. Also, it works far better (and I mean FAR better) if you supply a reference fasta genome, which it can use (via the passes=2 flag) to refine the gene model before it tries to do calls on short reads. In that case it will also give you additional metrics, because it can do gene-calling on the reference too. Don't bother feeding it a reference unless you are working with a prok or virus, OR you also supply a gff. Even with a gff this section will (probably) be totally wrong for euks (though it might work for the organelles), but some other blocks that pop up will be relevant.
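
For example (file names are placeholders), a prokaryotic run that refines the gene model against a reference before calling genes on the reads would look something like:

checkstrand.sh in=reads.fq ref=prok_genome.fa passes=2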

Brian, I formatted your post a little, hope you don't mind.

Thank you, that looks better!

PART II of the answer continued in this post (Edited by GenoMax)

Note that this forum has a 10k character limit, which is unsettling when it bites unexpectedly since it deletes your post :) I lost 300 characters and I'm not sure what they were... Anyway, here's the rest:

Other Blocks: You get more blocks if you provide a reference fasta, optionally with a gff, or if you provide aligned reads. If you provide a reference you must tell the program whether that is a genome or transcriptome. Then, it can make sketches of the forward and reverse transcriptome (which was either provided directly, or cut out from the reference genome using the supplied gff or via internal gene-calling), and those sketches can be compared to the forward sketch of the reads to see which matches better. That's really the most definitive way to determine which is the dominant strand without alignment. If the reads are aligned, you don't need a reference (though you can still provide one). For transcriptome-aligned reads, you simply count the number aligned to the plus versus minus strand, which is very straightforward and accurate. For genome-aligned reads, the alignments are projected onto the transcriptome either using gene-calling (if the genome is supplied, but no gff) or the gff (if one is supplied). This is less accurate in terms of estimating strandedness precisely, especially in proks with lots of overlapping or nearby genes, but still correct in terms of dominant strand determination.
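
As a minimal illustration of the transcriptome-aligned case (my own sketch, not checkstrand's code; it just counts SAM records by strand flag and ignores pairing):

import java.io.*;

// Minimal sketch of the transcriptome-aligned case: count reads mapped to the plus
// versus minus strand using the SAM FLAG field. Not checkstrand's code; paired-end
// handling (e.g. only counting R1 or the merged read) is ignored here.
public class CountAlignedStrands {
    public static void main(String[] args) throws IOException {
        long plus = 0, minus = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) { // path to a SAM file
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("@")) continue;           // header line
                int flag = Integer.parseInt(line.split("\t")[1]);
                if ((flag & 0x4) != 0) continue;              // unmapped
                if ((flag & 0x10) != 0) minus++; else plus++; // 0x10 = reverse strand
            }
        }
        System.out.printf("P/(P+M)_Ratio: %.6f%n", plus / (double) (plus + minus));
    }
}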

In summary, supplying a reference, gff, and/or aligned reads will give you extra blocks that can help determine strandedness, and some of them are highly accurate. But each of them has caveats or restrictions, which is why there are so many, and some of them may contradict each other. Aligned reads are always useful and highly accurate for both percent strandedness and dominant strand determination, but the point of this program is to not need alignment or a reference. A reference is useful if it is a transcriptome, or a genome+gff, or a prok genome, but not a euk genome alone. A gff without a reference is only useful (for proks or euks) if the reads are aligned (note that I've never actually tested this program with a euk gff).

Thank you for providing this explanation. It should help future visitors.

Guess the caveat here is that your testing was with prokaryotic data and you have not tested this with eukaryotic data.

If you provide a reference you must tell the program whether that is a genome or transcriptome.

I provided a transcriptome and did tell the program that it was that. What surprised me was the output looked identical for the two cases I noted above.

Using paired-end data for the same sample resulted in

Loading genes from transcriptome: 3.784 seconds.
Processing 189154 genes.
Finished processing 130402414 reads and 9780181050 bases.

Depth_Analysis:
Strandedness:   94.43%
AvgKmerDepth:   3.998
Kmers>Depth1:   0.4984

Stop_Codon_Analysis:
MajorStrandORF: Minus
AvgReadLen:     75.00
AvgORFLen+:     70.15
AvgORFLen-:     73.84
AvgStopCount+:  0.2495
AvgStopCount-:  0.2827
GC_Content:     0.4779

PolyA_Analysis:
MajorStrandPA:  Plus
PolyA/(PA+PT):  0.683487
PolyAFraction:  0.001305

Ref_Analysis:
MajorStrandREF: Minus
P/(P+M)_Ratio:  0.091788
GeneCoverage:   0.5487
GenePrecision:  0.0959

Read_Gene_Calling_Analysis:
MajorStrandRGC: Plus
P/(P+M)_Ratio:  0.569681
AvgScorePlus:   52.336927
AvgScoreMinus:  48.993184
UsedFraction:   0.000315

Providing the transcriptome doesn't change any of the other blocks, but it added the Ref Analysis block which indicates that the forward sketch of the reads shares a majority of kmers with the minus strand; specifically, where P is the count of transcriptome plus-strand kmers found in the forward sketch and M is the count of transcriptome minus-strand kmers found in the forward sketch, the ratio

P/(P+M)=0.091788

...indicating that the minus strand is dominant, and by that estimate the strandedness is roughly 1-0.091788, or about 91%. That's similar to the initial depth analysis result of 94%. It's hard to say which is more accurate (I'd guess the 94% number) because it depends on the quality of the annotation used to derive the transcriptome, as well as some other factors like repeat content (and especially palindromic repeats). But assuming the transcriptome is accurate, we can trust the conclusive result from Ref Analysis that this is minus-strand data and ignore the results from the other blocks. In this section, GeneCoverage is an estimate of the fraction of transcriptome kmers present in the reads, while GenePrecision is the fraction of read kmers present in the transcriptome. GenePrecision looks low here, which could be caused by:

1) Errors in the reads
2) Heterozygosity
3) Genomic differences between the sample and reference
4) Incomplete annotation

...but that doesn't really affect the result. If it was super-low like .01 that might mean you were using the wrong transcriptome or that it was not actually RNA-seq data.

Note that in the Read Gene Calling Analysis, UsedFraction=0.000315 indicates that genes were successfully called in well under 1% of reads, so that result isn't trustworthy anyway. As for PolyAFraction, that's very low too, but I'm not really sure what it should be in a poly-A pulldown library.

For reference, this is what I get on a stranded Clostridium library:

Depth_Analysis:
Strandedness:   99.81%
AvgKmerDepth:   39.478
Kmers>Depth1:   0.3904

Stop_Codon_Analysis:
MajorStrandORF: Minus
AvgReadLen:     146.90
AvgORFLen+:     108.48
AvgORFLen-:     136.06
AvgStopCount+:  1.1185
AvgStopCount-:  0.4324
GC_Content:     0.3439

PolyA_Analysis:
MajorStrandPA:  Minus
PolyA/(PA+PT):  0.235013
PolyAFraction:  0.000198

Read_Gene_Calling_Analysis:
MajorStrandRGC: Minus
P/(P+M)_Ratio:  0.028127
AvgScorePlus:   46.017270
AvgScoreMinus:  59.093323
UsedFraction:   0.060805

And one where the strandedness protocol failed:

Depth_Analysis:
Strandedness:   61.83%
AvgKmerDepth:   18.079
Kmers>Depth1:   0.2047

Stop_Codon_Analysis:
MajorStrandORF: Minus
AvgReadLen:     122.36
AvgORFLen+:     100.90
AvgORFLen-:     106.06
AvgStopCount+:  0.7415
AvgStopCount-:  0.8313
GC_Content:     0.3087

PolyA_Analysis:
MajorStrandPA:  Plus
PolyA/(PA+PT):  0.768882
PolyAFraction:  0.002109

Read_Gene_Calling_Analysis:
MajorStrandRGC: Plus
P/(P+M)_Ratio:  0.723279
AvgScorePlus:   57.304131
AvgScoreMinus:  58.995953
UsedFraction:   0.022351

In this case when I supply the reference from a related Clostridium I get this result from the properly stranded library:

Depth_Analysis:
Strandedness:   99.81%
AvgKmerDepth:   39.478
Kmers>Depth1:   0.3904

Stop_Codon_Analysis:
MajorStrandORF: Minus
AvgReadLen:     146.90
AvgORFLen+:     108.48
AvgORFLen-:     136.06
AvgStopCount+:  1.1185
AvgStopCount-:  0.4324
GC_Content:     0.3439

PolyA_Analysis:
MajorStrandPA:  Minus
PolyA/(PA+PT):  0.235013
PolyAFraction:  0.000198

Ref_Analysis:
MajorStrandREF: Minus
P/(P+M)_Ratio:  0.022226
GeneCoverage:   0.8330
GenePrecision:  0.2483

Read_Gene_Calling_Analysis:
MajorStrandRGC: Minus
P/(P+M)_Ratio:  0.005949
AvgScorePlus:   57.976012
AvgScoreMinus:  67.429156
UsedFraction:   0.471850

Mostly the same, but now the gene calling is much more accurate and you can see different estimates of strandedness from the different methods. Ignoring the poly-A method (I don't even know if it applies to this data) the other methods agree that this is minus-strand data and between 97.6% and 99.4% stranded. Note, though, that simply adding the "merge" flag also makes the gene calling much more accurate, and additionally improves the results of the ORF analysis.

Depth_Analysis:
Strandedness:   99.81%
AvgKmerDepth:   49.844
Kmers>Depth1:   0.4050

Stop_Codon_Analysis:
MajorStrandORF: Minus
AvgReadLen:     176.55
AvgORFLen+:     120.58
AvgORFLen-:     158.25
AvgStopCount+:  1.4825
AvgStopCount-:  0.5747
GC_Content:     0.3429

PolyA_Analysis:
MajorStrandPA:  Minus
PolyA/(PA+PT):  0.227642
PolyAFraction:  0.000190

Read_Gene_Calling_Analysis:
MajorStrandRGC: Minus
P/(P+M)_Ratio:  0.009806
AvgScorePlus:   49.983473
AvgScoreMinus:  65.479292
UsedFraction:   0.157420

PairMergeRate:  0.6498

Also, this is what it looks like when I provide genome-mapped reads AND a gff AND a genome (note that by default it assumes the ref is a genome):

checkstrand.sh in=nxyxx.bam gff=genes.gff ref=ref.fa

Depth_Analysis:
Strandedness:   99.81%
AvgKmerDepth:   40.699
Kmers>Depth1:   0.4032

Stop_Codon_Analysis:
MajorStrandORF: Minus
AvgReadLen:     146.90
AvgORFLen+:     108.47
AvgORFLen-:     136.09
AvgStopCount+:  1.1186
AvgStopCount-:  0.4309
GC_Content:     0.3438

PolyA_Analysis:
MajorStrandPA:  Minus
PolyA/(PA+PT):  0.216514
PolyAFraction:  0.000178

Ref_Analysis:
MajorStrandREF: Minus
P/(P+M)_Ratio:  0.021675
GeneCoverage:   0.8305
GenePrecision:  0.2604

Read_Gene_Calling_Analysis:
MajorStrandRGC: Minus
P/(P+M)_Ratio:  0.009125
AvgScorePlus:   55.736356
AvgScoreMinus:  67.771793
UsedFraction:   0.489669

Alignment_Results:
StrandednessAL: 94.60%
MajorStrandAL:  Minus
P/(P+M)_Ratio:  0.053950
AlignmentRate:  1.000000
Feature-Mapped: 1.000000

Here you can see the alignment results section which comes from projecting the mapped reads using the gff. Aligning to the transcriptome gives you more accurate alignment results because we no longer need to worry about nearby gene coordinates confusing things:

cutgff.sh in=ref.fa gff=genes.gff out=cut.fa
bbmap.sh in=nxyxx_r1.fq.gz out=t-mapped.sam.gz ref=cut.fa
checkstrand.sh in=t-mapped.sam.gz ref=cut.fa transcriptome

yields

Depth_Analysis:
Strandedness:   99.81%
AvgKmerDepth:   39.478
Kmers>Depth1:   0.3904

Stop_Codon_Analysis:
MajorStrandORF: Minus
AvgReadLen:     146.90
AvgORFLen+:     108.48
AvgORFLen-:     136.06
AvgStopCount+:  1.1185
AvgStopCount-:  0.4324
GC_Content:     0.3439

PolyA_Analysis:
MajorStrandPA:  Minus
PolyA/(PA+PT):  0.235013
PolyAFraction:  0.000198

Ref_Analysis:
MajorStrandREF: Minus
P/(P+M)_Ratio:  0.022257
GeneCoverage:   0.8328
GenePrecision:  0.2494

Read_Gene_Calling_Analysis:
MajorStrandRGC: Minus
P/(P+M)_Ratio:  0.028127
AvgScorePlus:   46.017270
AvgScoreMinus:  59.093323
UsedFraction:   0.060805

Alignment_Results:
StrandednessAL: 98.42%
MajorStrandAL:  Minus
P/(P+M)_Ratio:  0.015841
AlignmentRate:  0.833029

...which reports a higher strandedness of 98.42% from alignment, more in line with the estimates from kmer depth.

I aligned a couple of million reads from this sample to the genome with bbmap.sh and then used the aligned file with checkstrand.sh.

Finished processing 2000000 reads and 150000000 bases.

Depth_Analysis:
Strandedness:   96.43%
AvgKmerDepth:   1.585
Kmers>Depth1:   0.1919

Stop_Codon_Analysis:
MajorStrandORF: Minus
AvgReadLen:     75.00
AvgORFLen+:     70.17
AvgORFLen-:     73.86
AvgStopCount+:  0.2485
AvgStopCount-:  0.2818
GC_Content:     0.4786

PolyA_Analysis:
MajorStrandPA:  Plus
PolyA/(PA+PT):  0.665780
PolyAFraction:  0.001251

Read_Gene_Calling_Analysis:
MajorStrandRGC: Plus
P/(P+M)_Ratio:  0.609121
AvgScorePlus:   49.197230
AvgScoreMinus:  50.212829
UsedFraction:   0.000307

Aligned reads don't help unless: 1) They are genome-aligned and you also supply a gff OR 2) They are transcriptome-aligned and you add the "transcriptome" flag.
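
For example, mirroring the commands shown earlier (file names are placeholders):

checkstrand.sh in=genome-mapped.bam gff=genes.gff ref=ref.fa
checkstrand.sh in=transcriptome-mapped.sam.gz ref=transcripts.fa transcriptome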
