Not wishing to intrude and not understanding what's going on around me,
I have watched this a few dozen times over the last year or so and admired the song.
Great one, hydra3333. Awesome tribute to Vincent, the tortured soul.
OK, so the z code returns values of Kr and Kb that I can inspect. Kg is 1 - (Kr + Kb). Then it all goes into AVX512 etc. And the stupid STL and object-oriented filter graph crapola makes it impenetrable. No comments in the code either. Not trying to ding z. I'm sure it's just great for genius-level humans, but I'm a primitive rodent and so use only procedural code without all the crapola. Look at the thdmerge source to see the kind of code I like. However, there is a standard way to convert Kr/Kb/Kg to actual matrix equations. I'll do that and see how the results compare vis-à-vis my equations.
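For anyone following along, here is a minimal procedural sketch of that standard conversion, assuming normalized full-range values (R, G, B and Y in 0..1, Cb and Cr in -0.5..0.5) with no limited-range scaling or offsets. The function name `rgb_to_ycbcr_matrix` is mine, not anything from z, and the BT.709 coefficients in the demo are just an example input. It uses Y = Kr*R + Kg*G + Kb*B, Cb = (B - Y) / (2*(1 - Kb)), Cr = (R - Y) / (2*(1 - Kr)).

```python
def rgb_to_ycbcr_matrix(kr, kb):
    """Build the 3x3 RGB -> YCbCr matrix from the luma coefficients Kr and Kb.

    Assumes normalized full-range signals; range scaling is not applied here.
    """
    kg = 1.0 - kr - kb              # the three luma weights sum to 1
    cb_scale = 2.0 * (1.0 - kb)     # normalizes (B - Y) to -0.5..0.5
    cr_scale = 2.0 * (1.0 - kr)     # normalizes (R - Y) to -0.5..0.5
    return [
        [kr, kg, kb],                                              # Y row
        [-kr / cb_scale, -kg / cb_scale, (1.0 - kb) / cb_scale],   # Cb row
        [(1.0 - kr) / cr_scale, -kg / cr_scale, -kb / cr_scale],   # Cr row
    ]


if __name__ == "__main__":
    # Example input only: the BT.709 luma coefficients.
    for row in rgb_to_ycbcr_matrix(0.2126, 0.0722):
        print(["%.6f" % v for v in row])
```

A quick sanity check on any result: the Y row should sum to 1 and the Cb and Cr rows should each sum to 0, so a neutral gray comes out with zero chroma.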
@tormento
What is the claimed speedup for the 'fast' mode you described? I'm just not grokking how multithreading can make a significant difference. The bottleneck is simply the amount of data versus the available PCIe bandwidth. Kernel launch, etc., is insignificant compared to that. I keep saying this but nobody gives up the magic!
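To put my bandwidth argument in numbers, here is a rough back-of-envelope sketch. Everything in it is an assumption for illustration: the function names are made up, and the frame size, bytes per pixel, and usable PCIe bandwidth are placeholder inputs, not measurements from anyone's setup.

```python
def transfer_seconds(width, height, bytes_per_pixel, bandwidth_bytes_per_s):
    """Time to move one frame across the bus at a given sustained bandwidth."""
    return (width * height * bytes_per_pixel) / bandwidth_bytes_per_s


def max_fps_round_trip(width, height, bytes_per_pixel, bandwidth_bytes_per_s):
    """Upper bound on frames per second if every frame is uploaded and then
    downloaded over the same link, one transfer after the other."""
    one_way = transfer_seconds(width, height, bytes_per_pixel, bandwidth_bytes_per_s)
    return 1.0 / (2.0 * one_way)


if __name__ == "__main__":
    # Hypothetical inputs: a 3840x2160 frame at 1.5 bytes per pixel
    # (8-bit 4:2:0) and an assumed 16e9 bytes/s of usable bandwidth.
    w, h, bpp, bw = 3840, 2160, 1.5, 16e9
    print("one-way transfer (s):", transfer_seconds(w, h, bpp, bw))
    print("round-trip fps ceiling:", max_fps_round_trip(w, h, bpp, bw))
```

Plug in your own frame format and measured bandwidth. This only bounds the transfer side and ignores any overlap of transfers with compute.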