Page 1 of 12

Port Cube

Posted: Mon Jul 04, 2022 2:58 pm
by Guest 2
The topic tells it all.

Would you please port cube to CUDA?

I am starting to play with HDR conversion and it would be so nice to have it.

Port Cube

Posted: Mon Jul 04, 2022 6:41 pm
by Bullwinkle
It's been asked before. Let us finish up the DGDecNV release and then we'll consider it. Rocky is tired, not bored. ;)

Port Cube

Posted: Sat Jul 23, 2022 3:07 am
by Guest 2
Bullwinkle wrote:
Mon Jul 04, 2022 6:41 pm
Let us finish up the DGDecNV release and then we'll consider it.
*wink wink*

I badly need it :) Want to try serious PQ to HLG with cube profile.

Port Cube

Posted: Sat Jul 23, 2022 9:20 am
by Rocky
OK, time to consider it. ;)

Port Cube

Posted: Thu Jul 28, 2022 6:21 pm
by Rocky
Started work on this. In preparation for CUDAization I made a super stripped down version that eliminates anything but the cpu=0 implementation. Next, the cpu=0 code for GetFrame() can be made into a CUDA kernel.

Port Cube

Posted: Fri Jul 29, 2022 10:40 am
by Rocky
The major complication is that Cube is built on std:: and templates (STL). However, CUDA does not support std:: in kernel code. So what I am doing is translating the GetFrame() code to simple functional form (call it C code) for use in a kernel. The GetFrame() consists logically of:

to_float # convert P16 frame data to float
process # apply the 3D LUT
from_float # convert back to P16

I have got to_float and from_float working without std::. Currently working on process.

I prefer to avoid thrust:: or any nonsense like this:

https://github.com/NVIDIA/libcudacxx

Port Cube

Posted: Fri Jul 29, 2022 3:39 pm
by Bullwinkle
OK, guys, I did some research and found that the Cube algorithm is a lot slower than it could be. So I recommend writing new code with the following changes:

* Perform the conversion to/from RGB in the kernel so it doesn't have to be done in the script.

* Get rid of the to_float/from_float steps and work directly in RGB P16.

* Use tetrahedral interpolation (instead of trilinear). It is faster and more accurate.

Port Cube

Posted: Sat Jul 30, 2022 1:13 am
by Whackbag
Would Interpret=prism be most accurate. Rigaya added cube support to Nvenc awhile ago i'm just curious how this would differ, is it just that it would be running hardware accelerated thru avisynth & that's better?...forgive my noobness i honestly don't know.

Been using DGdecNV with DGsharpen in Stax for a bit now & it's great...thank you.

Port Cube

Posted: Sat Jul 30, 2022 6:16 am
by Bullwinkle
Hello Whackbag and welcome to the forum. You're here and posting so I suppose our registration system is working OK. ;)

Don't know much about Rigaya's NVEncc or how he implements the 3D LUT stuff. But it appears to be tied to encoding with NVEnc. However, there are use cases for needing 3D LUT processing without encoding, or with encoding using an alternate encoder, as well as using Avisynth/Vapoursynth IO.

Anyone with knowledge of how Rigaya is obtaining GPU accelerated 3D LUT processing (if it is indeed accelerated and not CPU) is encouraged to post here about it.

I am aware only of the nVidia performance primitives (NPP) CUDA library as existing code for applying a GPU 3D LUT. It doesn't appear to offer any option for the interpolation method. We've never used the NPP library and prefer to write our own direct kernels. That allows for maximum feature flexibility and performance.

Regarding prism versus tetrahedral, JD Vandenburg and Stefano Andriani wrote a fine paper assessing all the methods and conclude:

"Tetrahedral interpolation outperformed all other interpolations for both SDR and HDR applications."

https://www.researchgate.net/publicatio ... Q_Keywords

Port Cube

Posted: Sat Jul 30, 2022 7:58 pm
by Rocky
Bullwinkle, everything you say is correct but I want to start with porting what we have. After that there will be ample time to work on your ideas.

So, after many hours today I have completely ridded the filter of all std::, template, vector, and other abstractions that impeded creation of a simple functional kernel. That was not trivial. Now I am ready to start writing the kernel and testing it. Whee!

Port Cube

Posted: Sun Jul 31, 2022 2:02 am
by hydra3333
I do like reading about smart people doing smart things. Please keep us posted. :)

Port Cube

Posted: Sun Jul 31, 2022 9:28 am
by Curly
You are too kind. I do try to remain modest, despite my brilliance.

Port Cube

Posted: Mon Aug 01, 2022 3:24 pm
by Rocky
What would we do without Curly?

Guys, I completed an initial implementation of CUDA-enabled DGCube. It beats straight Cube (AVX512) by 28% with prefetch(0), but with prefetch(6) straight Cube wins by 30% on my 11700K. That is with a BlankClip() source.

CUDA could make up for that if there were not the large times for PCIe transfers to/from the GPU. Sadly, they cannot be eliminated.

So, nothing to gain here from CUDA, it appears. Still, Bullwinkle's ideas can be applied to straight Cube as well. ;)

@Guest 2

I tried!

Marking resolved as no further work will be done on this.

Port Cube

Posted: Mon Aug 01, 2022 8:40 pm
by Rocky
Just did an experiment to see the proportion of time between kernel execution and PCIe transfer. The kernel execution accounted for 0.4 seconds for 1000 frames at UHD resolution. The PCIe transfers accounted for 12.6 seconds. It's very sad because Cube using AVX512 and prefetch(6) requires 6 seconds, so the algorithm itself runs much faster in a CUDA kernel. We need a magic wand to get the data to/from the GPU.

Rest in peace CUDA for high-resolution image processing.

Port Cube

Posted: Tue Aug 02, 2022 2:19 am
by thechaoscoder
Maybe it will be twice as fast with PCIe 5.0 :scratch:

Does Resizable BAR have any effect on the speed?

Port Cube

Posted: Tue Aug 02, 2022 3:47 am
by hydra3333
Thanks ! :bravo:

I don't have to like the answer, but I do have to like new knowledge that cost/benefit/effectiveness has been proven beyond doubt and naming a good reason why.

One could assume that the same or a similar constraint applies to other gpu manufacturers' kits.

A bit of a pity.
At least I feel comfortable not having to save ginormous $ for a new card :D

Port Cube

Posted: Wed Aug 03, 2022 5:56 am
by Bullwinkle
thechaoscoder wrote:
Tue Aug 02, 2022 2:19 am
Maybe it will be twice as fast with PCIe 5.0 :scratch:
Perhaps, but all that expensive HW would put it only on a par with Cube. :cry:
Does Resizable BAR have any effect on the speed?
If I knew what that was I could answer. Off to do some research.

EDIT: Gain from resizable BAR is of the order of only 10%.

Yes, hydra3333, all GPUs are constrained by the PCIe bus bandwidth. Well, except for the iGPU's, which may be worth exploring for accelerated image processing.

Port Cube

Posted: Sun Aug 07, 2022 3:47 am
by Guest 2
Rocky wrote:
Mon Aug 01, 2022 3:24 pm
I tried!
You are considering a modern CPU as reference for your benchmarks.

I am running on a museum ready i7-2600k... do you think it could keep up against CUDA? :D

Moreover, did you try to x265 encode at the same time, instead of just measuring raw cube mapping performance? I think that is the field where you could find the highest advancements and that's a real world scenario.

If you have any code to test, I'd like to. Any speed enhancement would be great.

P.S: DTL on doom9 is porting MVTools to DirectX12 and he was having the same overhead on bus. I think he has found some interesting workarounds as his work is proceding almost smoothly.

Port Cube

Posted: Sun Aug 07, 2022 8:53 am
by Rocky
Happy to let you have it. Will upload after I leave the nest this morning.

Can you tell me about DTL's workarounds? Any links?

Port Cube

Posted: Sun Aug 07, 2022 9:55 am
by Rocky

Port Cube

Posted: Sun Aug 07, 2022 10:05 am
by Guest 2
Rocky wrote:
Sun Aug 07, 2022 8:53 am
Can you tell me about DTL's workarounds? Any links?
https://forum.doom9.org/showthread.php?t=183517

It's a long thread, where he seemed to go thru some of your issues.
Rocky wrote:
Sun Aug 07, 2022 9:55 am
Here's DGCube
Great! Does it follow the same syntax of AVSCube?

Port Cube

Posted: Sun Aug 07, 2022 11:15 am
by Rocky
Should be same syntax, yes.

Thank you for the link.

EDIT: Latest version is invoked as DGCube() and adds tetrahedral interpolation (default).

Port Cube

Posted: Sun Aug 07, 2022 11:52 am
by Guest 2
Rocky wrote:
Sun Aug 07, 2022 11:15 am
Thank you for the link.
Thanks for your efforts.

It would be nice to have some of your results with cube (avs and cuda) + mvtools + x265 --preset slow, to simulate real world sets. I don't know with your processor but some offload of a very intensive task can be really useful.

Port Cube

Posted: Sun Aug 07, 2022 2:17 pm
by Rocky
Sure. Lemme get the 4:4:4 stuff finished first.

Port Cube

Posted: Mon Aug 08, 2022 3:25 am
by Guest 2
Rocky wrote:
Sun Aug 07, 2022 2:17 pm
Sure. Lemme get the 4:4:4 stuff finished first.
I tried DGCube and I get access violation error. Same script works fine with AVSCube.

AVSCube:

LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
LoadPlugin("D:\Eseguibili\Media\AVSCube\VSCube.dll")
DGSource("F:\In\2_0446 Akira TAiCHi\taichi.dgi",ct=48,cb=48,cl=0,cr=0, rw=1920, rh=1032)
#From 4:2:2 16bit planar Narrow Range to RGB Planar 16bit Full Range
z_ConvertFormat(pixel_type="RGBP16", colorspace_op="2020:st2084:2020:limited=>rgb:st2084:2020:full", resample_filter_uv="spline64", dither_type="error_diffusion")
#From PQ to HLG with 16bit precision
Cube("D:\Programmi\Media\AviSynth+\cube\1a_PQ1000_HLG_mode-nar_in-nar_out-nar_nocomp.cube", fullrange=true)
#From RGB 16bit planar Full Range to YUV422 10bit planar Narrow Range with dithering
z_ConvertFormat(pixel_type="YUV422P10", colorspace_op="rgb:std-b67:2020:full=>2020:std-b67:2020:limited", resample_filter_uv="spline64", dither_type="error_diffusion")

DGCube:

LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
LoadPlugin("D:\Eseguibili\Media\DGCube.dll")
DGSource("F:\In\2_0446 Akira TAiCHi\taichi.dgi",ct=48,cb=48,cl=0,cr=0, rw=1920, rh=1032)
#From 4:2:2 16bit planar Narrow Range to RGB Planar 16bit Full Range
z_ConvertFormat(pixel_type="RGBP16", colorspace_op="2020:st2084:2020:limited=>rgb:st2084:2020:full", resample_filter_uv="spline64", dither_type="error_diffusion")
#From PQ to HLG with 16bit precision
Cube("D:\Programmi\Media\AviSynth+\cube\1a_PQ1000_HLG_mode-nar_in-nar_out-nar_nocomp.cube", fullrange=true)
#From RGB 16bit planar Full Range to YUV422 10bit planar Narrow Range with dithering
z_ConvertFormat(pixel_type="YUV422P10", colorspace_op="rgb:std-b67:2020:full=>2020:std-b67:2020:limited", resample_filter_uv="spline64", dither_type="error_diffusion")

AVSMeter 3.0.8.0 (x64), (c) Groucho2004, 2012-2021
AviSynth+ 3.7.2 (r3682, 3.7, x86_64) (3.7.2.0)

Exception 0xC0000005 [STATUS_ACCESS_VIOLATION]
Module: D:\Eseguibili\Media\DGCube.dll
Address: 0x00007FFA9AC51651

Same error in VirtualDub too.