Port Cube

These CUDA filters are packaged into DGDecodeNV, which is part of DGDecNV.
Post Reply
User avatar
tormento
DG Approved/Curly Approved/Moose Approved
Posts: 686
Joined: Mon Sep 20, 2010 2:18 pm

Port Cube

Post by tormento »

The topic tells it all.

Would you please port cube to CUDA?

I am starting to play with HDR conversion and it would be so nice to have it.
User avatar
Bullwinkle
Moose Approved
Posts: 295
Joined: Thu Sep 05, 2019 6:37 pm

Port Cube

Post by Bullwinkle »

It's been asked before. Let us finish up the DGDecNV release and then we'll consider it. Rocky is tired, not bored. ;)
User avatar
tormento
DG Approved/Curly Approved/Moose Approved
Posts: 686
Joined: Mon Sep 20, 2010 2:18 pm

Port Cube

Post by tormento »

Bullwinkle wrote:
Mon Jul 04, 2022 6:41 pm
Let us finish up the DGDecNV release and then we'll consider it.
*wink wink*

I badly need it :) Want to try serious PQ to HLG with cube profile.
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

OK, time to consider it. ;)
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

Started work on this. In preparation for CUDAization I made a super stripped down version that eliminates anything but the cpu=0 implementation. Next, the cpu=0 code for GetFrame() can be made into a CUDA kernel.
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

The major complication is that Cube is built on std:: and templates (STL). However, CUDA does not support std:: in kernel code. So what I am doing is translating the GetFrame() code to simple functional form (call it C code) for use in a kernel. The GetFrame() consists logically of:

to_float # convert P16 frame data to float
process # apply the 3D LUT
from_float # convert back to P16

I have got to_float and from_float working without std::. Currently working on process.

I prefer to avoid thrust:: or any nonsense like this:

https://github.com/NVIDIA/libcudacxx
User avatar
Bullwinkle
Moose Approved
Posts: 295
Joined: Thu Sep 05, 2019 6:37 pm

Port Cube

Post by Bullwinkle »

OK, guys, I did some research and found that the Cube algorithm is a lot slower than it could be. So I recommend writing new code with the following changes:

* Perform the conversion to/from RGB in the kernel so it doesn't have to be done in the script.

* Get rid of the to_float/from_float steps and work directly in RGB P16.

* Use tetrahedral interpolation (instead of trilinear). It is faster and more accurate.
DAE avatar
Whackbag
Posts: 2
Joined: Sat Jul 30, 2022 12:51 am

Port Cube

Post by Whackbag »

Would Interpret=prism be most accurate. Rigaya added cube support to Nvenc awhile ago i'm just curious how this would differ, is it just that it would be running hardware accelerated thru avisynth & that's better?...forgive my noobness i honestly don't know.

Been using DGdecNV with DGsharpen in Stax for a bit now & it's great...thank you.
User avatar
Bullwinkle
Moose Approved
Posts: 295
Joined: Thu Sep 05, 2019 6:37 pm

Port Cube

Post by Bullwinkle »

Hello Whackbag and welcome to the forum. You're here and posting so I suppose our registration system is working OK. ;)

Don't know much about Rigaya's NVEncc or how he implements the 3D LUT stuff. But it appears to be tied to encoding with NVEnc. However, there are use cases for needing 3D LUT processing without encoding, or with encoding using an alternate encoder, as well as using Avisynth/Vapoursynth IO.

Anyone with knowledge of how Rigaya is obtaining GPU accelerated 3D LUT processing (if it is indeed accelerated and not CPU) is encouraged to post here about it.

I am aware only of the nVidia performance primitives (NPP) CUDA library as existing code for applying a GPU 3D LUT. It doesn't appear to offer any option for the interpolation method. We've never used the NPP library and prefer to write our own direct kernels. That allows for maximum feature flexibility and performance.

Regarding prism versus tetrahedral, JD Vandenburg and Stefano Andriani wrote a fine paper assessing all the methods and conclude:

"Tetrahedral interpolation outperformed all other interpolations for both SDR and HDR applications."

https://www.researchgate.net/publicatio ... Q_Keywords
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

Bullwinkle, everything you say is correct but I want to start with porting what we have. After that there will be ample time to work on your ideas.

So, after many hours today I have completely ridded the filter of all std::, template, vector, and other abstractions that impeded creation of a simple functional kernel. That was not trivial. Now I am ready to start writing the kernel and testing it. Whee!
User avatar
hydra3333
DG Approved/Moose Approved
Posts: 274
Joined: Wed Oct 06, 2010 3:34 am
Contact:

Port Cube

Post by hydra3333 »

I do like reading about smart people doing smart things. Please keep us posted. :)
User avatar
Curly
Moose Approved
Posts: 188
Joined: Sun Mar 15, 2020 11:05 am

Port Cube

Post by Curly »

You are too kind. I do try to remain modest, despite my brilliance.
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

What would we do without Curly?

Guys, I completed an initial implementation of CUDA-enabled DGCube. It beats straight Cube (AVX512) by 28% with prefetch(0), but with prefetch(6) straight Cube wins by 30% on my 11700K. That is with a BlankClip() source.

CUDA could make up for that if there were not the large times for PCIe transfers to/from the GPU. Sadly, they cannot be eliminated.

So, nothing to gain here from CUDA, it appears. Still, Bullwinkle's ideas can be applied to straight Cube as well. ;)

@tormento

I tried!

Marking resolved as no further work will be done on this.
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

Just did an experiment to see the proportion of time between kernel execution and PCIe transfer. The kernel execution accounted for 0.4 seconds for 1000 frames at UHD resolution. The PCIe transfers accounted for 12.6 seconds. It's very sad because Cube using AVX512 and prefetch(6) requires 6 seconds, so the algorithm itself runs much faster in a CUDA kernel. We need a magic wand to get the data to/from the GPU.

Rest in peace CUDA for high-resolution image processing.
User avatar
thechaoscoder
Moose Approved
Posts: 34
Joined: Tue Jul 14, 2020 8:34 am

Port Cube

Post by thechaoscoder »

Maybe it will be twice as fast with PCIe 5.0 :scratch:

Does Resizable BAR have any effect on the speed?
User avatar
hydra3333
DG Approved/Moose Approved
Posts: 274
Joined: Wed Oct 06, 2010 3:34 am
Contact:

Port Cube

Post by hydra3333 »

Thanks ! :bravo:

I don't have to like the answer, but I do have to like new knowledge that cost/benefit/effectiveness has been proven beyond doubt and naming a good reason why.

One could assume that the same or a similar constraint applies to other gpu manufacturers' kits.

A bit of a pity.
At least I feel comfortable not having to save ginormous $ for a new card :D
User avatar
Bullwinkle
Moose Approved
Posts: 295
Joined: Thu Sep 05, 2019 6:37 pm

Port Cube

Post by Bullwinkle »

thechaoscoder wrote:
Tue Aug 02, 2022 2:19 am
Maybe it will be twice as fast with PCIe 5.0 :scratch:
Perhaps, but all that expensive HW would put it only on a par with Cube. :cry:
Does Resizable BAR have any effect on the speed?
If I knew what that was I could answer. Off to do some research.

EDIT: Gain from resizable BAR is of the order of only 10%.

Yes, hydra3333, all GPUs are constrained by the PCIe bus bandwidth. Well, except for the iGPU's, which may be worth exploring for accelerated image processing.
User avatar
tormento
DG Approved/Curly Approved/Moose Approved
Posts: 686
Joined: Mon Sep 20, 2010 2:18 pm

Port Cube

Post by tormento »

Rocky wrote:
Mon Aug 01, 2022 3:24 pm
I tried!
You are considering a modern CPU as reference for your benchmarks.

I am running on a museum ready i7-2600k... do you think it could keep up against CUDA? :D

Moreover, did you try to x265 encode at the same time, instead of just measuring raw cube mapping performance? I think that is the field where you could find the highest advancements and that's a real world scenario.

If you have any code to test, I'd like to. Any speed enhancement would be great.

P.S: DTL on doom9 is porting MVTools to DirectX12 and he was having the same overhead on bus. I think he has found some interesting workarounds as his work is proceding almost smoothly.
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

Happy to let you have it. Will upload after I leave the nest this morning.

Can you tell me about DTL's workarounds? Any links?
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

User avatar
tormento
DG Approved/Curly Approved/Moose Approved
Posts: 686
Joined: Mon Sep 20, 2010 2:18 pm

Port Cube

Post by tormento »

Rocky wrote:
Sun Aug 07, 2022 8:53 am
Can you tell me about DTL's workarounds? Any links?
https://forum.doom9.org/showthread.php?t=183517

It's a long thread, where he seemed to go thru some of your issues.
Rocky wrote:
Sun Aug 07, 2022 9:55 am
Here's DGCube
Great! Does it follow the same syntax of AVSCube?
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

Should be same syntax, yes.

Thank you for the link.

EDIT: Latest version is invoked as DGCube() and adds tetrahedral interpolation (default).
User avatar
tormento
DG Approved/Curly Approved/Moose Approved
Posts: 686
Joined: Mon Sep 20, 2010 2:18 pm

Port Cube

Post by tormento »

Rocky wrote:
Sun Aug 07, 2022 11:15 am
Thank you for the link.
Thanks for your efforts.

It would be nice to have some of your results with cube (avs and cuda) + mvtools + x265 --preset slow, to simulate real world sets. I don't know with your processor but some offload of a very intensive task can be really useful.
User avatar
Rocky
Moose Approved
Posts: 2421
Joined: Fri Sep 06, 2019 12:57 pm

Port Cube

Post by Rocky »

Sure. Lemme get the 4:4:4 stuff finished first.
User avatar
tormento
DG Approved/Curly Approved/Moose Approved
Posts: 686
Joined: Mon Sep 20, 2010 2:18 pm

Port Cube

Post by tormento »

Rocky wrote:
Sun Aug 07, 2022 2:17 pm
Sure. Lemme get the 4:4:4 stuff finished first.
I tried DGCube and I get access violation error. Same script works fine with AVSCube.

AVSCube:

LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
LoadPlugin("D:\Eseguibili\Media\AVSCube\VSCube.dll")
DGSource("F:\In\2_0446 Akira TAiCHi\taichi.dgi",ct=48,cb=48,cl=0,cr=0, rw=1920, rh=1032)
#From 4:2:2 16bit planar Narrow Range to RGB Planar 16bit Full Range
z_ConvertFormat(pixel_type="RGBP16", colorspace_op="2020:st2084:2020:limited=>rgb:st2084:2020:full", resample_filter_uv="spline64", dither_type="error_diffusion")
#From PQ to HLG with 16bit precision
Cube("D:\Programmi\Media\AviSynth+\cube\1a_PQ1000_HLG_mode-nar_in-nar_out-nar_nocomp.cube", fullrange=true)
#From RGB 16bit planar Full Range to YUV422 10bit planar Narrow Range with dithering
z_ConvertFormat(pixel_type="YUV422P10", colorspace_op="rgb:std-b67:2020:full=>2020:std-b67:2020:limited", resample_filter_uv="spline64", dither_type="error_diffusion")

DGCube:

LoadPlugin("D:\Eseguibili\Media\DGDecNV\DGDecodeNV.dll")
LoadPlugin("D:\Eseguibili\Media\DGCube.dll")
DGSource("F:\In\2_0446 Akira TAiCHi\taichi.dgi",ct=48,cb=48,cl=0,cr=0, rw=1920, rh=1032)
#From 4:2:2 16bit planar Narrow Range to RGB Planar 16bit Full Range
z_ConvertFormat(pixel_type="RGBP16", colorspace_op="2020:st2084:2020:limited=>rgb:st2084:2020:full", resample_filter_uv="spline64", dither_type="error_diffusion")
#From PQ to HLG with 16bit precision
Cube("D:\Programmi\Media\AviSynth+\cube\1a_PQ1000_HLG_mode-nar_in-nar_out-nar_nocomp.cube", fullrange=true)
#From RGB 16bit planar Full Range to YUV422 10bit planar Narrow Range with dithering
z_ConvertFormat(pixel_type="YUV422P10", colorspace_op="rgb:std-b67:2020:full=>2020:std-b67:2020:limited", resample_filter_uv="spline64", dither_type="error_diffusion")

AVSMeter 3.0.8.0 (x64), (c) Groucho2004, 2012-2021
AviSynth+ 3.7.2 (r3682, 3.7, x86_64) (3.7.2.0)

Exception 0xC0000005 [STATUS_ACCESS_VIOLATION]
Module: D:\Eseguibili\Media\DGCube.dll
Address: 0x00007FFA9AC51651

Same error in VirtualDub too.
Post Reply