CUDASynth
Looking at the available frameworks for CUDA/NVDec with Avisynth/Vapoursynth, we do not see true GPU pipelines that eliminate unnecessary PCIe frame transfers and copies into Avisynth/Vapoursynth buffers. For example, we may see a set of internal filters ported to CUDA, but the spurious PCIe transfers and copies between intermediate filters are not eliminated. A solution is not so easy, because CUDA imposes constraints that make a true GPU pipeline of filters challenging to implement. Importantly, we would prefer to avoid having to modify Avisynth/Vapoursynth itself.
The basic CUDASynth idea is to have two GPU frame buffers accessible to every filter: one holds the input and one takes the output. For the last filter in the chain, the output destination is the CPU. One floating CUDA context is used globally, so that all the filters in the chain (script) can access the frame buffers. The frame buffers are used in a ping-pong manner.
The source filter creates the global CUDA context and allocates the input/output frame buffers gpu0 and gpu1. The context id and frame buffer pointers must be communicated to the other filters. Currently a file is used but this can be changed to a named pipe, shared memory, etc.
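As a rough illustration of that handoff (not the actual CUDASynth file format, which isn't published), a minimal sketch might have the source filter write a small binary record that the downstream filters read back. The struct layout, file name, and helper names here are assumptions for illustration only:

#include <cuda.h>
#include <cstdio>

// Hypothetical record shared between filters (all running in the same process).
struct CudaSynthHandoff {
    CUcontext ctx;      // floating context created by the source filter
    CUdeviceptr gpu0;   // ping frame buffer
    CUdeviceptr gpu1;   // pong frame buffer
    int pitch;          // device pitch of both buffers
    int height;         // allocated rows (luma plus chroma)
};

// Source filter side: publish the shared state.
static bool write_handoff(const char *path, const CudaSynthHandoff *h)
{
    FILE *f = fopen(path, "wb");
    if (!f) return false;
    bool ok = fwrite(h, sizeof(*h), 1, f) == 1;
    fclose(f);
    return ok;
}

// Downstream filter side: read the shared state back.
static bool read_handoff(const char *path, CudaSynthHandoff *h)
{
    FILE *f = fopen(path, "rb");
    if (!f) return false;
    bool ok = fread(h, sizeof(*h), 1, f) == 1;
    fclose(f);
    return ok;
}

This only works because all the filters live in the same process, so the raw device pointers and the context handle remain valid when read back.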
A CUDASynth script allowing for a pipeline of execution on the GPU might be:
DGSource(..., dst="gpu0")
DGSharpen(..., src="gpu0", dst="gpu1")
DGDenoise(..., src="gpu1", dst="cpu")
That is the design concept. Four full-frame CPU<->GPU copies and two copies into Avisynth/Vapoursynth buffers are eliminated!
Currently, the global context and communication to the other script filters have been implemented and tested. Next will be the declaration of the GPU input/output buffers and testing of the GPU pipeline.
Current version of CUDASynth:
http://rationalqm.us/misc/CUDASynth_0.3.rar
Re: CUDASynth
Sounds like you're making progress.
Re: CUDASynth
Interesting idea
Re: CUDASynth
Thanks guys!
I've implemented the ping-pong buffer declaration by the DGSource filter and tested that the intermediate filters can access the buffers. The next main step is to modify DGSource to convert the NV12 delivered by CUVID to YUV420P8/16 on the GPU, so that the intermediate filters get the right color space. Previously, this conversion was done on the CPU after transferring the decoded source frame back to the CPU, just before returning the frame to Avisynth. This change should improve the performance of DGSource() in both the pipelined and non-pipelined use cases, as the expensive NV12->YUV420 conversion is moved to the GPU. After that, I'll add the src/dst parameters and implement the pipeline. I expect this framework to bring out the true power of CUDA in an Avisynth/Vapoursynth environment.
Re: CUDASynth
DGSource even faster, and the filters as well.
Awesome
Re: CUDASynth
The NV12->YV12 conversion on the GPU was a bit harder than I expected. Nevertheless, I have it working now. It did not produce a significant DGSource performance improvement over my previous code (just a few fps). I can understand that, because the conversion is a small part of the overall frame decode and delivery, and because I was not able (as I had hoped) to avoid a pitched frame copy into the Avisynth frame, since the decoded frame pitch can differ from the Avisynth frame pitch. The main point is that I can now source a YV12 frame down the pipeline on the GPU.
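For reference, the kind of pitched copy mentioned above can be expressed with cuMemcpy2D, which handles a source pitch that differs from the destination pitch. The helper below is only a sketch with illustrative names, not the DGSource code:

#include <cuda.h>

// Copy one plane from a pitched device frame into a (possibly differently
// pitched) host frame, e.g. an Avisynth frame buffer. Sketch only.
static CUresult copy_plane_to_host(CUdeviceptr src, size_t srcPitch,
                                   unsigned char *dst, size_t dstPitch,
                                   size_t widthBytes, size_t rows)
{
    CUDA_MEMCPY2D m = {};
    m.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    m.srcDevice     = src;
    m.srcPitch      = srcPitch;
    m.dstMemoryType = CU_MEMORYTYPE_HOST;
    m.dstHost       = dst;
    m.dstPitch      = dstPitch;
    m.WidthInBytes  = widthBytes;  // visible width in bytes, not the pitch
    m.Height        = rows;        // number of rows in this plane
    return cuMemcpy2D(&m);
}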
The next step is to implement the fsrc/fdst parameters for DGSource and another filter such as DGSharpen, and then try to get the pipeline running. A challenge may be to ensure that the downstream filters use the correct pitch.
Re: CUDASynth
Gosh, I thought it was perfectly clear.
Here's the CUDA kernel code. It should make everything clear for you. Note that this just deinterleaves the chroma. The luma does not have to be touched and is just copied with cuMemcpyDtoD(). Also note that w is actually passed in as the pitch and h is half the frame height. That simplifies the code.
extern "C"
__global__ void NV12toYV12(unsigned char *src, unsigned char *dst, int w, int h)
{
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
unsigned char *u, *v, *s;
u = dst + (h * 2) * w;
v = u + h * w / 2;
s = src + (h * 2) * w;
if (ix < w && iy < h)
{
if (!(ix & 1))
u[w * iy / 2 + ix / 2] = s[w * iy + ix];
else
v[w * iy / 2 + ix / 2] = s[w * iy + ix];
}
}
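And for context, a host-side launch of that kernel with the driver API might look roughly like the sketch below (8-bit case). This is not the actual DGDecodeNV code; the wrapper name, block size, and the assumption that the kernel has already been obtained via cuModuleGetFunction are illustrative. Only the w = pitch and h = half-height convention comes from the description above.

#include <cuda.h>

// Hypothetical host-side wrapper: copy luma, then launch NV12toYV12 for chroma.
static CUresult nv12_to_yv12(CUfunction kNV12toYV12, CUdeviceptr src,
                             CUdeviceptr dst, int pitch, int frameHeight)
{
    // Luma is identical in NV12 and the planar format: device-to-device copy.
    CUresult r = cuMemcpyDtoD(dst, src, (size_t)pitch * frameHeight);
    if (r != CUDA_SUCCESS)
        return r;

    // The kernel expects w = pitch and h = half the frame height.
    int w = pitch, h = frameHeight / 2;
    void *args[] = { &src, &dst, &w, &h };
    unsigned bx = 32, by = 16;
    unsigned gx = (w + bx - 1) / bx;
    unsigned gy = (h + by - 1) / by;
    return cuLaunchKernel(kNV12toYV12, gx, gy, 1, bx, by, 1, 0, 0, args, 0);
}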
Re: CUDASynth
You're welcome.
Working on the sharing of the ping-pong buffers between the filter threads. It should be easy but I have a DGDecodeNV CUDA autolock that I have to make available to the other filters so they can all avoid collisions when pushing/popping the context.
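As a rough illustration (not the actual DGDecodeNV code), such an autolock could be an RAII guard pairing a process-wide mutex with cuCtxPushCurrent/cuCtxPopCurrent, so that only one filter thread at a time works in the shared context. The class name and the use of std::mutex are assumptions:

#include <cuda.h>
#include <mutex>

// Illustrative context autolock: serialize access to the shared floating
// context across filter threads.
class CudaAutoLock {
public:
    CudaAutoLock(std::mutex &m, CUcontext ctx) : mtx_(m)
    {
        mtx_.lock();            // one filter thread at a time
        cuCtxPushCurrent(ctx);  // make the shared context current on this thread
    }
    ~CudaAutoLock()
    {
        CUcontext dummy;
        cuCtxPopCurrent(&dummy);  // detach the context from this thread...
        mtx_.unlock();            // ...before letting the next thread in
    }
private:
    std::mutex &mtx_;
};

A filter would construct one of these for the duration of its GPU work, so the push/pop pairs from different filters cannot interleave.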
Re: CUDASynth
I found the bug that prevented the downstream filters from accessing the GPU ping-pong buffers declared by DGSource(). All the filters can now access the ping-pong buffers using the information passed in the file, which means they do not have to be built together. The file now also passes the CUDA lock that prevents collisions on context usage. It's no longer necessary to pass the context itself, because that is implicit in the lock. So I am now passing the lock and the gpu0/1 references in binary form. It's all working!
I ran my first pipeline DGSource() -> DGSharpen() -> Avisynth+. With just DGSource(fdst="cpu") alone I get 206 fps for 3840x2160 HEVC, so that is the maximum fps that can be expected. Adding DGSharpen() with no pipeline, i.e.:
DGSource(fdst="cpu")
DGSharpen(fsrc="cpu")
I get 127 fps. Adding DGSharpen() pipelined, i.e.:
DGSource(fdst="gpu0")
DGSharpen(fsrc="gpu0")
I get 190 fps. So one way to look at this is that the "price" of DGSharpen() without a pipeline is 206 - 127 = 79 fps. The price of DGSharpen() with a pipeline is 206 - 190 = 16 fps. The price is reduced by a factor of 5. The total savings would increase with the number of filters in the pipeline.
Re: CUDASynth
The price is reduced by a factor of 5
Sales are good, faster is better
Re: CUDASynth
Wow, gonca, listen to this...
I did some optimizations to reduce the sizes of the critical sections for the lock. Then I ran this script:
DGSource(fdst="cpu")
I got 207.1 fps. Note that adding prefetch here does not help and in fact greatly reduces the fps.
Now I ran this script:
DGSource(fdst="gpu0")
DGSharpen(fsrc="gpu0")
prefetch(2)
I got 206.3 fps! This means that thanks to CUDASynth, a limited sharpen filter is being executed on a 3840x2160 frame essentially for free.
This is what I mean by bringing out the true power of NVDec/CUDA for Avisynth/Vapoursynth.
Re: CUDASynth
This brings up the issue of hardware encoding.
Depending on Nvidia's capabilities on the new cards, a two card system could be amazingly fast
One card pre-processing and frame serving, and the second encoding
Re: CUDASynth
For sure. Someone send me another 1080 Ti, or maybe an RTX 2080 Ti. I would accept either one.
Even with one GPU, we can make an encoder filter and put it at the end of the pipeline, taking its input from gpu0/1 and saving the PCIe transfers and copies between the Avisynth output and the encoder. On the list now.
Re: CUDASynth
That is truly impressive.
admin wrote: ↑ Sat Sep 08, 2018 12:40 pm
Now I ran this script:
DGSource(fdst="gpu0")
DGSharpen(fsrc="gpu0")
prefetch(2)
I got 206.3 fps! This means that thanks to CUDASynth, a limited sharpen filter is being executed on a 3840x2160 frame essentially for free.
This is what I mean by bringing out the true power of NVDec/CUDA for Avisynth/Vapoursynth.
And I'd thought I could not get any more excited ...
I really do like it here.
Re: CUDASynth
I finished supporting both fsrc and fdst for DGSharpen so I am able to show the results on a longer pipeline. Here is the script:
dgsource("LG Chess 4K Demo.dgi",fulldepth=false,fdst="gpu0")
DGSharpen(fsrc="gpu0",fdst="gpu1")
DGSharpen(fsrc="gpu1",fdst="gpu0")
DGSharpen(fsrc="gpu0")
prefetch(5)
For the 3840x2160 59.94 fps stream I get 161.6 fps. When I do not use the pipeline (all fsrc and fdst are "cpu"), then I get 76.8 fps. So CUDASynth is twice as fast and makes the difference between real-time and non-real-time operation. There is still a lot of headroom to have more filters while remaining real-time. The average "price" of the sharpens here is about 15 fps each.
Next, after I add P16 support, I'm going to CUDASynth-enable DGHDRtoSDR and see what kind of frame rate we can get for a full HDR to SDR script.
dgsource("LG Chess 4K Demo.dgi",fulldepth=false,fdst="gpu0")
DGSharpen(fsrc="gpu0",fdst="gpu1")
DGSharpen(fsrc="gpu1",fdst="gpu0")
DGSharpen(fsrc="gpu0")
prefetch(5)
For the 3840x2160 59.94 fps stream I get 161.6 fps. When I do not use the pipeline (all fsrc and fdst are "cpu"), then I get 76.8 fps. So CUDASynth is twice as fast and makes the difference between real-time and non-real-time operation. There is still a lot of headroom to have more filters while remaining real-time. The average "price" of the sharpens here is about 15 fps each.
Next, after I add P16 support, I'm going to CUDASynth-enable DGHDRtoSDR and see what kind of frame rate we can get for a full HDR to SDR script.
Re: CUDASynth
what kind of frame rate we can get for a full HDR to SDR script.
This should be good!
Re: CUDASynth
Oy, it took a whole week to get the pipeline running in P16 [DGSource(fulldepth=true)]. My NV12toYV12 kernel was broken for P16, and I had some bad pitch handling in DGSharpen for the case of fsrc=gpu and fdst=gpu. The implementation was tricky because debugging is hard when the intermediate ping-pong buffers are not visible from the host, so their contents can't easily be checked at various points along the pipeline. Writing code to copy them down to the host would have been possible but tricky, so I resorted to hunches and other tricks, such as memsetting the device memory at various points to see if the values showed up at the final output as expected. It was something of a week-long nightmare, because when I go to bed with outstanding bugs in my code my brain works overtime all night. And the bugs were like a layered onion: rack your brains to fix one layer and another gets exposed. After multiple layers in a row stretching over a week, I got pretty zonked out from lack of proper sleep.
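(For the curious, the "copy it down to the host" approach would look something like the sketch below. This particular helper is illustrative only and isn't part of CUDASynth.)

#include <cuda.h>
#include <cstdio>
#include <vector>

// Copy a device buffer to the host and dump it to a raw file so its
// contents can be inspected with a hex editor or raw YUV viewer.
static bool dump_device_buffer(CUdeviceptr dev, size_t bytes, const char *path)
{
    std::vector<unsigned char> host(bytes);
    if (cuMemcpyDtoH(host.data(), dev, bytes) != CUDA_SUCCESS)
        return false;
    FILE *f = fopen(path, "wb");
    if (!f) return false;
    bool ok = fwrite(host.data(), 1, bytes, f) == bytes;
    fclose(f);
    return ok;
}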
But I am a patient and persistent soul and never give up (unless the goal is theoretically impossible), and so everything is working perfectly now for a 4-filter GPU pipeline (DGSource + 3 x DGSharpen) in P8 and P16. I'm going to CUDASynth-enable DGHDRtoSDR now and see how it performs in a DGSource->DGHDRtoSDR pipeline.
At some point I will publish a specification for how to implement a CUDASynth-compatible filter, together with a source code example. Without that, CUDASynth acceleration would be limited to my own filters.
Re: CUDASynth
I guess you never give up then since theoretically impossible is theoretically impossible.
Theoretically improbable, maybe
Theoretically not possible at this time, ok
But with technological improvements and increases in knowledge what is impossible today might be possible tomorrow
Therefore
theoretically impossible is theoretically impossible
and QED
you never give up
Re: CUDASynth
OK.
Re: CUDASynth
I am just happy you never give up
We keep getting new and better avs/vpy tools to use
Thank you
Re: CUDASynth
I have CUDASynth-enabled DGHDRtoSDR and have some preliminary performance numbers for your enjoyment. The source is the same as previously used: 3840x2160 59.94 fps HDR10. The script with GPU pipelining is:
dgsource("LG Chess 4K Demo.dgi",fulldepth=true,fdst="gpu0")
dghdrtosdr(impl="255",light=250,fsrc="gpu0") # outputs YV12
prefetch(4)
Not pipelined on GPU: 80 fps, CPU 13%
Pipelined on GPU: 204 fps, CPU 8%
Quite a substantial performance boost!
dgsource("LG Chess 4K Demo.dgi",fulldepth=true,fdst="gpu0")
dghdrtosdr(impl="255",light=250,fsrc="gpu0") # outputs YV12
prefetch(4)
Not pipelined on GPU: 80 fps, CPU 13%
Pipelined on GPU: 204 fps, CPU 8%
Quite a substantial performance boost!
Re: CUDASynth
Nice boost, 250%
Re: CUDASynth
Any ETA on public testing?
No rush, just getting antsy
Re: CUDASynth
Still have some things to finish up: Vapoursynth support, fdst parameter for DGHDRtoSDR, CUDASynth-enable DGDenoise, documentation, source code example. And I'm timesharing with DGIndex MKV support. Hang in there.
BTW, the CPU reduction is also important as it leaves more CPU for encoding.
It's hard to find 2080 Ti's:
https://www.nowinstock.net/computers/vi ... rtx2080ti/
Puts the lie to some of the whining at other forums by people saying it's too expensive, nobody wants it, nVidia are dirty rotten criminal capitalists, how dare they make a GPU I can't afford, blah blah blah.
gonca, what's the fastest most powerful Threadripper likely to be available within a few months?
I saved a lot of dough doing my own bathroom remodeling so I have the ready green to dish out for the best hardware.