Development Dialog with Nvidia

Following is the dialog between myself and Nvidia that took place during the development of DGAVCDecNV. Communications from me are rendered in italics; communications from Nvidia are rendered in normal font. I want to take this opportunity to thank Nvidia for their highly competent and responsive support. Note that names and salutations have been removed.

Most of the posts from Nvidia were by an Engineer "O". A few were by engineer "E". I have marked E's posts as [Engineer E]. O's posts are unmarked.

I make an open source frame server for AVC video:


Currently it uses libavcodec.dll for decoding. I want to use Nvidia GPU via the CUDA Video Decoder API.

My application does the parsing of NALUs and then pushes them into the decoder and receives a picture. I can also push the bitstream for a whole picture at a time.

My question is whether you think this model can be used with CUDA. The sample you give uses its own parser and reads from a file. I don't want or need all that. I just want to push the bitstream myself and get decoded pictures back. I'll also need to reset the decoder in order to support seeking.

Having popular open source software using your graphics engine would be beneficial for you as well as for the users of the software.

Your advice will be greatly appreciated.

[Engineer E]
I'm not familiar with NALUs. If your own parser can demux the video bitstream, it can be fed to the hardware. We have worked with other developers that are doing just that.

I'm adding [O], who may be able to answer your question on how to support seeking.

Yes, this should be fairly easy to implement, though you probably still want to rely on the CUvideoparser in order to push the elementary stream and have it call you back with data to pass directly to cuvidDecodePicture and cuvidMapVideoFrame (you can feed raw NALUs directly to the parser without using CUvideosource by creating your own CUVIDSOURCEDATAPACKET).

You can flush the decoder by simply setting the EndOfStream flag on a dummy empty packet (set flag to CUVID_PKT_ENDOFSTREAM).

Whenever you call cuvidParseVideoData, the parser will call the pfnDecodePicture (in decode order), and pfnDisplayPicture (in display order). Simply call cuvidDecodePicture from your DecodePicture handler, and cuvid{Map|Unmap}VideoFrame in your DisplayPicture handler, where you can do whatever processing you want on the decoded pictures.

The attached sample code, albeit inefficient, should be closer to what you're trying to achieve:

It reads a raw H.264 elementary stream from disk, decodes it, transfers the output back to CPU-visible memory, then converts it from NV12 to YUV 4:2:0 [horribly inefficiently] and writes the result to an uncompressed YUV file.

Let me know if this helps.

[attached file: cuh264dec[1].zip]

Thank you very much for your responses.

I will be experimenting with this starting tonight.

I have a further question, if I may. I implement a timeline and jump around on it. When I seek on the timeline, I start parsing forward for a reference frame and then decode it, display it, and stop. Now your decoder will be in a state expecting to receive and decode the following data. But if I immediately jump again, the decoder can become confused. With libavcodec, I found that I had to close and reopen the codec. Will I need to do something similar with CUDA and how would I do it?

At this point, the API is still fairly new, so we're still open to any suggestions for future versions.

For decoding immediately after seeking, one simple way is to deliver a dummy EndOfStream packet (to flush the decoder and reset the internal state).

Note that resetting the state also means that the decoder will not start decoding again until it gets valid SPS and PPS NALUs, so if you want to seek on non-IDR frames that are not preceded by an SPS/PPS, you may want to send dummy SPS/PPS NALUs to the decoder (this would also apply to streams that do not contain the SPS/PPS as part of the elementary video stream, i.e., MP4).

Yes, I am aware of the need to inject the SPS/PPS, because I have this working with libavcodec. I want that behavior because the SPS/PPS may have changed at the seek point.

If I inject the SPS/PPS and then start feeding an open GOP, will your decoder return only the first decodable frame, or will it return junk for the frames with orphaned references?

I would have to check. In the case of MPEG-2, it will by default drop the first 2 B-frames, but for H.264, it's likely that the parser will return all frames, including the ones that have missing references. Is there a specific behavior you would want to see?

Unless seeking is restricted to IDR frames, it is difficult to tell for sure whether a frame has a dependency on a previous frame or not (since we can't infer the DPB state, although the AVCHD/BD restrictions make it similar to MPEG-2 in practice).

The frame number returned, is it the POC or frame_num? It is useful for me to have the POC.

Both are present in the CUVIDH264PICPARAMS structure (CUVIDPICPARAMS.CodecSpecific.h264):

'frame_num' corresponds to the frame_num syntax element
'CurrFieldOrderCnt[0]' contains the POC of the top field
'CurrFieldOrderCnt[1]' contains the POC of the bottom field

For a frame picture, POC is min(CurrFieldOrderCnt[0], CurrFieldOrderCnt[1]) (iirc, both values should be identical for a progressive frame). For a field picture, POC is CurrFieldOrderCnt[bottom_field_flag].
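O's POC rule can be captured in a couple of small helpers. This is only a sketch of the selection logic; the names frame_poc/field_poc are mine, and the inputs stand in for CurrFieldOrderCnt[0]/[1] and bottom_field_flag pulled out of CUVIDPICPARAMS:

```c
/* POC selection following the rule above.
 * foc_top/foc_bottom correspond to CurrFieldOrderCnt[0]/[1]. */
static int frame_poc(int foc_top, int foc_bottom)
{
    /* Frame picture: POC is the smaller of the two field POCs
       (both should be identical for a progressive frame). */
    return foc_top < foc_bottom ? foc_top : foc_bottom;
}

static int field_poc(int foc_top, int foc_bottom, int bottom_field_flag)
{
    /* Field picture: POC is CurrFieldOrderCnt[bottom_field_flag]. */
    return bottom_field_flag ? foc_bottom : foc_top;
}
```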

I've run some experiments and run into serious difficulties.

First let me say I have an 8500GT which your SDK decoder tells me is CUDA 1.1. I wonder if this is the cause for the issues below.

I decoded a 640x480 H.264 video with the simple decoder you sent me. Everything went fine and I was greatly encouraged.

Then I decoded a 1920x1080 video and the decoded result was wrong. It scrolled around as if the dimensions were wrong for display. When I ran it with the SDK sample decoder, it crashed my computer. You can try it by getting the stream here:


But I also noticed that the decode rate was very slow at 12.6 fps! Much slower than I think can be accounted for by the NV12 conversion to YUV.

I would appreciate your help in resolving these issues. Perhaps I need to upgrade my hardware to support CUDA 2.0?

The 8500GT should fully support the decoding - sounds like a fairly serious bug if it was able to crash the computer. The 12fps problem is most likely related to the crash.

Performance for 1920x1080 decoding should be around ~45fps on 8500GT.

I will take a look at the stream. Can you share some information about the 8500GT you're using (board manufacturer, clocks, memory size?), and which driver version you're currently using?

I am able to reproduce the problem with the SDK sample app on 8800GT (didn't occur with an older version of the sample app - I suspect an issue in the nv12toargb kernel, though the fact that this can cause DX to hang is definitely a bug on the driver side).

Decode perf is about 65fps on my test system (should be the same as on an 8500GT).

There appears to be a problem in the sample app's cuda<->d3d9 interoperability. During the failure, any display access will be blocked for 5s or so, which appears as if the system is hung (waiting a few minutes should recover, but that is an issue on the driver side).

The problem doesn't appear related to the actual decoding of HD H.264 content, but rather in displaying it through d3d<->cuda. (The DX backbuffer size setup in the 2nd cmdline sample is irrelevant, since it is unused for display)

Also, if you run the command-line sample without actually converting to YUV and writing the uncompressed output to disk (not specifying an output file on the command line), you should be able to view the actual decode rate (At ~3MB/frame, a typical disk write bandwidth of 40MB/s would otherwise limit the throughput to ~13fps).

FYI, I have tried two different HD h264 samples and both failed the same way. I will try more this evening.

I can send the info you requested this evening.

I did notice in the simple decode file you sent me some DX thing is set up at 640x480. I don't know what that's for but when I changed it to 1920x1080 it made no difference.

Update: if you rebuild the SDK sample application with USE_DRVAPI=0 in videoDecode.cpp, the problem goes away (not clear why yet).

When you say the SDK sample, do you mean the shipped one in the SDK or the minimal one you sent me?

If the former, is there a similar workaround for the minimal application you sent me? That one doesn't hang but gives a wrong display.

I was talking about the shipped SDK sample. I haven't actually verified the minimal application sample, since I just took out the videosource part of another internal sample app that used the CUvideosource api (I probably broke something along the way - I'll take a look and get back to you).

The fact that the video is rolling up or down probably means a pitch or height mismatch in the NV12->YUV420 conversion.

Apparently the height is being returned as 1088 instead of 1080. So the coded height is being returned instead of the display height.

Will try to hack around that for now.

[Engineer E]
I have updated the application, this should fix the issue with the crashing. I found a bug in the application where the Driver API was setting the wrong # of threads/block. This project should have everything except the video file.

The NV12toARGB kernel is not fully optimized for coalesced memory writes. An 8500GT is a low-end part; because of the non-optimized memory accesses, playback performance may be limited by this kernel. I will see if I can optimize the kernel to improve performance.

[attached file: cudaVideoDecode.zip]

Thanks, E.

I'm more interested in O's stripped down decoder, but it will be useful to have this one also.

Which 8xxx would you recommend that I use for best performance?

[Engineer E]
An 8800GT or 9800GT is probably a decent GPU for development purposes ($110-200). The cudaVideoDecode sample will run fine on an 8600GT, with optimizations I can most likely get it to run well on an 8500GT.

Actually, I'm not seeing the problem here. I used a standard yuvviewer tool to view the 1920x1088 output and it looks correct (note that it's writing the non-cropped output, ie 1088 and not 1080).

I got 3.5fps because I was writing the uncompressed file over the network, but if running without the disk i/o and yuv conversion, I got 55fps. It's suboptimal because decode and transfer are serialized, so the video processor is not fully utilized.

With proper pipelining (running decode in a separate thread), I get 65fps. It's possible that the perf delta might get worse on G86, although 12fps sounds a bit low for 3D (which is basically only doing a few unnecessary copies of the frame and a final transfer to system memory, which should occur at 2+GB/s)

The attached cudecode.cpp is the original test application I was using (relies on CUvideosource to have decode running in a separate thread). It also has a '-nodisp' command-line option to bypass any map/unmap/copy calls and only measure raw decode throughput.

[attached file: cuh264dec[3].zip]

I really appreciate your and E's responsiveness! The SPS in that stream specifies a cropping rectangle that makes the display size 1440x1080! Seems to me that you need to honor that and return the correct display size. Otherwise how will my code know what the display size should be? When I used 1088, I got the expected garbage in the last 8 lines. I got 50 fps when I turned off the YUV output. I'm less concerned (right now) about performance than about correct decoding. I will be testing a lot of different stream types, especially PAFF, which libavcodec can't get right.

Are we talking about holzi.264 ?

I'm seeing frame_cropping_flag=0 in this stream (frame_crop_left_offset, frame_crop_right_offset, frame_crop_top_offset, frame_crop_bottom_offset are all zero).

The decoder will always decode the full coded frame size. It was designed so that it is up to the client (or in this example, the YUV conversion stage) to crop the video to the desired dimensions before displaying or applying any additional processing. These restrictions are due to the underlying implementation using MS-provided DX components, but we should be able to fully support this in the future once we remove the dependency on D3D and make this API cross-platform.

The parser will return both the coded dimensions (coded_width, coded_height) as well as the cropping rectangle (display_area) in CUVIDEOFORMAT.
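Since the decoder always returns the full coded frame (e.g. 1920x1088), the client-side crop O describes amounts to a per-plane row copy using the coded pitch and the display rectangle. A minimal sketch for one plane, assuming plain byte buffers; crop_plane is a hypothetical helper (not part of the API), with offsets that would come from CUVIDEOFORMAT's display_area:

```c
#include <string.h>

/* Copy the display-sized region out of a decoded plane. src_pitch is the
 * decoder surface pitch (coded width), and left/top are the crop offsets;
 * rows beyond display_height (e.g. lines 1080..1087) are simply dropped. */
static void crop_plane(unsigned char *dst, int dst_pitch,
                       const unsigned char *src, int src_pitch,
                       int left, int top,
                       int display_width, int display_height)
{
    for (int y = 0; y < display_height; y++)
        memcpy(dst + y * dst_pitch,
               src + (top + y) * src_pitch + left,
               (size_t)display_width);
}
```

For NV12 the chroma plane would be cropped the same way with halved height and offsets.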

Oops, yes, it was a different clip. You are correct about holzi.264. It's good news that I can get the cropping rectangle. I'm happy with that. Will performance improve when the DX connection is broken? You will be happy to know that all the problem streams tested so far work fine on your decoder. I'm happy too, but still testing... To answer an earlier question, it returns garbage for orphaned reference frames at the beginning, as you suspected. But I use the POCs to figure out with reasonable reliability what to ditch, so all is OK. Your support is outstanding and greatly appreciated. I'm a field support engineer so I know how challenging it can be.

Glad to hear that.

Once we break the DX dependency, it should cut a lot of additional unnecessary copies that are being done with the 3D engine, but decode throughput will stay the same if 3D isn't a bottleneck (which means you would get 65fps on your 8500GT for the holzi.264 stream, but I'm pretty sure you can already get that with the cudecode sample).

I have been doing some testing on streams that libavcodec has problems with.

Out of 16 such clips, 3 showed problems when decoded with CUDA. The 3 play correctly with the CoreAVC codec.

I have uploaded them to my FTP site. I would appreciate your analysis and conclusion on what the problem and solution might be. Here are the FTP details:

user: guest@neuron2.net (enter it just like that)
pwd: xxxxxxx

It seems that the problem is the same for all 3, and I strongly suspect these are not compliant (The decode HW is very very strict on compliance, though there might be some fudging we can do in the driver).

The CUvideosource will fail to detect mpeg4-hosed.264 as a valid elementary stream because of invalid nal_ref_idc (0) being specified for the PPS (I can modify that, but that's not a problem for actual decoding).

Looking at sample.264, it appears to be B-pyramid with 3 B-frames, using B-frames as reference, and only one reference frame in the reference picture list.

I don't seem to see a problem except for the fact that it is truncated at a non-IDR point, so some of the reference frames will not be available for the first few B-frames that are then used as reference. It also seems a bit unusual to have multiple consecutive B-frames sharing the same frame_num value, even though these have nal_ref_idc!=0 (this can cause unpredictable results in the sorting of the reference picture list if they are used as a reference by a P-frame) -> I don't think this is compliant, but I'll have to double check.

Many encoders & decoders out there (including CoreAVC, libavcodec, MPC-HC, Cyberlink and x264) do not fully implement the H.264 specification, especially when it comes to DPB management and reference picture list reordering.

Update: looking deeper into the DPB management, I'm finding out that the NVCUVID parser performs gaps_in_frame_num() according to the spec (because we're not starting at an IDR), but really this does more harm than good, since the stream uses adaptive reference picture marking -> the dummy non-existing frames inserted never get properly evicted.

I should be able to improve that quite a bit (should also improve error resilience when actual frames get dropped). I'll run some more experiments and send you an updated nvcuvid.dll.

As it turns out, the B-frames do not appear to be used for reference by the P-frames (due to the adaptive reference picture marking), so there is no problem with the frame_num values. The problem turned out to just be a simple bug in the handling of gaps_in_frame_num(), causing the adaptive reference picture marking to not correctly evict the non-existing references.

There is still some visible corruption for the first few B-frames, but I think this is expected, since the forward references are missing (it no longer propagates to the following frames).

Attached is an updated version of nvcuvid with the parser fix (rename to .zip). Note that this version also includes new enhancements for supporting level 5.1 streams, but this requires the final 177.96+ drivers that have not yet been released (attempting to decode such streams with an older beta driver may result in unpredictable behavior).

E: can you also update the internal SDK with the attached binaries ?

[attached file: nvcuvid_080904.zip]

You are amazing!

I will test this tonight and have feedback for you by morning. Then I plan to get the performance optimized followed by testing of the random access strategy. If all goes well, I will modify my DGAVCDec suite so that it has a decoder "HAL" to allow for different decoders to be easily slotted in, and then I will roll out CUDA support in the suite. Believe me, a lot of people will be very happy to have fully correct decoding with Avisynth frame serving. CoreAVC refuses to give me an SDK, although they say they are "thinking about it". I'm blown away by the support you offer. Thank you so much. Give me your boss's email address and I'll tell him about it. :-)

Hi E,

I understand that 9800GT has PureVideo 2 while 9600GT has PureVideo 3.

If I want PureVideo 3, what are my options?

[Engineer E]
9800 and 9600 have the same Video Decode capabilities. They should be identical. I'm not familiar with PureVideo 3, is there a web link that has more details?

See here for where I saw what I mentioned about PureVideo 2 versus 3. It appears that the documents on PureVideo at the Nvidia site are old and don't describe the new silicon.


I'm thinking of a BFG 9800GTX card.

[Engineer E]
The Wikipedia information is incorrect. The 9600GT has the same VP2 engine as the 9800GTX and GeForce GTX 280.

OK, thank you. I'll get a 9600GTX then.

[Engineer E]
Given that an 8800GT (G92) at ~$110 and a 9600GT (G94) at $90 are very close in price, why not just get the 8800GT? The 8800GT has 112 stream processors vs. the 9600GT's 64 stream processors.

Oops, I meant I would get a 9800GTX.

Bravo, O!

All clips working.

It is such a breath of fresh air after working with libavcodec.

Onward and upward...

Very cool!

That's very useful feedback for us, especially since it helps find issues early on (like the problem in handling of gaps in frame_num), and gives us more ideas about the different scenarios where the codec functionality can be used outside of normal real-time playback (this will hopefully boost the priority of such projects in the future).

Btw, XXXXX is the director of Video Software at Nvidia - I forwarded him your previous mail, which was very good praise in itself :)

I'm thinking of getting a BFG 9800GTX card. What kind of fps do you think I can get with that at 1920x1080?

You earlier speculated that I would get 45 fps with my 8500GT and I measured 48 fps. But then you said later that I should be able to get 65 fps. Why did you change your opinion and why don't I get that (disk writes disabled of course)?

There are multiple things that can contribute to the perf. You can probably measure the current ideal perf with the cudecode app (compiled from cudecode.cpp, run "cudecode inputfilename", with no output specified).

My initial perf numbers were based on local measurements, where I tend to use difficult 30+Mbps 1080i content. For these streams, I would expect 42-45fps with full VP2 utilization (measured on 8500GT). For the streams you sent, I measured the frame rate to be around 65fps.

However, there are a few factors that may prevent reaching full VP utilization:

1- If decode runs in the same thread as display (ie: calling map() directly from within the display callback), the video processor will be idle (this frequently happens, especially for frames where display order == decode order). In this case, the amount of performance drop depends directly on the speed of the transfer to sysmem and the speed of the GPU.

2- The video decode consists of multiple engines, so there should be at least 2 frames queued at any time for decoding in order to reach full throughput.

3- I am using a newer driver that includes some optimizations made to the per-frame overhead of the bitstream processor, which can save as much as 1ms/field (could be a contributing factor here)

When getting full VP utilization, without any 3D bottleneck, the perf of the 8500GT should be very close to a 9800GTX, since IIRC the video processor is clocked at the same speed on both, though this could depend on the manufacturer (usually it is running at 450MHz, but some low-end GPUs are only running VP at 400MHz).

In practice, this may be different if display is involved, due to deinterlacing, scaling and other additional post-processing (low-end GPUs may not be able to reach their peak decode throughput). Also, by default, the code I sent you will deinterlace interlaced content - which you probably don't want (you can turn it off by forcing the DeinterlaceMode to _Weave and/or mark all frames as progressive when calling cuvidMapVideoFrame)

Thank you for the thorough answer. Now, you've got me thinking it's not worth it to buy a 9800GTX. I will try to digest all this and do some experiments.

I should have been more specific: if the multi-threaded cudecode sample shows you less than 65fps, it means that 3D probably caps at whatever frame rate you're seeing (most likely due to the many inefficient transfers between cuda & d3d in the current implementation). In this case, you'll benefit from a faster GPU (though you probably won't see much difference between a 8600GT and 9800GTX, since VP2 perf is essentially unchanged).

If serialized with decode, the average time per frame would be (VP2_time+3D_time), whereas with full pipelining it would be max(VP2_time, 3D_time). On mid-range to high-end GPUs, 3D_time is much smaller than VP2_time, so it's not really a big deal, but it would hurt more on low-end GPUs.

(Note that VP2_time itself can vary with pipelining since it's really a combination of multiple engines that can operate on multiple frames simultaneously)
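The serialized-versus-pipelined relation is easy to sanity-check numerically. A sketch with hypothetical helper names, taking per-frame stage times in milliseconds:

```c
/* Serialized: each frame pays for both stages back to back. */
static double fps_serialized(double vp2_ms, double d3d_ms)
{
    return 1000.0 / (vp2_ms + d3d_ms);
}

/* Pipelined: the slowest stage alone sets the frame rate. */
static double fps_pipelined(double vp2_ms, double d3d_ms)
{
    double bottleneck = vp2_ms > d3d_ms ? vp2_ms : d3d_ms;
    return 1000.0 / bottleneck;
}
```

With, say, VP2_time = 15 ms and 3D_time = 5 ms per frame, serialization yields 50 fps while pipelining yields about 66 fps.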

But the 9800GTX+ has a faster core clock! Isn't that worth going for?

I ran your multithreaded example with no deinterlacing and -nodisp on holzi.264. I get 66fps. Good.

I ran it on a 1080i field-coded stream and got 48 fps (weave).

From what I can find out, the 9800GTX+ has the fastest "core clock" of any of the chips. Is that the one that clocks VP2?

So I can get 66 fps with GPU on holzi.264.

I decided to see what CoreAVC would do on my E8500 @ 3.8GHz. It did 60 fps.

So you are better even on a high-end CPU.

But it's the correct decode that is the big deal for me.

Nice work!

Yup. On simple streams, VP2 will usually not outperform a highly optimized SW decoder on a high-end CPU in terms of raw decode performance (this is especially true for MPEG-2), since the programmable part of VP2 is essentially a 450MHz SIMD processor whose only advantage is that it's quite efficient compared to an x86 CPU.

On the other hand, the perf of a SW decoder will drop significantly when facing a 40Mbps CABAC stream with a single slice, whereas HW perf won't change very much (I've never seen VP2 falling below 40fps regardless of content).

Unfortunately, VP2 runs in its own clock domain (usually fixed at 450MHz).

Video engines are typically designed for low-power, more so than raw absolute performance, and the biggest gains come from new generations (next-gen video engines will be quite a bit faster, especially for H.264, but you may have to wait a few months).

[Btw, I think holzi is the clip where I did measure ~66fps as well on 8800GT, so this makes sense]

Ah, so at least I save some money because 8500GT is already as good as it can get (currently).

But then, E wrote this:

"The NV12toARGB kernel is not fully optimized for coalesced memory writes. An 8500GT is a low-end part; because of the non-optimized memory accesses, playback performance may be limited by this kernel. I will see if I can optimize the kernel to improve performance."

Do you think it is worth upgrading to 8600GT or better? He said the 8600GT is better optimized.

Hmm, I guess that is irrelevant if I am doing the NV12 conversion myself.

Yup - it wouldn't help for this particular case. The only benefit would be that the GPU load from just deinterlacing and/or copying the decoded frame would be less.

I'm not 100% sure, but it's also possible that the deinterlacing quality might be better on the 8600GT.

Since we rely on default D3D behavior, the deinterlacing mode used is the default simple adaptive deinterlacing, which is tweaked for good real-time performance at 60Hz with scaling to high display resolutions. I think automatic 3:2 pulldown detection might also be disabled for the same reason (not an issue unless the content is incorrectly flagged - which seems unlikely for H.264)

Again, this is the kind of stuff that we could have much more control over in a cross-platform implementation, but the primary goal was to enable 3rd party applications to use their own algorithms -> This is where the GPU really shines when it comes to post-processing operations, where it's just massive per-pixel operations.

(Operations like multi-tap scaling or noise reduction can be orders of magnitude faster on a midrange GPU compared to high-end CPUs)

Good morning,

That raises a question.

If the pictures have pic timing that specifies pulldown (or repeat flags for MPEG2), does the VP2 perform the pulldown, or does it just return the frames and rely on the display application to do the pulldown?

For libavcodec, I just get decoded frames and then my SW will either implement the pulldown or not, depending on a user option. Please explain my options with VP2. I need to control if pulldown is honored or not.

VP2 should work the same way as libavcodec; pulldown flags are irrelevant to the decoding process.

If pic_timing is present, the parser will set the repeat_first_field flag accordingly, in order to unify the MPEG-2 and H.264 handling (it actually means the number of extra fields, so repeat_first_field could be as much as 4 in the case of frame tripling, for example 24p content coded as 60p). The frames should already be flagged as progressive.

For changing the frame rate to 24Hz (essentially converting the 3:2 cadence into a 2.5:2.5), it's entirely up to your application.
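Under the extra-fields convention above, the field accounting is simple arithmetic. A sketch with hypothetical helper names (not part of the parser API):

```c
/* Each coded picture covers 2 + repeat_first_field display fields,
 * since repeat_first_field counts *extra* fields in this convention. */
static int display_fields(int repeat_first_field)
{
    return 2 + repeat_first_field;
}

/* Total display fields for a run of pictures: a 3:2 cadence
 * (repeat_first_field alternating 1,0,1,0) turns 4 film frames
 * into 10 fields, i.e. 5 interlaced frames at 60i. */
static int total_fields(const int *rff, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += display_fields(rff[i]);
    return total;
}
```

An application undoing the pulldown would simply ignore the extra-field counts and emit one frame per coded picture.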

Ah, you work weekends, too. Excellent.

I hope they pay you overtime. :-)

That's just the answer I wanted to hear! You guys are brilliant.

This morning I rewrote DGAVCIndex to use an abstracted interface to the decoder:

// API for video decoder
extern int decoder_open(void);
extern void decoder_close(void);
extern int decoder_reset(void);
extern int decoder_decode_nalu(int *frameFinished, unsigned char *buf, int len);
extern void decoder_copy_frame(unsigned char *y, unsigned char *u, unsigned char *v);
extern unsigned int decoder_get_width(void);
extern unsigned int decoder_get_height(void);
extern int decoder_get_poc();

(DGAVCDecode needs a little more data, like the repeat flag.) Now all I have to do is implement those functions using VP2/CUDA.

It's looking good. :-)

Excellent! Let me know when you have a beta build, I'd like to experiment with it myself.

How is CurrPicIdx managed?

For example, what keeps it from getting too big for the size of your frame store? Is it an arbitrary number unrelated to the stream? When does it get initialized, etc.?

Also, your second decoder you sent me that is faster than the first... Is it faster mostly because it can decode more than one frame at once and thus use the VP2 more efficiently?

And it uses a video source to push data. Can I replace your HandleVideoData() with my application NALU pusher without affecting the operation?

The value of CurrPicIdx is always in the range of [0..ulMaxNumDecodeSurfaces-1] (ulMaxNumDecodeSurfaces is passed in during parser initialization)

The parser will recycle the oldest frame buffer that has been passed in for display. If you implement a decode->display queue, you need to synchronize before calling cuvidDecodePicture, to wait until the target frame index used to decode a new picture is no longer needed for display (for example by associating a mutex with each frame buffer). The 2nd sample does this with a dumb Sleep(1) loop and a flag (FrameInUse[]).
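The recycling contract can be modeled without threads: the decode side must not reuse a surface index until the display side has released it. This is a sketch of the bookkeeping only, with hypothetical names; the actual sample polls this kind of state in a Sleep(1) loop before calling cuvidDecodePicture:

```c
#define MAX_DECODE_SURFACES 8   /* would match ulMaxNumDecodeSurfaces */

static int frame_in_use[MAX_DECODE_SURFACES];

/* Decode side spins on this predicate before reusing CurrPicIdx. */
static int surface_available(int pic_idx)
{
    return !frame_in_use[pic_idx];
}

/* Decode side: mark the surface busy when queueing it for display. */
static void acquire_surface(int pic_idx)
{
    frame_in_use[pic_idx] = 1;
}

/* Display side: release after cuvidUnmapVideoFrame. */
static void release_surface(int pic_idx)
{
    frame_in_use[pic_idx] = 0;
}
```

A production implementation would replace the flags with a mutex or event per surface, as O suggests.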

The reason why the 2nd example is faster is the following:

Think of the decoding engine as three different processors, followed by a 3D stage:

1. The first processor performs bitstream decode (VLD)

2. The second processor performs motion vector prediction (MV)

3. The third processor performs the actual motion compensation and deblocking (VP). It is then followed by a display (3D) operation (assuming display order == decode order for simplicity)

4. Deinterlacing+Cuda Mapping+Copy to system memory (3D)

The output of each processor is fed into the next one, and although the synchronization can occur at sub-frame granularity, they each can operate on different pictures.

In the second sample (cudecode):

VLD can start working on picture #4 while MV is still on picture #3 and VP is still on picture #2, and 3D is copying picture #1.

In the first sample (cuh264dec):

When the CPU is waiting for 3D to finish work on picture #1, VLD,MV and VP are all idle (except for the frames where decode order != display order). Because most B-frames are ready to be displayed immediately after being decoded, this situation will occur ~50% of the time, preventing you from reaching full VP utilization.

You can definitely replace HandleVideoData by your application - it should just work as long as HandleVideoData is running in its own thread. The key to good perf is that HandlePictureDisplay queues the frame in a display queue instead of calling MapVideoFrame directly. It can be done without multithreading as well, by just managing a display queue (If I have some time, I'll try to implement something like this in the cuh264dec sample)
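The single-threaded display-queue idea can be sketched as a tiny ring buffer (hypothetical names; DISPLAY_DELAY plays the role of the knob in the modified sample): HandlePictureDisplay would push each new picture index and get back the one queued DISPLAY_DELAY pictures earlier, so decode work stays ahead of the map/copy work.

```c
#define DISPLAY_DELAY 2

static int display_queue[DISPLAY_DELAY];
static int queue_count;

/* Push a newly decoded picture index; returns the picture index that is
 * now ready to display, or -1 while the pipeline is still filling. */
static int queue_for_display(int pic_idx)
{
    int slot = queue_count % DISPLAY_DELAY;
    int ready = (queue_count >= DISPLAY_DELAY) ? display_queue[slot] : -1;
    display_queue[slot] = pic_idx;
    queue_count++;
    return ready;
}
```

At end of stream the remaining DISPLAY_DELAY entries would still need to be drained and displayed.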

Thank you for your very thorough explanation and guidance.

No problem.

Attached is a modified version of cuh264dec with a simple decode->display delay implementation. Ideally this should have the same perf as cudecode.

Note: I have not tested the code at all (working remotely), I only know it compiles and that's about it :) You can experiment with the value of DISPLAY_DELAY, my guess it that DISPLAY_DELAY=2 should be enough to hit max perf (and any value beyond 4 is a waste).

[attached file: cuh264dec[2].zip]

Thanks, I'll let you know how it works.

With DISP_DELAY=4 I get 56 fps compared to 66 for the other one.

There is no -nodisp option so I just omitted the output file name.

Haven't looked at the code yet to try to see why it's 10 fps slower.

So we have:

version 1: 48 fps
version 2: 56 fps
version 3: 66 fps

How much do you get in cudecode without the '-nodisp' option, but without an output filename ?

I'm guessing the times should be similar. On a 8500GT, it's possible that the 3D processing takes some bandwidth away from VP2, thus reducing the efficiency (?).

I'll test that later.

Right now I have a serious problem!

I have implemented the first version (just because it is simplest) in my app. All the init succeeds and I start feeding NALUs. The video sequence callback is hit and has all the right data. But then when it tries this:

        // Create the decoder
        result = cuvidCreateDecoder(&state->cuDecoder, &state->dci);
        if (CUDA_SUCCESS != result)
        {
            printf("Failed to create video decoder\n");
            return 0;
        }

It gets a failure with CUDA_ERROR_NO_DEVICE. I can't figure out why. Help!!!

Could it be a problem with my threading?

I init the decoder system in my main window thread and push NALUs in a different thread.

Yup, that's it.

You have to start everything in the same thread that pushes the NALUs.

Why is that and does it have to be that way?

Yes. A restriction of cuda is that the context is always associated with a single thread (though there are ways to get around that).

Because we use D3D for decode that restriction doesn't apply to cuvidDecodePicture, but it does apply to every other cuvidXXX function. There are two ways to solve this:

1. Create the CUDA context (cuD3D9CtxCreate) in the same thread that creates/destroys the decoder

2. Use floating contexts (a bit more complicated, but you never have to worry about threading):

After calling cuD3D9CtxCreate, do the following:

CUcontext myContext;
CUvideoctxlock myLock;

cuvidCtxLockCreate(&myLock, myContext);

When creating the decoder, set CUVIDDECODECREATEINFO.vidLock = myLock. Then, whenever you want to make any cuda calls (such as cuMemcpyDtoH), do this:

cuvidCtxLock(myLock, 0);
... // cuMemcpy, cuvidMap/Unmap etc...
cuvidCtxUnlock(myLock, 0);

This will attach the context to the current thread, and automatically synchronize multiple threads that are competing for the same CUDA context.

See my previous mail.

I had some heated arguments with the CUDA designers about this, because I thought it was a ridiculous restriction in this day and age, but their answer was that it's similar to the way OpenGL works blah blah blah (what's worse was that there was no way to synchronize access to the context).

To get around it, I added the cuvidCtxLock objects so that multiple clients can have a common way to synchronize.

Let me know if this works - I have to go hang some shelves in my garage before my wife gets mad :)

Thanks. I have worked around it and will think about a proper solution later.

I succeeded to display the first picture in my application after opening a stream. Woo hoo!

But now I have another problem.

After I decode the first picture I just stop injecting NALUs. Then I do the reset trick because I want to jump forward by a GOP for GOP stepping in the GUI.

     pkt.flags = CUVID_PKT_ENDOFSTREAM;
     pkt.payload_size = 0;
     pkt.payload = NULL;
     pkt.timestamp = 0;
     cuvidParseVideoData(state.cuParser, &pkt);

As soon as I execute the cuvidParseVideoData() call, I immediately get a HandlePictureDecode() callback and it crashes on calling cuvidDecodePicture().

Oh crap. I just thought of why. When I stop decoding after the first picture, I kill the decode thread and start it again for the next GOP. I can't do that according to what you said. What a pain in the ***. Back to the drawing board. I hoped not to have to re-architect my application. Maybe I can just close and re-open everything? Yucky.

While you work on your shelves I am going to replace my stock CPU fan. I hope your wife is happy so you can be. :-)

Looks like I need floating contexts and then attach the context when I start the decoder thread each time and release it when it ends.

Hope it works!

Hmm. There might be something else that is going on. cuvidParseVideoData and cuvidDecodePicture do not need the context to be active.

You also shouldn't need to re-create the decoder unless the resolution changes.

Let me know when you have a beta - I'll give it a try and see how it performs on VP3 :)

PS: My shelves are up and the wife is happy :)

I'm having a problem.

I decode and display my first frame and then try to stop by stopping injecting NALUs and then use the reset trick:

>      pkt.flags = CUVID_PKT_ENDOFSTREAM;
>      pkt.payload_size = 0;
>      pkt.payload = NULL;
>      pkt.timestamp = 0;
>      cuvidParseVideoData(state.cuParser, &pkt);
>      return 0;

But as soon as I call cuvidParseVideoData() above I immediately get HandlePictureDecode() and HandlePictureDisplay() callbacks for a few frames.

But I am resetting the decoder; I don't want those callbacks! It seems that just signalling EOS isn't good enough, because cuvid will want to deliver the remaining frames in the pipe. I need to flush the pipe with my reset.

Then, worse, when I subsequently try to start playing again by just restarting the NALU flow, it gets caught in a tight loop in nvcuvid.dll and never makes any further callbacks.

Any ideas, please?

Sounds like there might be a problem in the parser after EOS if you see a spinloop within nvcuvid.

EndOfStream will flush pictures pending for display, so it is behaving as expected. If you want to essentially completely reset the parser and do not want any callbacks, one easy way to achieve this would be to just destroy and re-create the parser (the overhead should be minimal as the parser is completely independent of the HW, so it's just a new/delete operation).

(Maybe I can add a _RESET packet flag, although if the caller knows it wants to ignore the callbacks, it can also easily be achieved in the callbacks themselves, by simply doing nothing in the callbacks)

It would be great if you can share a binary that reproduces the problem where nvcuvid gets stuck in a tight loop, that's definitely something I'd be interested to fix right away.

I went to put the latest binary together and send it to you, but when I tested it one more time the tight loop wasn't happening anymore! I had made some changes so I'll keep an eye on it and if it happens again I'll send it to you.

I'll try the parser destruction.

Parser destroy/recreate works great. Thanks for the tip.

I have some nonoptimalities in my app right now but with one 1080 stream I get 34 fps and with another I get 24 fps.

Can the complexity of the stream account for that big difference? If not, what else?

It's difficult to tell when everything is serialized; simple things such as the 'Sleep(1)' loop waiting for the cuMemcpyDtoH to complete can easily cost an additional 10ms randomly, since it's highly dependent on the OS scheduler.

Also, any additional processing (such as the NV12->YUV420 conversion) will occur while the gpu is idle.

The difference between 34fps and 24fps is 12ms/frame, so that's definitely well above content complexity differences, -> it's unlikely that this overhead would be due to VP (Most likely, the GPU is idle most of the time).

That's very interesting. With the same app I have a 720P stream that plays at 77 fps!

Is there any utility or anything like that I can use to assess my GPU utilization percentage? That would help to optimize my application vis-à-vis GPU utilization.

Remember that I am still running with the system based on the first version you sent.

I suppose it is worth getting rid of the Sleep loop. As always, thanks for your great support.

I can't account for why the txwild2 stream plays so slowly, even considering what you have told me.

ftp: ftp.neuron2.net
usr: guest@neuron2.net
pwd: xxxxxxx
file: txwild2.264

Using your fastest decoder (with the created video parser) and no outfile and with -nodisp, I get:

holzi.264: 66 fps
txwild2.264: 47 fps

What can account for this difference??? It really baffles me.

Meanwhile I'm trying to understand the difference between the three versions you sent me:

version 1: slow
version 2: fast, with a file parser in the main thread
version 3: fastest, but with a created video source object

Are the following assertions true?

1. Version 1 is slow because the pipeline goes idle while doing picture copy/display in the main thread.

2. Version 2 is faster because it can keep the pipeline full with multiple frames thanks to the display delay.

3. Version 3 is faster yet because it has the video source in its own thread, and so the parser is not blocked by display activity.

Sorry for being such a pest. But I think I am almost in a position to go forward with much less guidance. I plan to document all this on my web site to benefit others.

Versions #2 and #3 should be identical in perf, unless you use the '-nodisp' option in version #3 (which isn't a fair comparison since there is no transfers/post-processing and the GPU is idle, so 100% of the bandwidth is available to VP2).

On a high-end GPU, the perf with or without -nodisp is almost identical (since there is plenty of bandwidth available to VP2, whereas the 8500GT is only a 128-bit bus, so 3D tends to be memory-bound). Can you double-check the perf differences between #2 and #3 without using the '-nodisp' option?

I'll take a look at txwild2 tomorrow, but it doesn't seem very unusual: my worst-case 1080i clips can indeed run as low as 40fps, so the 47fps isn't too far off (I was actually surprised to see vp2 go as high as 66fps).


holzi.264:
version 3 without -nodisp: 62.7
version 3 with -nodisp: 66.3
version 2: 62.8

txwild2.264:
version 3 without -nodisp: 45.7
version 3 with -nodisp: 46.9
version 2: 45.8

So, you are correct about that.

>On a high-end GPU, the perf with or without -nodisp is almost identical
>(since there is plenty of bandwidth available to VP2, whereas the 8500GT
>is only a 128-bit bus, so 3D tends to be memory-bound)

That tells me I can benefit a bit by upgrading my card (holzi.264 would gain 4 fps). Which GPUs are "high-end", i.e., do not have a 128-bit bus?

>my worst-case 1080i clips can indeed run as low as 40fps

OK. But I am still curious about what accounts for the big difference. It may help to advise people to avoid certain things when encoding.

When will VP3 be available and what will it bring to the party regarding performance and features?

I have the decoder basically working in my app using version 1. I did it just to expose any issues there might be, and it succeeded in that because it brought the threading and reset issues to light. Now I plan to upgrade to the version 2. Then I'll address nonoptimalities in my own code, e.g., I don't know why my app is showing 25% CPU utilization on a 3.8GHz machine while playing.

I'm using a 256MB 8800GT (256-bit bus) to test this, and I'm getting similar perf results:

txwild2.264:
Without -nodisp: 46.5fps
With -nodisp:    47.2fps

holzi.264:
Without -nodisp: 65.8fps
With -nodisp:    67.8fps

It's quite odd for a progressive stream to stress VP2 so much, but I'm guessing it may be due to the stream using a lot of small partitions (4x4 motion vectors), which may be the biggest bottleneck for VP2.

Although in a sense, it's holzi.264 that is the more unusual one, since the typical range for decoding 1920x1080 on VP2 is 45-50fps. As a rule of thumb for bus width:

X200,X300,X400 (9200/9300/8400/9400) -> 64-bit bus
X500 (8500/9500) -> 128-bit bus
X600 (8600/9600) -> 256-bit bus
X800 (8800GT/9800GT) -> 256-bit

The CPU utilization could be due to the slow NV12->YUV420 (let me know if you want me to send you an MMX version using intrinsics -> should be supported by gcc as well)

Some new data for you...

1. I converted my application to use the delayed display method (version 2). It is working at the expected decoding rate. Wonderful!

2. I found that turning off ASYNC_COPY improves performance by 3-4 fps.

3. The big killer for CPU and frame rate in my application turned out to be the YV12->RGB conversion for display. So I guess the MMX code you offered for NV12->YV12 isn't really needed. But if you can provide really fast code for YV12->RGB (with both interlaced and progressive modes), that would be greatly appreciated. :-)

Shouldn't the GPU do this and be able to return RGB directly?

Can you suggest a more efficient way to get the data on display in Windows? I convert NV12 to YV12 to YUY2 to RGB and then SetDIBitsToDevice().

When I have a clean beta I will send it to you.

The GPU can do RGB conversion very fast (when properly optimized to use the texture unit, unlike the cuda SDK sample :).

Unfortunately, most Windows APIs only deal with RGB, so the only way to get access to GPU-based YUV->RGB conversion is to use a MS-provided renderer to handle the display (definitely not trivial if you're not running in a filtergraph). Alternatively, it could be done using cuda, then copying the RGB data to sysmem, or using the CPU (not very efficient for RGB24, but RGB32 shouldn't be that bad).

Usually, YUY2(4:2:2)->RGB is faster than YV12/NV12(4:2:0)->RGB, which is probably why most RGB conversion routines upsample to YUY2 first. The first step would probably be to do an NV12->YUY2, rather than going through the intermediate YV12 (should be just as fast as NV12->YV12, and that's the only step where the interlaced/progressive flag really matters).

The second step would be to either display directly in YUY2 (may not be possible if your application uses GDI), or use a MMX-based YUY2->RGB32 (IIRC, GDI should support RGB32)

I've been pretty busy today, but if you can send me a code snippet for your existing YUY2->RGB, I can send you back a MMX-optimized version within the next couple of days.

I just threw together a simple MMX NV12->YUY2 conversion routine.


static void NV12toYUY2(const unsigned char *py8,
                       const unsigned char *puv8,
                       unsigned char *pyuy2_8,
                       int width, int height, int pitch, int interlaced)
{
    int w = (width+7)&~7;
    const __m64 *py = (const __m64 *)py8;
    __m64 *pyuy2 = (__m64 *)pyuy2_8;
    for (int y = 0; y < height; y++)
    {
        const __m64 *puv;
        if (interlaced)
            puv = (const __m64 *)(puv8+(((y/2)&~1)+(y&1))*pitch);
        else
            puv = (const __m64 *)(puv8+(y/2)*pitch);
        for (int x = 0; x < w; x++)
        {
            __m64 r0 = py[x];       // yyyy
            __m64 r1 = puv[x];      // uvuv
            __m64 r2 = r0;          // yyyy
            r0 = _m_punpcklbw(r0, r1);  // yuyv
            r2 = _m_punpckhbw(r2, r1);  // yuyv
            pyuy2[x*2] = r0;
            pyuy2[x*2+1] = r2;
        }
        py += pitch/sizeof(__m64);
        pyuy2 += (width*2)/sizeof(__m64);
    }
}

Thank you very much for your last two mails and code for NV12 conversion. I will hold that for later.

Right now I am working through the implications of the display delay for my application. The display and decode are now not aligned and I need to make some changes to allow for that.

What is the significance of MAX_FRM_CNT? I thought it would just be the size of the pool of frames but it seems to be more than that. It seems to affect the decoder in some other way, although I haven't quite pinned that down yet. It seems to me that if it just determines the pool of frames that is cycled through, then it could be equal to DISPLAY_DELAY or maybe DISPLAY_DELAY+1. But you have it at 16.

If I seek close to the end of the file, I find that, because the data is pushed ahead of display, I hit EOF and terminate before I get my display event. Now, if I reduce MAX_FRM_CNT it seems to affect when I get my EOF event. So that's why it seems that MAX_FRM_CNT doesn't just determine the size of a pool of frames.

I'll try to pin this down more, but your thoughts would be appreciated.

MAX_FRM_CNT should indeed only control the size of the pool of frames. Out of all these frames, some may be kept by the decoder (for frame reordering and reference frames). Ideally, MAX_FRM_CNT should be num_ref_frames+1+DISPLAY_DELAY. This also means that the maximum number of frames in the display queue is MAX_FRM_CNT-num_ref_frames-1, so using too low a value for MAX_FRM_CNT could reduce the display queue fullness (also, on XP, MS has a restriction that limits the maximum number of decode render targets to 16).

Ideally, when you reach EOF, you want to send a EOS packet to the decoder (will display any pending frames), and only then trigger the EOF event.

Ah, yes, send an EOS and then flush the display queue after waiting for the callbacks to finish. That works great. Thanks.

So I leave MAX_FRM_CNT at 16. Is anything gained by having DISPLAY_DELAY greater than 2?

Now I'd like to raise again an issue we discussed earlier. Suppose I want to step forward by GOP along the timeline. To step forward, I parse to an IDR and start decoding. That appears to work fine because the GOP is closed. But if I have a recovery point or an I slice (not IDR), I should be able to step to it and display a good picture. Indeed, libavcodec returns me a good picture, even if the GOP is open. But your decoder returns me a bad picture (missing references) and I'm not sure what to do about it. Seems to me it is the decoder that can know if the picture misses any references.

Your thoughts, please. I'd hate to have to restrict GOP stepping to IDRs, because many streams have really infrequent IDRs. I'd like to have you return a flag saying if the frame is missing any references or not.

This is currently a limitation in the parser within NVCUVID. Ideally, this should work the same way as MPEG-2, ie: by default, the parser should auto-detect the frames with missing references and should automatically skip the call to DisplayPicture in that case.

It's on my TODO list - I might have to parse recovery_point SEIs and other additional information, since there are some cases where there is no way to know for sure if a missing reference is actually needed or not (especially with B-pyramid). I've also seen some broadcast streams that do not have *any* keyframes whatsoever (only P-frames, with a partial intra refresh once in a while).

One workaround would be for you to check the POC values of the frame, compared to the POC value of the keyframe you really want to display (discard display for any frames with POC values less than the first keyframe), similar to what you're already doing with libavcodec (you can also use the timestamp value for that).

Ah, yes, of course. I can still use the POC, but the other way around!

Thank you. I will try that this evening.

Btw, there was a bug in the NV12toYUY2 routine I sent earlier (should be w=(width+7)/8 instead of (width+7)&~7) (I still haven't tested it, though)

static void NV12toYUY2(const unsigned char *py8,
                       const unsigned char *puv8,
                       unsigned char *pyuy2_8,
                       int width, int height, int pitch, int interlaced)
{
    int w = (width+7) >> 3;
    const __m64 *py = (const __m64 *)py8;
    __m64 *pyuy2 = (__m64 *)pyuy2_8;
    for (int y = 0; y < height; y++)
    {
        const __m64 *puv;
        if (interlaced)
            puv = (const __m64 *)(puv8+(((y/2)&~1)+(y&1))*pitch);
        else
            puv = (const __m64 *)(puv8+(y/2)*pitch);
        for (int x = 0; x < w; x++)
        {
            __m64 r0 = py[x];       // yyyy
            __m64 r1 = puv[x];      // uvuv
            __m64 r2 = r0;          // yyyy
            r0 = _m_punpcklbw(r0, r1);  // yuyv
            r2 = _m_punpckhbw(r2, r1);  // yuyv
            pyuy2[x*2] = r0;
            pyuy2[x*2+1] = r2;
        }
        py += pitch/sizeof(__m64);
        pyuy2 += (width*2)/sizeof(__m64);
    }
}

Thanks for the correction to the MMX code.

I implemented the POC check and it is working right, until...

I am monitoring the POC in your decode picture callback. I see it going up nicely until it gets to near 256 and then it jumps back to a negative value. I know that's not right because I use the JM reference code to parse the NALU and it is 258 when this failure occurs. This suggests that you are not adding the PicOrderCntMsb as you should when the POC reaches its limit, or you are storing it somewhere in a signed variable that is too small.

Can you check it please because I'm so close to having everything working now and I can't control the POC you return to me. :-)

Hmm. That's odd - maybe it's only sending the Lsb (?). I'll double-check this - certainly seems like a bug, it's meant to be 32-bit everywhere internally.

Btw, I haven't looked into it yet, but are you sure there is no IDR whenever the POC value resets ? The POC values will reset to zero along with every IDR.

I just double checked, and the value is definitely 32-bit and does include picOrderCntMsb.

One potential problem I found is that prevPicOrderCntMsb/Lsb are not reset at EndOfStream, which in theory could become problematic if seeking to a non-IDR frame without destroying/recreating the parser.

Do you have a specific clip where you see the unexpected wraparound at 256 ?

Looking at the code, the only way POC would reset to zero is under the following conditions:

1. pic_order_cnt_type=0: when an IDR is received
2. pic_order_cnt_type=1: IDR or MMCO5
3. pic_order_cnt_type=2: IDR or MMCO5

I have uploaded the stream:


I have also uploaded my NALU parser (from JM reference software) that returns POC 258 when you return -256. The relevant functions are RestOfSliceHeader() and decode_poc(), the latter of which is called from elsewhere in my code.

At the failure, there is no IDR, in fact the stream has no IDRs, only recovery point SEIs. Therefore the POC must be non-decreasing the whole way through, but you suddenly jump back to -256. The JM reference parser keeps going up at that point.

Also, at the failure, MaxPicOrderCntLsb = 512, PicOrderCntMsb = 0, PicOrderCntLsb = 258.

Thank you for looking into it.

So far I'm not seeing the same thing here. I tried printing all the POC values in a test app. Note that I had to discard some of the data at the beginning of the file, since there are roughly 0xcc000 bytes before the first SPS. The POCs increment correctly all the way through from the start.

Hmmm. Well, I just tried it in your simple decoder and not my app and you are right, things look OK. Could be my error. :-)

Sorry if it was a false alarm. Investigating...

It's definitely something to do with the decoder reset that I do before seeking. If I play straight through from the start in my app, the POC is good. If I seek to the GOP starting with POC 258 (where the failure occurs), do a decoder reset, and play, then the POC is wrong. If I do the latter but omit the parser destroy/recreate, while still flushing out the DisplayQueue, then the POC is OK. Here is my reset function:

int decoder_reset(void)
{
	CUresult result;
	CUVIDPARSERPARAMS parserInitParams;
	int i;

	if (state.cuParser != NULL)
	{
		cuvidDestroyVideoParser(state.cuParser);
		state.cuParser = NULL;
	}

	// Create video parser
	memset(&parserInitParams, 0, sizeof(parserInitParams));
	parserInitParams.CodecType = cudaVideoCodec_H264;
	parserInitParams.ulMaxNumDecodeSurfaces = MAX_FRM_CNT;
	parserInitParams.pUserData = &state;
	parserInitParams.pfnSequenceCallback = HandleVideoSequence;
	parserInitParams.pfnDecodePicture = HandlePictureDecode;
	parserInitParams.pfnDisplayPicture = HandlePictureDisplay;
	result = cuvidCreateVideoParser(&state.cuParser, &parserInitParams);
	if (result != CUDA_SUCCESS)
		return -1;

	// Flush display queue
	for (i = 0; i < DISPLAY_DELAY; i++)
		state.DisplayQueue[i].picture_index = -1;
	state.display_pos = 0;

	return 0;
}

How can this be happening???

Then I thought, OK, let's just not destroy/recreate the parser. But you said:

"One potential problem I found is that prevPicOrderCntMsb/Lsb are not reset at EndOfStream, which in theory could become problematic if seeking to a non-IDR frame without destroying/recreating the parser."

If you could send me a nvcuvid.dll with that fixed, I could probably get by with just not recreating the parser. Still I'd like to know why this is happening.

There's another funniness with the parser recreation that I haven't told you about yet, too. If I start and just make the parser once and decode txwild2.264, I get 24fps. But if I do that and then a decoder reset (remake the parser), I get 34 fps! I don't understand why. So if I accept the workaround to not recreate the parser I still have to do it twice at startup to get top speeds.

So some strange stuff is going on with the parser, IMHO.

Hope you can shed some light on it.

Attached is a version that will properly reset PicOrderCntMsb when flushed, so the behavior should now be consistent when doing a decoder reset or not. I'll try to duplicate the problem by simulating a direct seek to the picture with POC=285 - this is txwild2.264, right ?

We still can probably improve upon that, since it should be possible to use the previous frame_num, and avoid the gaps in frame_num altogether, but I need to do more testing since this could potentially have worse results (undetected gaps in frame_num values).

I'm not sure how creating/destroying the parser would affect the decode performance (?).

[attached file: nvcuvid_080910.zip]

Thanks for the update.

>Attached is a version that will properly reset PicOrderCntMsb when
>flushed, so the behavior should now be consistent when doing a decoder
>reset or not. I'll try to duplicate the problem by simulating a direct
>seek to the picture with POC=285 - this is txwild2.264, right?

Yes, I will try this too. No, we are talking about alba.264.

>We still can probably improve upon that, since it should be possible to
>use the previous frame_num, and avoid the gaps in frame_num altogether,
>but I need to do more testing since this could potentially have worse
>results (undetected gaps in frame_num values).

You lost me here. Improve upon what? Use the previous frame_num for what?

>I'm not sure how creating/destroying the parser would affect the decode
>performance (?).

I agree but I don't have the code to look at it, so I trust you. :-) I'll explore more tonight and let you know what I find out.

I did have another question for you. If you send EOS into the parser what does it do? How do you restart the stream if you've told the parser it's EOS. Can I seek by just stopping NALU injection, wait for callbacks to finish, ditch any picture callbacks, clear DisplayQueue, seek in file, inject SPS/PPS, and then start injecting? Just eliminate any decoder resetting at all?

> You lost me here. Improve upon what? Use the previous
> frame_num for what?

I was thinking of trying to detect the most likely value of PicOrderCntMsb when seeking to a non-IDR (instead of resetting everything to zero), but in theory this should not make any difference.

> If you send EOS into the parser what does it do? How do you restart
> the stream if you've told the parser it's EOS.

Sending EOS will do two things:

1. Empty the DPB (display all frames that have not yet been displayed)
2. Clear SPS/PPS and other internal state (PicOrderCntMsb etc)

> Can I seek by just stopping NALU injection, wait for callbacks to finish,
> ditch any picture callbacks, clear DisplayQueue, seek in file, inject SPS/PPS,
> and then start injecting? Just eliminate any decoder resetting at all?

No, because it would not flush any pictures that are pending for display. For example, in the case where the GOP starts with IBB, the display order would be BBI, so the I-frame will not be displayed until it is about to be evicted from the Decoded Picture Buffer, so if you send just the I-frame NALU and just stop injecting NALUs, you'll just get a DecodePicture callback but no DisplayPicture.

If you send a dummy EOS packet before stopping NALU injection, then everything should work as you just said.

You won't believe this!

So I was playing around with parser creation. I discovered that if you do it twice, before and after InitCuda(), then txwild2.264 decodes 10 fps faster. I expect you to be skeptical so I attach your original decoder version 2. Look for the conditionaled section with the extra parser create. Run with and without it and see the effect.

I'm thinking it has something to do with DX or D3D resources. The second create can't allocate some and so runs faster by eliminating some functionality.

But you're the expert.

OK, it's easy to demonstrate the problem with your code.

I attach your version 2. It has two conditional'ed sections:


1. This shows the 10 fps speed gain by creating two parsers, as I described in my last mail.

2. This seeks to an SPS (file offset 18849025 decimal) in alba.264 and then starts decoding. Break on the first decode picture callback. There will be a bad POC: -254 instead of the correct 258.

Note that the JM software makes the same error! So there must be some subtlety in the POC calculation that is being missed.

2 above has to be fixed or my application won't work. 1 above is just weird but welcome!

Your help would be greatly appreciated.

I think I know why the bad POC occurs.

The calculation for a frame involves PrevPicOrderCntLsb. If you have decoded the previous frame (as you would if playing linearly) then you will have a value in there. If you seek to the frame, you haven't decoded the previous frame and PrevPicOrderCntLsb is set to zero. So the calculation proceeds differently, and in fact tracing it shows that this is the cause. POCs can only be calculated correctly by decoding linearly from an IDR! You can get lucky, though, and get it right on some seeks. But on some it comes out wrong.

So what do we do? Allow seeking only on an IDR? What do we do with all the streams that have only recovery points and no IDRs? Or only I slices? The POC comparison strategy gets blown out by this.

I am open to any solutions you may think of. Can your timestamp field help?

I don't think this is an error. The -254 POC is actually correct.

What is happening is likely to be the following:

MaxPOC = 512
POCMsb = 0 (unknown)
POCLsb = 258
PrevPOC = 0 (unknown)
-> (POCLsb-PrevPOC > 512/2) so final POC = 258-512 = -254

This is because of the very large discontinuity in POC values (greater than MaxPicOrderCnt/2).

This shouldn't actually be a problem, since you can't count on POC values starting from zero when seeking in the middle of a GOP anyway. The frames with missing references that you want to drop will most likely have POC values that are less than -254 (-256 and -258 for the following two B-frames assuming a classic IBBP open GOP structure)

See my previous mail. I don't think there is a problem here, as long as you can deal with negative POC values.

You know the POC value of the first intra_frame that you're seeking to, and you should get callbacks in the following order:

DecodePicture(I_SLICE, POC=-254)
DecodePicture(B_SLICE, POC=-258) (bad pic, missing forward ref)
DecodePicture(B_SLICE, POC=-256) (bad pic, missing forward ref)
DisplayPicture(B_SLICE, POC=-258) -> reject this one
DisplayPicture(B_SLICE, POC=-256) -> reject this one
DisplayPicture(I_SLICE, POC=-254)

I think a possible solution would be:

1. Remember the POC value of the first intra frame decoded after seeking (intra_pic_flag != 0), in this case -254.
2. Ignore any DisplayPicture callbacks for pictures whose POC value is less than the first intra frame POC value.

Actually, maybe even simpler -> ignore any display callbacks until an intra frame is displayed:

bool MyIntraFrameFlag[MAX_FRM_CNT];

Towards the end of HandlePictureDecode:

if ((!pPicParams->field_pic_flag) || (!pPicParams->second_field))
    MyIntraFrameFlag[pPicParams->CurrPicIdx] = pPicParams->intra_pic_flag;

In HandlePictureDisplay:

if (FirstDisplayAfterSeek()
 && !MyIntraFrameFlag[pDispParams->picture_index])
    return 0; // ignore the display callback

That works great. Thank you so much.

Any comment on my multiple parser discovery? :-)

Everything is working great, except that I miss one small element. My GUI wants to display the frame type (I/P/B) for a displayed frame. But I cannot find any relevant information in the structures returned to the decode picture callback. Is there any way to get that information?

With H.264, every picture could in theory contain a mix of I/P/B slices, so it may not always be possible to be 100% accurate, though in practice, most streams use the same slice type for all slices in a picture.

The closest thing would be the primary_pic_type from the access_unit_delimiter NALU that is present in BD/AVCHD streams, but that's not always there -> I'll see if I can add this to the API.

Something that should also mostly work would be the use of the following flags:

intra_pic_flag (implies I)
ref_pic_flag (ref_pic_flag=0 usually means B, though with B-pyramid, this may not always be true)

I would need to add a backward_prediction_flag to the structure, which would allow you to distinguish between P and B frames. On the other hand, it's always possible for you to parse the beginning of the slice header, as part of the bitstream data (CUVIDPICPARAMS.pBitstreamData always points to the first slice of the picture)

Thanks, you scored a hit again! In my non-GPU version I parse the NALU before decoding and note the slice type of the slice for the top left macroblock. But with the GPU version I no longer have alignment between my NALU parser and decode results. But you say that the bitstream is available in CUVIDPICPARAMS, so I can parse that. So no API change needed.

Sorry to harp on this but did you have any comment on my multiple parser creation discovery? You can reproduce it with the code I sent and txwild2.264. I'm just leaving it that way, but if I can get a 10fps increase by doing it, maybe other users of the VP2 are sacrificing some performance.

I will send you a beta of my app this evening. I'm just revising the users manual.

I haven't had a chance to look into the perf issue yet, but it's definitely on my todo list, since it's just plain weird (the parser doesn't even touch d3d or any other gpu-related stuff that could explain this).

In the async host transfer, you have a loop with Sleep(). I assume, or hope, it is a DMA transfer. Is there any way to get a more efficient signaling of the completion of the transfer than by means of a Sleep() loop? I am seeing high kernel utilization during decoding and want to eliminate as many system calls as possible.

Sleep(1) should make the thread go idle (will not cause any high utilization), it was meant to cause the least amount of system utilization, but there might be a few places in cuda that are missing cpu-friendly waits, especially when displaying other stuff through GDI. Do you also see high utilization when not displaying anything on screen ?

Thank you for your fast response. I will collect some more data this evening to answer your question.

Btw, I finally had time to look into the perf differences when creating the parser twice, but I'm not able to reproduce the problem locally (I used your modified source), I'm seeing the same perf in both cases. However, I am still seeing some minor perf differences (+/-5%) with the threaded decoder (should be the same perf), so I suspect there might be some timing discrepancies, maybe related to d3d (?).

Hmm, that's very strange.

I'll look some more at it, but the effect is easily re-creatable for me.

On another matter, I'd appreciate your advice. I believe that if I have different processes they can each initialize and start a CUDA session and they can both do decoding.

Here's the problem I have. The processes each use an Avisynth DLL that contains the decoder. The filter is coded in C++ objects so that it can be instantiated multiple times, and in fact some popular transcoding GUIs open the script several times, e.g., once for a preview and again for transcoding. That instantiates the filter twice.

So, it's easy to put the decoder global variables (Session, etc.) into member variables of the filter. But I don't know what to do about the callback functions. I can't make them member functions because they can't be linked to from the CUDA driver.

Can you suggest a solution that will allow my C++ coded filter instantiations to instantiate an independent decoding session?

Further to my last email...

I'm thinking I can put the this-> pointer in the pvUserData structure when I instantiate the parser. Then I can read it in the callback and invoke the member function that actually does the callback work, passing it the this-> pointer. Does that sound feasible? That assumes I can have the Session in a member variable and pass a pointer to that to CUDA.

Exactly: the pvUserData is intended for you to pass the this pointer of your class (or anything that would uniquely identify the callback with its context), and get it back as the first parameter of every callback.

(I probably should add a more detailed description in the header file)

Your callback probably can't be a member variable (has to be static), but you can easily work around that.

class CMyDecoder
{
    int OnDecodePicture(CUVIDPICPARAMS *pPicParams); // actual callback
    static int CUDAAPI global_decode_callback(void *pvUserData, CUVIDPICPARAMS *pPicParams);
};

int CUDAAPI CMyDecoder::global_decode_callback(void *pvUserData, CUVIDPICPARAMS *pPicParams)
{
    CMyDecoder *that = (CMyDecoder *)pvUserData;
    return that->OnDecodePicture(pPicParams);
}

Yes, I had already figured out the callback wrapper trick, but thanks for mentioning it.

Everything works great. It didn't for a while because I had given names to my Windows events, which caused them to collide across instances. I changed all the names to NULL, et voila, everything is fine.

Whoever thought of creating the pvUserData, bravo!

We had the idea of making the CUDA deinterlacer interpolate either the top or the bottom field. Then by interleaving the results of a decode with top and the results of a decode with bottom, we can get a nice bobber (double rate deinterlacer). The Avisynth script looks like this:

a=AVCSource("euro1080.hd5.dga",deinterlace=true,use_top_field=true) # 25 fps interlaced
b=AVCSource("euro1080.hd5.dga",deinterlace=true,use_top_field=false) # 25 fps interlaced
return Interleave(a,b) # 50 fps progressive

The use_top_field parameter controls the vpp.second_field variable in DisplayPicture(). But we get the same results for both filters. It appears that we can't control the deinterlacer in Adaptive mode like we wish.

Is there a solution? Is this a CUDA bug? My application cannot easily be modified to receive twice the number of frames from the decoder so we need to do it as above.

Yup. This should work on Vista, but currently doesn't really work on XP, because of the way VMR9 works. When I deliver a frame to VMR9 on XP, it is actually deinterlacing twice, and I'm throwing the second frame away.

It should actually be fairly easy to make this work on XP (same as the default display behavior), by caching the second frame - I'll take a look at this tomorrow.

Btw, I think you should be able to hack this together by inverting the value of top_field_first, and keeping the same value of second_field.

Note that when skipping frames or fields, it may make it more difficult for the deinterlacer to keep track of moving objects (the default deinterlacer always assumes a continuous display of both fields).

The workaround appears to work. It would be nice to have it working the proper way though.

On another matter, I have some users saying that they sporadically get "ERROR: Failed to find CUDA-compatible D3D device(2)" and they have to reboot to get things working correctly again. Is there any understanding of this and possibly some solution?

> The workaround appears to work. It would be nice to have it working the
> proper way though.

I think I have a potential fix (see attached updated dll).

> On another matter, I have some users saying that they sporadically get
> "ERROR: Failed to find CUDA-compatible D3D device(2)" and they have to reboot
> to get things working correctly again. Is there any understanding of this
> and possibly some solution?

I'm pretty sure this is a problem that always occurred on the CUDA 2.0 Beta1 driver (177.84 iirc), but should be fixed with the latest drivers (could also occur if there is a version mismatch between cuda & d3d). I'm pretty sure the official 178.xx WHQL drivers should be coming out very soon, and should resolve these types of issues.

Btw, nvcuda.dll and nvapi.dll should always be automatically installed along with the driver - so it's probably not a good idea to copy these manually (can cause system instability if there are version mismatches). On the other hand, nvcuvid.dll is an independent dll that doesn't have specific driver dependencies, so you can safely copy that one in any directory.

[attached file: nvcuvid_080923.zip]

I have released those new DLLs and will let you know if there is any adverse feedback.

I'm planning soon to send you some more clips that appear to have problems when decoded with CUDA.

Cool. Let me know when you have a pointer to the clips (hopefully it's just num_ref_frame level 4.1 violations that will be fixed with the upcoming driver)

I'm still doing more testing, but I also have a new version of nvcuvid with a few minor improvements:

- VC1 support
- Automatic removal of bad frames (the parser will examine the actual reference picture list and automatically skip the decode/display callbacks for frames that would otherwise contain severe corruption when starting to decode from non-IDR frames, so you no longer need to fiddle around with POC values to detect if a frame should be displayed or not)

I have uploaded the clips to the same place:

user: guest@neuron2.net
pwd: xxxxxxx
dir: xxxxxx

Let's start with the scantily clad ladies in cheer.264. It just macroblocks horribly. Claims to be Main Profile Level 3.2.

I'm REALLY REALLY REALLY interested in VC1 support. Please keep me informed!

The bobbing hack has some serious problems. Can we get it working correctly?

Here is a message from one of my users about it:


I'm copying the clips right now. From the name alone, I can already tell that "LossLessTouhou.h264" is not Main/High profile, so will not be supported by VP2 HW (along with any other 4:2:2/4:4:4 content) - That reminds me, I should modify the parser to reject these streams in the first place, rather than generate bogus picture parameters.

For the bobbing hack, it's a bit tricky. The latest version I sent should in theory work on XP, but will most likely not work on Vista. There is actually no easy way to make this work properly in all cases, because the whole thing is a bit of a hack right now:

1- Because we currently rely on default DirectX behavior, tweaked for realtime performance, there is no guarantee for the type of deinterlacing that will be used (to the contrary, I can guarantee that it's not the best deinterlacing mode available on the GPU)

2- On XP, cuvidMapVideoFrame will actually always deinterlace to 60p internally, but subsequently drop one of the frames. When skipping fields or calling cuvidMapVideoFrame for the opposite field, this may destroy any temporal information the deinterlacer might have had (the deinterlacer might see this as a discontinuity)

3- The deinterlacer might need a few frames to settle, ie: use a much simpler spatial deinterlacing for the first few frames after a perceived discontinuity (acceptable for display, but more problematic for frame-serving)

Basically, the deinterlacing is here as a convenience for display, but we really intended this to be bypassed for frame-serving type of scenarios, since the whole point of NVCUVID is really to allow custom post-processing using cuda. The only solid way of fixing this would be to move away from the MS D3D layers so we have much more flexibility in what we can provide (and also offer cross-platform at the same time).

That is not to say that there is no solution, but I think the safest way to make this work would be to use a single stream and consecutively call cuvidMapVideoFrame twice for every picture (output 60p directly from a single source, rather than using 2 sources + interleave). This would match what the renderer does during normal playback (less likely to run into problems) -> I need to make a few more changes to nvcuvid to get this to work properly, though.

I took a look at the streams, but didn't find anything unusual:

cheer.264: not compliant or missing many references (looks like encoder bug). Behavior is as expected, seeing the same result with JM and other SW decoders.

lux.hd.264: no visible corruption here (problems might have been due to the stream not starting on an I-frame)

LossLessTouhou.h264: lossless coding not supported (beyond High Profile)

MountainofFaith_track1.h264: verified that output is identical to reference decoder (just looks like high quantization to me)

video006.264: level 5.1 (MaxDPB exceeds level 4.1 limits) plays fine with 178.13 driver released today

stargate.ts: Got lots of warnings & errors from DGAVCIndexNV. Looks like this might be a transport stream with missing packets

Thank you for looking at the streams. I think you may be missing some things and I will send another email shortly about it.

But for now, there appears to be some problem with CUDA when running with remote desktop (also with radmin). For example, if I use remote desktop to connect to my machine and then start DGAVCIndexNV, I get CUDA init failure 100. Or if I have a transcode using DGAVCDecodeNV running and log off remote desktop, the encode hangs. Is there any solution?

I think that's unfortunately an OS limitation because the display driver gets unloaded with remote desktop operation. The exception would be VNC, which is more gpu-friendly in that regard.

>not compliant or missing many references (looks like encoder bug).
>Behavior is as expected, seeing the same result with JM and other SW

The problem here is that libavcodec (e.g., VLC player) and CoreAVC play this fine without macroblocking.

>no visible corruption here (problems might have been due to the stream
>not starting on an I-frame)

The problem I refer to is the single frame corruption right at the first scene change. This does not happen with libavcodec or CoreAVC.

>lossless coding not supported (beyond High Profile)


>verified that output is identical to reference decoder (just looks like
>high quantization to me)

The problem here is the macroblocking that does not occur with libavcodec or CoreAVC. See the attached JPG. The blocking is visible on the left side about half way down.

>video006.264: level 5.1 (MaxDPB exceeds level 4.1 limits)
>plays fine with 178.13 driver released today

OK. Good news!

>Got lots of warnings & errors from DGAVCIndexNV. Looks like this might
>be a transport stream with missing packets

Yes, there is corruption. But the CUDA decoder never recovers. Play it in DGAVCIndexNV from the start and you will see. If you move the cursor past the bad point and then F6 you can see that the data is good after the bad point. libavcodec and CoreAVC both recover after the bad data.

> cheer.264:
> The problem here is that libavcodec (e.g., VLC player) and CoreAVC
> play this fine without macroblocking.

The problem is that decoding this broken stream properly requires non-compliant DPB management, and may very well break with good (though unusual) streams that use long-term references (complex custom reordering).

I'll take a closer look anyway when I have some time, since it's likely that this comes from an encoder bug in the frame_num values of non-reference pictures (something that can only happen in a good stream with frame drops), and it's likely that there is a way to make this play properly with little or no impact on recovery with compliant streams.

> lux.hd.264: > The problem I refer to is the single frame corruption right at the
> first scene change. This does not happen with libavcodec or CoreAVC.

I haven't seen this here, so it is likely due to displaying frames with missing references (the newer parser will automatically skip these frames by default, similar to what libavcodec is already doing)

> MountainofFaith_track1.h264:
> See the attached JPG. The blocking is visible on the left side
> about half way down.

Again, I didn't see any visible corruption with the first 1000 frames (unlike the attached screenshot). Does this occur right at the beginning of the stream?

> stargate.ts:
> the bad point and then F6 you can see that the data is good after the
> bad point. libavcodec and CoreAVC both recover after the bad data.

I'll take a closer look. I've been spending most of my time trying to resolve the issue with doing deinterlacing at 2x the frame rate. Seems to be working fine on XP, but I still need to verify that everything is ok on Vista.

Attached is the most recent nvcuvid.dll, in case you want to try it out.

[attached file: deleted -- see next message]

Please ignore my previous nvcuvid binary (contained printfs in a few places). Attached is an updated version. This version also resolves the corruption with cheer.264, where the encoder incorrectly incremented frame_num for non-reference pictures.

[attached file: nvcuvid_080927.zip]

Thank you for that! I confirm that cheer.264 works now. I understand that the stream is wrong, but if such things can be worked around in a sound way it is a good thing, especially since your decoder will naturally be compared to other software decoders.

I have a question for you about 64-bit support. I see that the ZIP you sent includes nvcuvid64.dll. When is that to be used? Are there also 64-bit versions of the video drivers (including CUDA)? I ask because I have a user who reports as follows. I believe he has the nvcuvid64.dll in place. Given that my app is 32-bit, should it use the nvcuvid.dll instead?

I have Windows 2008 (server version of windows vista).
It should by almost all means be considered the same
as a windows vista64.

Using driver 178.13, which is the latest I believe.

I've tried a few cuda demos from the sdk and they worked.

I'm getting the following error:

GPU decoder: Failed to create video decoder

That is put out by my app when cuvidCreateDecoder() in HandleVideoSequence() returns an error.

Any advice you may have about that would be greatly appreciated.

I will have a few more comments on the remaining bad streams to come later.

Agreed. In that particular case, I don't think there are any negative side effects from the workaround, so it's fine with me.

Nvcuvid64.dll should only be used if you're building a 64-bit application (will fail to load otherwise). A 32-bit app using nvcuvid.dll (32-bit) should run just fine on a 64-bit OS.

> Any advice you may have about that would be greatly appreciated.

Do you know what the error code returned is ? I'll try to add more instrumentation in nvcuvid - if the SDK samples work, it shouldn't be too difficult to track the problem down.

Thank you. I will tell the user and see what happens.

How does the right DLL get loaded if they both exist?

They have different names, so if your app links to nvcuvid.lib, it will always load nvcuvid.dll (32-bit app), and if your app links to nvcuvid64.lib, it will always load nvcuvid64 (native x64 app).

For other DLLs (nvcuda.dll), the 64-bit version is in windows\system32, and the 32-bit version in windows\syswow64 (the OS will automatically load the correct one depending on the process type).

Another update. This should fix the problem with stargate.ts.

This issue was due to the lack of high-level error resilience for streams using adaptive reference picture marking (This was already on my todo list).

[attached file: nvcuvid_0809271.zip]

That is awesome. Thank you so much.

I just tested it and the behavior is the same. Playing it in DGAVCIndexNV fails and never recovers. If you like I can give you the latest version of DGAVCIndexNV but it should be the same in both.

On the other issue I am waiting for the user to run the debug version I sent him to get the error code.

Yup. I've noticed this as well. I'm afraid a lot of this is due to the very limited VP2 error resilience support in the current drivers - I suspect there are some types of bitstream errors that may cause the loss of multiple consecutive decoded frames (including keyframes in this case).

There might not be much I can do in the parser to prevent this from happening, but I'll try to investigate this more in depth.

Btw, I am not seeing any kind of corruption when decoding the stream with cudecode, so I suspect the root cause is in DGAVCDecNV's demultiplexer.

Let me know if you want me to send you the correct demultiplexed elementary stream to compare against.

It appears that you are correct again. I demuxed it using Elecard Xmuxer Pro and then played the demuxed stream and it was fine. Then I demuxed it using DGAVCIndexNV and played that and it failed. So I'll look at my demultiplexing code.

There was a transport packet that carried (only) PTS and DTS timestamps, and the DTS was split between that packet and the next packet, because the packet had a lot of stuffing before the PES header. I've never seen a PES header split like that and I'm not sure it's legal. Why would they put all that stuffing before the PES header? Anyway, I added handling for that and all is well now.

As you know I support multiple instantiation of the CUDA decoder. Some users are having an issue with it and I'm wondering if the problem is the size of memory on the video card. As an experiment, I tried making instances on my 8500GT w/1024 Mbytes memory. I could make 4 but the 5th one failed to create the decoder in the DecodePicture callback.

Do you know what the memory requirement per instance is?

It might very well be fairly high (probably at least 64MB per instance), especially since we currently rely on VMR9, which creates a bunch of unnecessary RGB surfaces. This is especially true if not re-using the same D3D device for all instances.

Got a serious issue that is pissing off users.

They want to use MEGUI, which is a popular transcoding app. I'm not sure what it is doing internally, but I know it is opening several instances of my GPU filter, which thus opens several decoding instances. What happens is that on the second instantiation g_pD3D->CreateDevice() crashes, i.e., in that function it takes an exception writing to an illegal address.

Are you aware of anything that could cause this? Maybe there is a solution? The parameters to the call look correct but it just crashes.

Is it possible that the pointer was released somehow ?

What happens if you print the refcount of the object, ie:

int refcount = g_pD3D->AddRef();
printf("refcount=%d\n", refcount);

If the d3d call crashes instead of returning a failure, I would think that it's related to a reference count issue. If AddRef() crashes in the code above, it would also confirm that this is the case. Also, what window handle do you give in the creation parameters, and could that window have been destroyed while the d3d object was still in use ?

It happens in InitCuda() for the second instance (see below). The pointer is just created there so I don't see how it could be a refcount issue. The Direct2DCreate9() call returns a good pointer and then g_pD3D->CreateDevice() coming right after fails. The window handle is that of the desktop so I don't see how it could be destroyed.

I put your AddRef() in between the two calls and it showed a refcount of 2 as expected.

    // Create an instance of Direct3D.
    g_pD3D = Direct3DCreate9(D3D_SDK_VERSION);
    if (g_pD3D == NULL)
    {
        OutputDebugString("ERROR: failed to create D3D9\n");
        return false;
    }

    lAdapterCount = g_pD3D->GetAdapterCount();
    for (lAdapter = 0; lAdapter < lAdapterCount; lAdapter++)
    {
        g_pD3D->CreateDevice(lAdapter,

The sample code I sent before is definitely NOT multi-instance capable: if you want to keep the g_pD3D as a global and share it across multiple decoder instances, you also need to keep track of the instance count, so that g_pD3D is only destroyed when the last decoder is destroyed.

Alternatively, you can just make InitCuda() a class member of your top-level decoder class and use m_pD3D/m_pD3DDev as class members, so that each instance gets its own d3d device.
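The first option (one shared g_pD3D, destroyed with the last decoder) can be sketched as follows; the names are mine, and a plain struct stands in for the real IDirect3D9 object:

```cpp
#include <cassert>

// Stand-in for IDirect3D9 (assumption): in the real code, Acquire() would
// call Direct3DCreate9() and Release() would call g_pD3D->Release().
struct SharedD3D {
    static SharedD3D *g_instance;
    static int        g_refs;

    static SharedD3D *Acquire() {
        if (g_refs++ == 0)
            g_instance = new SharedD3D(); // first decoder instance creates it
        return g_instance;
    }
    static void Release() {
        if (--g_refs == 0) {              // last decoder instance destroys it
            delete g_instance;
            g_instance = nullptr;
        }
    }
};
SharedD3D *SharedD3D::g_instance = nullptr;
int        SharedD3D::g_refs = 0;
```

Each decoder constructor calls Acquire() and each destructor calls Release(); if instances can be created from multiple threads, the counter would additionally need a lock.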

>Alternatively, you can just make InitCuda() a class member of your
>top-level decoder class and use m_pD3D/m_pD3DDev as class members, so
>that each instance gets its own d3d device.

That's exactly what I am doing! But the second instance is failing as I described.

This happens only for a few applications that invoke my class, while others are fine. I don't know what they are doing inside that could cause this. The failure happens within InitCuda(). They are managed applications, however, maybe that is a clue.

Any ideas?

I think if you want to share the D3D device, it means that you also have to share the cuda context as well, ie: using floating contexts.

NVCUVID includes some helpers to manage floating contexts -> see cuvidCreateContextLock().

I have to go right now, but I'll send you a modified sample later today.

I'm thinking now of trying the first approach you described because I know the first instantiation of D3D always works. Will the decoder instances still be able to operate independently if they share a single g_pD3D? I've no real understanding of the D3D dependency.

Attached is the modified cuh264dec that can now use floating contexts (and could share a single d3d/cuda context between multiple instances)

Floating contexts can be enabled with:

#define USE_FLOATING_CONTEXTS 1 // Use floating contexts

Aside from the creation/destruction of the context lock object and the switch to floating contexts, the code is essentially identical, with the simple addition of a C++ auto-lock object in the functions that call cuda:

// Autolock for floating contexts
class CAutoCtxLock
{
    CUvideoctxlock m_lock;
public:
    CAutoCtxLock(CUvideoctxlock lck) { m_lock = lck; cuvidCtxLock(m_lock, 0); }
    ~CAutoCtxLock() { cuvidCtxUnlock(m_lock, 0); }
};

All functions that call cuda now have a simple local context lock object defined as:

CAutoCtxLock lck(state->cuCtxLock);

[attached file: cuh264dec[4].zip]

I tried keeping the D3D stuff to a single instance but the create context fails on the second instantiation with CUDA_ERROR_UNKNOWN. So it looks like this approach won't work.

The memory utilization is currently not very efficient, but it depends a lot on what the decoder does. All 256MB boards should be able to support two decoder instances for HD H.264 (maybe 3), but boards with only 128MB of memory will probably not support more than 1 instance (these should be very rare though, since most board manufacturers use at least 256MB).

For the worst-case, ie: 1920x1088 H.264, the memory utilization per instance should be as follows:

Misc overhead: ~40MB (VMR & DXVA1/2 misc stuff, internal decoder state)

Not sure how much the D3D device and cuda uses (could be anywhere between 2 and 10MB). In your case, you can stick with ulNumOutputSurfaces=1, since you're only using a single surface temporarily to transfer it to system memory. ulNumDecodeSurfaces can be limited to something like min(sps.num_ref_frames + 4, 16) -> no need to allocate 20 surfaces for HD if you know it's level 4.1.
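The surface-count suggestion above is just a clamp; a trivial sketch (the function name is mine):

```cpp
#include <algorithm>
#include <cassert>

// Pick ulNumDecodeSurfaces from the SPS: the stream's reference frames plus
// a little headroom for decode/display overlap, capped at the 16-frame DPB
// maximum, rather than always allocating 20 HD surfaces.
static int ChooseDecodeSurfaces(int num_ref_frames) {
    return std::min(num_ref_frames + 4, 16);
}
```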

I successfully ran your floating context example. Thank you for that. To test it I just duplicated the init in main().

As you use globals to store the g_pD3D, etc., the solution is limited to a single process. I am thinking that to support instantiation from different processes, I can store these globals in shared memory and retrieve them into the process space as needed. Do you think it is a workable solution?

I don't think you can share a D3D object or cuda context across multiple processes (different address spaces).

The only way this would work would be to have a 'server' process that shares the output YV12 data (in a memory-mapped region)

Hmm, OK. Suppose I make a COM object to act as a server and do the decoding. Can I pass a shared memory pointer to cuMemcpyDtoHAsync() or do I have to do some other fun and games to have CUDA write to shared memory, so as to avoid an extra frame copy?

I'm pretty sure that if you pass a generic pointer to cuMemcpy functions, it will end up doing a cpu copy. If you're still using IYUV as your output format, the best solution would be to put the IYUV frame in shared memory, and use the cpu to perform the conversion and frame copy in one step.
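Assuming the mapped decoder surface is NV12 (luma plane plus an interleaved UV plane), that one-step conversion-plus-copy amounts to de-interleaving the chroma while writing into the shared-memory destination. A rough sketch (the function name is mine; no SIMD, and even dimensions are assumed):

```cpp
#include <cassert>
#include <cstring>

// Copy an NV12 frame ('pitch' bytes per source row) into a tightly-packed
// IYUV/I420 destination, splitting the interleaved U/V samples as we go.
static void Nv12ToIyuv(const unsigned char *src, int pitch,
                       unsigned char *dst, int width, int height) {
    for (int y = 0; y < height; y++)                // luma: straight copy
        std::memcpy(dst + y * width, src + y * pitch, width);
    const unsigned char *uv = src + pitch * height; // interleaved UV rows
    unsigned char *u = dst + width * height;
    unsigned char *v = u + (width / 2) * (height / 2);
    for (int y = 0; y < height / 2; y++) {
        const unsigned char *row = uv + y * pitch;
        for (int x = 0; x < width / 2; x++) {
            *u++ = row[2 * x];                      // U sample
            *v++ = row[2 * x + 1];                  // V sample
        }
    }
}
```

With dst pointing into the memory-mapped region, the client process sees a ready-to-use planar frame with no second copy.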

Attached is the latest version of nvcuvid.dll. 60p deinterlacing should now work properly, though the actual deinterlacing type still depends on OS, GPU & drivers (not controlled by nvcuvid). It also includes a fix for a possible COM reference count issue when destroying a decoder object.

For the multi-instance issue with managed applications, I still have no idea what could be causing this (I'll try to look around see if I can find any gotchas with D3D+managed code), but here are a few random thoughts (might be easier to root cause what is going on):

1. Maybe calling CoInitializeEx(NULL, COINIT_MULTITHREADED) in your application before doing anything else (?)

2. Playing around with the window handle in the d3d creation parameters (maybe create your own dummy window, or use the desktop window handle): that's the only thing I can think of where a call from a managed application could influence d3d creation behavior (you probably don't want to pass in a window handle that has been created by the managed code).

[attached file: nvcuvid_081004.zip]

Sorry, I've been distracted by a bad bug that I had to solve first before anything else. It turns out to be a regression in nvcuvid.dll and I need your help with it, please.

With the nvcuvid.dll dated 9-23 I push my stream through, detect EOF, and do a flush. When I trace the flush I can see it outputs 3 more frames, and I see this trace of POCs at HandleDecode:

DG: poc 492 0
DG: poc 0 493
DG: poc 490 0
DG: poc 0 491
DG: poc 494 0
DG: poc 0 495
DG: poc 504 0
DG: poc 0 505
DG: poc 500 0
DG: poc 0 501
DG: poc 498 0
DG: poc 0 499
DG: poc 502 0
DG: poc 0 503
DG: poc 512 0
DG: poc 0 513
DG: poc 508 0
DG: poc 0 509
DG: poc 506 0
DG: poc 0 507
DG: poc 510 0
DG: poc 0 511
DG: poc 516 0
DG: poc 0 517
DG: poc 514 0
DG: poc 0 515
DG: poc 518 0
DG: poc 0 519

Now, with nvcuvid.dll dated 9-27 and after, I do everything the same but when I trace the flush, no additional frames come out and I see this trace of POCs, which has three frames too few [(519-513)/2]:

DG: poc 496 0
DG: poc 0 497
DG: poc 492 0
DG: poc 0 493
DG: poc 490 0
DG: poc 0 491
DG: poc 494 0
DG: poc 0 495
DG: poc 504 0
DG: poc 0 505
DG: poc 500 0
DG: poc 0 501
DG: poc 498 0
DG: poc 0 499
DG: poc 502 0
DG: poc 0 503
DG: poc 512 0
DG: poc 0 513
DG: poc 508 0
DG: poc 0 509
DG: poc 506 0
DG: poc 0 507
DG: poc 510 0
DG: poc 0 511

This new behavior is a big problem for me. First, I don't know why you aren't flushing out the frames. But worse, my indexing application parses NALUs to count the frames; it doesn't decode the stream and count the frames that your decoder decodes. So, my frame server DLL ends up trying to serve extra frames at the end that the decoder cannot generate, due to the flush no longer working.

For a solution, I'd prefer that the flush work properly. Just as at the beginning we deliver frames whether they are good ones or not, we should do the same at the end. If that cannot be done, then I need a way to tell from the POCs which frames you will not deliver at the end.

I hope that is clear. And that you can help me. :-)

BTW, I tried both your suggestions for the multiple instantiation but they didn't appear to help. I'll return to all that after I fix this issue.

Can you try with setting the ulErrorThreshold value to 100, and see if the problem persists ? (updated header file attached if needed)

Sounds like I might have broken something in the parser - I'll double-check on my end.

[attached file: nvcuvid.h]

Setting ulErrorThreshold = 100 did not have any effect. It still misses the last 3 frames.

I'll take a closer look - which stream is this?

The stream is sample.264 here:

usr: guest@neuron2.net
pwd: xxxx
dir: xxxx

As I mentioned, I get 358 frames with 9-23-08 nvcuvid.dll and only 355 with subsequent versions. The flush doesn't flush anything.

Note that there is an error in the stream, but it appears inconsequential (demuxed from errored transport stream).

I seem to be getting 358 frames with cudecode when using the latest dll (same as with the old dll, but only 355 when using the 9/27 version).

Attached is the latest version of nvcuvid. I must have temporarily messed up something in the versions around 9/27, probably as I was implementing the reference frame error detection.

[attached file: nvcuvid_081007.zip]

Oh, goodie!

It works fine on my system here at work where the 9-27 version did not. I'll test on my main system at home tonight but it certainly appears to be "A-OK".

Thank you!

BTW, I have had quite a few people tell me how impressed they are with the support they see from Nvidia in my published dialog. I think you're following the Doom9 thread, because you mentioned managed applications, which I think I only mentioned there. :)

Btw, I found a major surface index management bug in the parser that may explain the random perf issue when re-creating the parser, as well as sub-par decode performance.

Essentially, the parser was delaying output of frames much more than it needs to. I have to be a bit careful when fixing this, since there are quite a few non-compliant streams out there encoded with x264 that would cause a strict decoder to output frames in the incorrect order (essentially, due to a bug in x264, re-ordered non-reference B-frames would cause a stream to go beyond level limits; the issue was discussed extensively over at doom9).

I'm hoping to have an update shortly (hopefully by tomorrow).

Interesting. It'll be nice to have an explanation for that multiple parser mystery.

I'm working on another EOF case where it seems that not all the frames get flushed. It could be in my code though, because it appears to not happen with DGAVCIndex but only with DGAVCDecode. The latter uses some event signaling that may be getting deadlocked. Still investigating...

Ok. Here is the latest version with the perf weirdness fix.

Now that I know what the problem was, it makes complete sense: there was an uninitialized variable that keeps track of the oldest displayed frame buffers indices, so it had random values when the parser was created.

If the uninitialized values happened to be large positive numbers, the buffer index would end up not being used at all, reducing the amount of buffering between decode and display, and potentially serializing decode and display operations.

This was completely random, so it depends on many factors (probably why I couldn't reproduce the multiple parser perf issue here, but I was able to reproduce the reverse problem in a different application, where the 2nd decoder instance was much slower than the first).

[attached file: nvcuvid_081008.zip]

Oh, that's awesome. It means we'll get best performance without having to rely on luck and tricks. I'll test it when I get home.

You'll be interested perhaps to hear the solution to my problem I just fixed. The stream was sending a lump of 8 PPSs every so often. When I seek, though, I was just injecting the previous 5 known PPSs, so the decoder was silently discarding the frames with orphaned PPSs, and when I reached the end of the stream, my app thought there were more frames to come but your decoder couldn't provide any more. I have to rethink my heuristics for SPS/PPS injection on a seek. For now I've bumped it to inject the last 10 prior to the seek point. I should be able to think of something a little more intelligent however, as I do have the stream indexed at that point and know about all the PPSs.

I spent two days being confused because I thought the problem was caused by end of stream handling, which looked fine, when all along it was caused by start of stream handling!

Please use this attached version instead: it fixes a D3D device object leak on XP with previous versions of nvcuvid.

[attached file: nvcuvid_081008_2.zip]

I made a new release with the latest DLL and my fixes. Everything seems OK. Thank you for your support.

I've been working on my server solution. To get the frames I have to go through this:

Server process:
1. Copy device memory to host memory using cuMemcpyDtoHAsync().
2. Copy host memory to shared memory with YV12 conversion.

Client process:
3. Copy shared memory to client memory.

So, the shared memory communication between processes forces an extra frame copy. Can you think of any way to improve this situation?

I realized I could use cuMemcpyDtoH() to copy straight to the shared memory. It seems to be working fine. I'll let the client receive NV12 and do any required conversion.

I just wanted to give you some feedback on the NV12 to YUY2 conversion you sent me. It works great!

I combined it with a YUY2 to RGB24 conversion based on a lookup table and I have a really fast NV12 to RGB24 conversion now.
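A repack along these lines can be sketched as follows (an illustrative conversion written by me, not the code Nvidia sent). NV12 is 4:2:0, a full Y plane followed by an interleaved UV plane at half vertical resolution; YUY2 is packed 4:2:2 (Y0 U Y1 V), so each NV12 chroma row is reused for two output rows. The sketch assumes even width and height and tightly packed planes (no pitch padding).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative NV12 -> YUY2 repack (not the conversion code referenced above).
static void Nv12ToYuy2(const uint8_t *nv12, uint8_t *yuy2,
                       size_t width, size_t height)
{
    const uint8_t *yplane = nv12;
    const uint8_t *uvplane = nv12 + width * height;  // UV follows the Y plane
    for (size_t row = 0; row < height; row++) {
        const uint8_t *y = yplane + row * width;
        const uint8_t *uv = uvplane + (row / 2) * width;  // shared by 2 luma rows
        uint8_t *out = yuy2 + row * width * 2;
        for (size_t x = 0; x < width; x += 2) {
            out[2 * x + 0] = y[x];       // Y0
            out[2 * x + 1] = uv[x];      // U
            out[2 * x + 2] = y[x + 1];   // Y1
            out[2 * x + 3] = uv[x + 1];  // V
        }
    }
}
```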

I have my server coded and also a primitive client. Together they prove that it all works and using separate processes doesn't lose performance.

Now I will implement the client side in my applications and see if the problem with multi instance is thereby solved (it should be).

Cool. I'm still surprised that we need to go to an independent process in order to get multi-instance working (I would think that sharing the d3d object would make this possible for multiple instances within a single process - multiple instances in multiple processes shouldn't be an issue since there should be zero state shared between multiple processes).

Well, it would be great if I could make it work that way.

The problem is that when I make the D3D handles global in my Avisynth DLL, I still see them as NULL when I get to the second filter instantiation. That suggests to me that MEGUI loads another instance of the DLL each time it opens the Avisynth script and zeros the handles.

So I tried manually setting the handles to the ones generated on the first instance, but when I did that the cuD3D9CtxCreate() call crashed.

So without source code I've no hope of figuring it out.

It probably means it's a separate process then. Interesting that D3D9Create crashes even though the application is in an entirely different process (shouldn't be aware of other app's d3d usage)

I wasn't sharing the globals properly. They had to be static member variables. I confirmed it is one process.

So now on the second instantiation, I see the D3D variables already set and skip the D3D instantiation. But when I do the CUDA context create, passing the existing g_pD3Dev, it fails with CUDA_ERROR_UNKNOWN.

Any ideas?

You should be able to bypass the cuda context creation as well (using the same cuda context for all instances). The only thing required is to properly manage the floating contexts, which should be fairly easy using cuvidCreateContextLock (Take a look at the modified version of cuh264dec that has a USE_FLOATING_CONTEXTS option - let me know if you want me to resend it)
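The floating-context pattern can be sketched like this (a stand-in type of my own; the real API creates a CUvideoctxlock with cuvidCreateContextLock and acquires/releases it with cuvidCtxLock/cuvidCtxUnlock). The key point is that one lock handle is created once and shared by every decode session using the same CUDA context:

```cpp
#include <cassert>

// FakeCtxLock stands in for a CUvideoctxlock handle (assumption, not the API).
struct FakeCtxLock { int depth = 0; };

// RAII helper in the spirit of the CAutoCtxLock class mentioned above.
class AutoCtxLock {
    FakeCtxLock &m_lock;
public:
    explicit AutoCtxLock(FakeCtxLock &l) : m_lock(l) { ++m_lock.depth; } // cuvidCtxLock
    ~AutoCtxLock() { --m_lock.depth; }                                   // cuvidCtxUnlock
    AutoCtxLock(const AutoCtxLock &) = delete;
    AutoCtxLock &operator=(const AutoCtxLock &) = delete;
};
```

As with a critical section, the guard makes the lock/unlock pairing automatic across early returns and exceptions.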

OK, I'll try that tonight. I still have the code but thanks for the resend offer.

I'm confused about the floating context stuff.

Recall that I have multiple Avisynth filter instances. As coded now, each one has its own DecodeSession structure, D3D, etc. I see that you have put the lock in the DecodeSession. But I was thinking you have to have a common lock. So does that mean I need to share the DecodeSession as well? That can't be right because the parser is in there. And should the CAutoCtxLock class be instantiated for each instance?

So, I see what you suggest in theory, but don't know how to implement it in detail.

That's correct: the lock handle should be the same, shared between all sessions. IIRC in the sample, I believe the lock was created only once, even though it was stored along with every DecodeSession state (think of the lock handle as a simple event handle).

The CAutoCtxLock is just a helper to call lock/unlock (similar to the way a critical section would work, ie: CAutoLock in DirectShow base classes)

Gosh, it seems to be working. I was able to encode a stream with MEGUI.

MEGUI is really goofy though. He opens the filter instance 7 times! Sometimes he opens and closes it so fast that the filter destructor gets called while CUDA is still opening! I had to put a delay in the destructor to work around it.

I have two questions for you. First, I note that you do NOT set vidLock when you create the decoder as you suggested to do earlier. What is the reason for that?

Second, if I exit MEGUI while an instance is waiting on a lock I get a crash. Any advice about that, given that I can't control what MEGUI does on termination?

The lock is really a simple critical section. If you set it in the decode parameters, it will automatically acquire the lock when calling cuvidMap/Unmap, but since the app is acquiring the lock by itself prior to calling these functions, it's not really necessary.

For the crash when exiting MEGUI: is it possible that you're destroying the lock while another thread is waiting on the lock ? If so, one way to solve this would be acquire the lock before destroying it.

I've been thinking and trying to decide which architecture to go with and in the process of experimenting may have uncovered a bad problem in nvcuvid.dll. Let me set the scene and then tell you what is happening.

So, I'm thinking, ideally, I want to support an Avisynth script like this:


This script is entirely run from one thread. But each AVCSource() instance instantiates its own D3D and cucontext. The file to be decoded is a series of frames with the numbers 0 to 95 on them. The output of the script then is as I want:

95 0 94 1 93 2 ...

...which tells me that the two instances are operating independently and correctly. Apparently it's OK that they both execute from the same thread as long as the filter instance directs things to the right D3D/context. I do NOT use floating contexts here.

Here's the rub: This works with the nvcuvid.dll dated 9/10/2008. But if I leave everything else alone and just change to nvcuvid.dll dated 10/8/2008, then the result is:

95 [application crashes and disappears when stepping to the next frame]

If the operation above is supposed to work then it is now broken.

I want to be able to support independent decodes by more than one filter instance in a script like above. Obviously that will require multiple D3D/cucontexts, one per filter instance, but it may all be run out of one thread ID.

So, is that supposed to work? If not and I was just lucky with the earlier DLL, then I have to abandon the idea of supporting multiple independent decodes and just go with my server architecture which is working fine, but precludes the multiple independent decodes.

The problem with managed applications is another issue. I need to get the above working again first to even have any hope. Note that using floating contexts does not help, that crashes too.

The bottom line is that if this cannot work, then I might as well just go to the server model, which is simple and robust.

Your thoughts would be appreciated.

I'll take a look - it certainly shouldn't crash (that would definitely be a bug).

Btw, do you know which call crashes ? (cuvidDecodePicture or MapVideoFrame ?)

Sorry, I didn't trace it. I can do that tonight.

But is this a scenario that SHOULD work? I.e., multiple D3D/contexts from one thread.

Yes. It shouldn't cause any problems (at least that's the theory).

Attached is the latest nvcuvid - just in case this was related to a problem that was previously fixed.

Keep in mind that using multiple D3D/Contexts from one thread will only work with floating contexts (otherwise, cuda pointers from one instance may end up in a different context, leading to unpredictable behavior).

[attached file: nvcuvid_081023.zip]

That fixed the crashing. Thank you!

Unfortunately, the crashing on second instantiation with managed apps is blocking me from using the floating context solution. Can you think of any way to further debug that?

Also, you mentioned that you now support full bobbing. How would I implement that?

I had an MBAFF clip that wasn't being deinterlaced and I found that the frames are returned with progressive_frame = 1, which stops them from being deinterlaced. That seems like a bug to me.

Also, you mentioned that full bobbing is now implemented. How would I get that working, as I have only adaptive deinterlace and field discarding bob available now from the header file?

Finally, is there a solution for VC-1 decoding with CUVID?

BTW, my CUVID server and client are available with source code here:


A bug, yes, but it sounds to me like a bug in the encoder :) I haven't seen the clip, but the logic to determine progressive_frame is as follows:

progressive_frame = (!field_pic_flag) && (FieldOrderCnt[0] == FieldOrderCnt[1]);

Feel free to use a different logic, as this flag only matters for display (not decode). MBAFF does not imply that all frames are interlaced: it only implies that 'not all frames are progressive', as this can be used to code mixed content. The relative value of {Top|Bottom}FieldOrderCnt has the following meaning:

FieldOrderCnt[0] < FieldOrderCnt[1]: interlaced, top field displayed first
FieldOrderCnt[0] == FieldOrderCnt[1]: progressive (both fields displayed simultaneously)
FieldOrderCnt[0] > FieldOrderCnt[1]: interlaced, bottom field displayed first
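The three cases can be written as a small helper (illustrative names of my own, not part of the NVCUVID API):

```cpp
#include <cassert>

enum DisplayType { kProgressive, kInterlacedTFF, kInterlacedBFF };

// Classify a decoded frame from FieldOrderCnt[0] (top) and FieldOrderCnt[1]
// (bottom), following the three cases listed above.
static DisplayType ClassifyFieldOrder(int top_foc, int bottom_foc)
{
    if (top_foc == bottom_foc)
        return kProgressive;                    // both fields shown simultaneously
    return (top_foc < bottom_foc) ? kInterlacedTFF   // top field displayed first
                                  : kInterlacedBFF;  // bottom field displayed first
}
```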

I'm assuming you mean 60Hz (or double frame rate) deinterlacing. This can be implemented by calling cuvidMapVideoFrame() twice: the first time with second_field=0, and the second time with second_field=1. The actual deinterlacing mode is still currently going to be the driver default (may be different on different GPUs and OSes), but it would allow you for example to convert 30Hz interlaced to 60Hz progressive (better than dropping one of the fields).
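The bookkeeping for double-rate output can be expressed as a helper (my own sketch, not an API function): each decoded picture is mapped twice with cuvidMapVideoFrame(), so output frame n comes from decoded picture n/2 with second_field = n & 1.

```cpp
#include <cassert>

// Map a double-rate output frame index back to the decoded picture it comes
// from and the second_field value to pass to cuvidMapVideoFrame().
static void DoubleRateTarget(int out_frame, int *picture, int *second_field)
{
    *picture = out_frame / 2;       // each coded frame yields two output frames
    *second_field = out_frame & 1;  // 0 = first field, 1 = second field
}
```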

The latest dll I sent you should already work with VC1 in MPEG-2 transport streams (or VC1 advanced profile elementary streams). If you support VC1 in your transport stream demultiplexer, you should be able to feed the elementary stream directly to the parser, with an almost identical data flow to that for H.264 or MPEG-2.

I think your logic for progressive_frame is fine for FRAME and PAFF, but for MBAFF, it can be on a macroblock basis. So what I did was change progressive_frame to 0 if the frame is marked as MBAFF and the deinterlacer is turned on.
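In code, the workaround amounts to something like this (flag names are illustrative, not the actual NVCUVID structure fields):

```cpp
#include <cassert>

// Force progressive_frame to 0 for MBAFF frames when the user has enabled
// the deinterlacer, so such frames are not skipped by it.
static int EffectiveProgressiveFlag(int progressive_frame,
                                    bool mbaff, bool deint_enabled)
{
    if (mbaff && deint_enabled)
        return 0;                 // treat as interlaced for display purposes
    return progressive_frame;     // otherwise trust the decoder's flag
}
```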

Thanks for the double rate deinterlacing solution.

And good to hear that VC-1 works already. I'll give it a test drive.

MBAFF can decide field or frame at the macroblock level, but that is unrelated to the frame being progressive or interlaced (this is purely for compression efficiency and does not indicate the type of the underlying content). The output timing of the two fields determines if the frame should be displayed as progressive or as interlaced (otherwise you might unnecessarily deinterlace progressive frames in mixed content, such as 3:2 pulldown).

If an encoder generates TopFieldOrderCount==BottomFieldOrderCount for an interlaced frame, it's definitely a bug in the encoder. This also means that there is no way to know if the content is top-field-first or bottom-field-first.

You're talking about CurrFieldOrderCnt[0] and CurrFieldOrderCnt[1], right?

I have an x264.exe-generated interlaced content stream with all frames marked MBAFF. CurrFieldOrderCnt[0] and CurrFieldOrderCnt[1] are always the same. Another guy in ST told me that the order for MBAFF just follows the coding order of the data.

Is there any spec reference that can confirm your view?

> You're talking about CurrFieldOrderCnt[0] and CurrFieldOrderCnt[1],

Right. Yes.

That is almost certainly a bug in x264 (unless it needs a separate command line parameter for tff/bff indication). delta_pic_order_cnt_bottom should be present in the bitstream, in order to properly indicate tff/bff for interlaced content (unless this stream contains pic_timing SEIs, which I doubt is the case).

The 14496-10 spec touches a bit on the subject (See section 8.2.1, where it describes TopFieldOrderCount and BottomFieldOrderCount). These values essentially indicate the display order of the top and bottom field (delta_pic_order_cnt_bottom).

You have to keep in mind that the only difference between a progressive frame and an interlaced frame is that the two fields were captured at different points in time in the case of an interlaced frame (if there is no motion, there is no difference whatsoever between the 2). If TopFieldOrderCount==BottomFieldOrderCount, it means that both fields should be displayed simultaneously, which implies that this is a progressive frame, unless the output timing is controlled by a different layer (SEI).

If unspecified by the encoder, the only alternate way of determining the correct field order (and if the frame is interlaced), would be to look into the pic_timing() SEI, if present in the elementary stream (pic_struct and ct_type can be used to infer the correct value).

My understanding was that for MBAFF the order is specified by PIC timing SEIs and if they are absent then the field order follows the coding order.

Correct. In this case the coding order indicates that it is progressive (TopFieldOrderCount==BottomFieldOrderCount -> top field displayed at the same time as the bottom field == progressive)

Hmm, OK, it sounds reasonable.

I'll look at some of my broadcast MBAFF streams and see what they do just out of curiosity. I'll also make a bug report against x264.exe.

Oh my gosh, you're right. The first broadcast MBAFF stream I looked at had TopFieldOrderCount != BottomFieldOrderCount.

So should I work around x264.exe's bug? It seems to me that if the frame is flagged as MBAFF and the user has enabled the deinterlacer, it may be reasonable to assume it is interlaced.

I can probably put in a workaround within nvcuvid, since it's highly unlikely that good streams would use delta_pic_order_always_zero=1 in an interlaced sequence. (I also need to enable looking into the pic_timing SEIs if present anyway - I'm currently only using it to detect repeated fields in 3:2 pulldown content)

Can you share the stream, so I can take a closer look ?

Give it 10 minutes, it's uploading.

I'll look into the stream to see if we can do a safe detection without unnecessarily enabling deinterlacing for mixed content...

I had a quick look at the stream, and it should be trivial to fix this in x264, by simply setting pic_order_present_flag=1 in the PPS, and adding delta_pic_order_cnt_bottom in the slice header (1 for tff, -1 for bff, 0 for progressive).

Since virtually all "good" streams will have pic_order_present_flag=1 for mbaff streams, I could put in a hack in NVCUVID: if the field ordering is unspecified in an interlaced sequence, assume that it is interlaced, top_field_first (this would obviously break if x264 is used to encode bottom-field-first content, though).

You can actually do this yourself in DGAVCDec, since you have access to all these flags in the picture parameters:

if ((pp->CodecSpecific.h264.pic_order_cnt_type == 0)
 && (!pp->CodecSpecific.h264.pic_order_present_flag)
 && (pp->CodecSpecific.h264.MbaffFrameFlag))
  // this condition should never be true in 'good' streams
  // override the display flags as interlaced tff or bff if frame marked as progressive

Attached is a version of nvcuvid with the hack (with limited testing), hardcoded to assume tff in this case.

[attached file: nvcuvid_081029.zip]

The hack appears to work fine. I will keep an eye open to make sure it doesn't cause any regressions.

Does the hack only kick in if the order is not coded AND pic timing SEIs are not present? I assume you would do it that way but if you could confirm it I'd be grateful.


I mapped and delivered the frame twice toggling second_field as you suggested and it works great!

Here's a heads up for you from one of my users.

"just to let you know, I saw a driver set on nvidia's site http://www.nvidia.com/object/winxp_180.43_beta.html

Version: 180.43 which crashes your server app on my setup anyways.. back to the GeForce Release 178 WHQL Version: 178.24 and all is good. i know they were beta and this is just for reference in case anyone else has issues..."

I hope the next release doesn't pull the rug out from under me.

I noticed this morning that for a 1080i clip I have, if I decode (without display) without the deinterlacer I get ~48 fps. When I enable the single rate deinterlacing, it drops to ~18 fps. That seems like a very heavy overhead to me. Is it normal?

It's not too surprising if it uses very high quality deinterlacing modes on a 8500GT (might also be content specific), but usually these modes are not turned on by default - I'll see if there is a quick fix.

Sometimes we may want the high-quality modes. Can it be user selectable?

What would be the best card for me to get for maximum performance of CUVID and postprocessing?

Actually, I just realized that it's completely weird: you should be getting virtually identical perf for single-rate and double-rate deinterlacing: nvcuvid always does double-rate internally -> if you don't ask for the 2nd field, the frame is just thrown away.

I didn't say anything about double rate deinterlacing. Just no deinterlacing and single rate is about 48 versus 18. I agree with what you said.

What is the perf if you set the deinterlacing mode to Mode_Bob instead of Mode_Adaptive ?

Regarding the beta driver issue:

I've been able to reproduce this issue, but I'm not sure it is a driver issue yet (may be related to a driver-related change in the pitch of video surfaces, though)

The crash occurs when the server app calls cuMemcpyDtoH(pMsg->nv12_frame, devPtr, nv12_size);

How does the server know how many bytes were allocated for the 'nv12_frame' pointer? Is it possible that 'nv12_size' is greater than pitch*(height+height/2)?

Yup: it looks like nv12_frame is declared as

unsigned char nv12_frame[(3*MAX_WIDTH*MAX_HEIGHT)/2]
with MAX_WIDTH=1920

but it should be:

unsigned char nv12_frame[(3*MAX_PITCH*MAX_HEIGHT)/2]
with MAX_PITCH=2048 (in this case, pitch is 2048).

I verified that changing MAX_WIDTH to 2048 fixes the problem (rebuilding both client & server)
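The bug above comes down to one expression: the cuMemcpyDtoH() destination for an NV12 surface must be sized from the surface pitch, not the visible width, because the driver pads each row (here 1920 rounded up to 2048). A minimal sketch:

```cpp
#include <cassert>
#include <cstddef>

// Bytes cuMemcpyDtoH() will write for an NV12 surface: a full-pitch Y plane
// plus a half-height interleaved UV plane, both at the padded row pitch.
static size_t Nv12SurfaceBytes(size_t pitch, size_t height)
{
    return pitch * height + pitch * (height / 2);
}
```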

Thanks for that analysis. I've increased the max values in my applications as needed.

I've just finished the double rate deinterlacing support. It was a bit tricky for random access. I'll release that with the size fix.

>What is the perf if you set the deinterlacing mode to Mode_Bob instead
>of Mode_Adaptive ?

They are the same at ~18 versus ~48 for no deinterlacing. That is with the display disabled.

I'm still a bit puzzled by this perf, as this means that it's not even realtime anymore (we should see the same problem during normal playback). I'll try to get hold of an 8500GT and see what happens (maybe the perf is due to the unnecessary d3d->cuda transfers?); I'm definitely not seeing this on my 8800GT (G92).

Out of curiosity, do you also see similar perf issues with the cudecode/cuh264dec samples ?

>this means that it's not even realtime anymore

Yes, that's why I inquired about it!

And it's why I am asking for your recommendation for a card upgrade, to see if it is a quirk of the 8500GT. If I am going to buy a different card I may as well get the one with the best performance.

I can try with cudecode/cuh264dec tonight.

Note that the figures I gave you are from DGAVCIndexNV, which uses code similar to cuh264dec (not the server). I can send you an unlicensed version of that if it would help you.

I have a strange problem with the double rate deinterlacing. I implemented it as you described. First, second_field = 0, map, copy to host, unmap, display frame, then, second_field = 1, map, copy to host, unmap, display frame. At home, this works fine and I get successive fields as expected.

Now, on my machine at work, with the same video card, same nvcuvid.dll, same DGAVCIndexNV executable, and same input file, the second frame of each pair is always the same as the first frame! Stepping through the code shows it operating just as at home, but with this unexpected result.

Maybe there is an undefined in the library that is coming up random or something to cause this. Can you think of anything?

In my earlier version using two interleaved decoder instances, with one set for second_field and one not, a user reported this same problem, which I attributed to some quirk of the multiple instances. Today is the first day I ran double rate on my work machine, and I was shocked to see it happening there with the single instance version.

I picked up a 9600GT at lunchtime today. I'll do some performance tests with it tonight.

You may need to install Windows XP SP3 for this to work properly (there were some VMR9 deinterlacing bugs in SP2 that MS fixed in SP3).

Aha! The one factor I didn't think of. Thanks. I'll try it and let you know.

BTW, is the second_field variable already normalized for field order? So, for second_field = 0, I'll get the top field for TFF and the bottom field for BFF?


I bought a BFG 9800GT OC card and the performance kicks the 8500GT's butt. Everything runs real time and, strangely, the fps loss when I enable the display is less too.

The only problem I have with it is the card makes a strange subtle ticking noise when it is busy. E.g, when I am on a page and roll the scroll wheel, I hear a subtle ticking coming from the graphics card. What could that be? It's pretty irritating. I'm close to returning it and exchanging it for a PNY.

But more seriously, I have this issue reported from two users:

I've muxed out a .264 file (from a Bluray disc with tsmuxer), and when I tried to load it the following error message is appeared:

GPU decoder: Failed to create video decoder [100]

CUDA driver, sdk, toolkit is installed, i've started the server application, and the dll file is in the right place (Windows\SysWOW64)

My pc details:
OS: Vista Business x64 SP1
CPU: AMD Athlon 64 X2 64 6000+
GPU: 8800GTS 320MB (NV50/G80 revision A2)

Have you any idea what might cause the "Failed to create video decoder [100]" error?

Yup. Sounds like cheap shielding by the board manufacturer (or cheap shielding on the soundcard side)

IIRC, the GTS320 doesn't have VP2.

I've noticed that one frame remains salted away somewhere when the deinterlacer is enabled and I am unable to flush it. When I do my reset and start pushing NALUs again, the hidden frame from the previous decoding comes out and I cannot find a way to flush it.

Any ideas?

The unflushable frame that I mentioned in my last mail appears to be in the D3D instance somehow, because I can get rid of it by killing the server and re-starting it.

Would you know if there is a way to flush it out with a D3D call? If I kill and remake the D3D instance, I'm going to get hit by the crash on second D3D instantiation problem again.

I've seen this problem as well: what is happening is that when this particular high-end deinterlacing method is used, it unfortunately introduces one field delay, which does suck (not a problem for normal playback, but a big issue for a generic API like NVCUVID).

So far, I haven't been able to get rid of this, but if your only concern is being able to flush it, one way to do so is to start at second_field=1 after a discontinuity.

Ultimately, the best way to get rid of this is to perform the deinterlacing within NVCUVID using a cuda kernel, rather than relying on VMR9, with sometimes some rather obscure side effects, but this requires fairly significant amount of changes in nvcuvid, so it's going to take a while.

Aha, I confirm what you say. It has nothing to do with unsuccessful flushing.

If my field sequence shown in [] delimited frames is:

[a b] [c d] [e f] [g h] [i j] ...

Then when I deinterlace double rate, I get this sequence of bobbed frames:

X a b c d e f g ...

Where X is a leftover (field now a frame) from the end of the last decode sequence. The first field is just garbage.

And yes, your suggested strategy looks good. It's a complication for my random access code so I'll have to implement it carefully.
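A toy model of the one-field delay (my own sketch, not real driver code) makes the sequence above concrete: the VMR9-path deinterlacer emits, for each bobbed output slot, the field pushed one slot earlier, so slot 0 carries a leftover field 'X' from the previous decode run. Skipping the first slot (i.e. starting the first mapped frame after a seek at second_field = 1) flushes the leftover.

```cpp
#include <cassert>
#include <string>

// Model the deinterlacer's one-field latency over a sequence of input fields.
static std::string BobbedOutput(const std::string &fields, bool skip_first)
{
    std::string out;
    char pending = 'X';             // stale field left in the deinterlacer
    for (char f : fields) {
        out.push_back(pending);     // each slot emits the previous field
        pending = f;
    }
    if (skip_first && !out.empty())
        out.erase(out.begin());     // discontinuity workaround
    return out;
}
```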

On the card making noise: I found that on the card there is a little piezoelectric speaker called a BUZZ1 speaker. It's supposed to sound an alarm when the card has too low a voltage, but the bad design makes tiny chirps during normal operation. A little plug of hot glue on the exit hole of the speaker works wonders!

VC1 Advanced Profile defines I, P, B, BI, and Skipped. The last one should result in a P frame that is a copy of the reference frame.

I have found that I do not get decode picture callbacks for these skipped frames, and so they never make it into the display queue.

But it's even more mysterious. I have a VC1 ES with these skipped frames. If I count pictures from your decoder, it is always the correct number minus 2*number of skipped frames. The correct number is what I get by just counting 00 00 01 0D start codes, which corresponds to the displayed frame count when played in a player.

So, this makes problems for me in doing proper random access.

Your thoughts would be appreciated.

The frame delta is just the number of SKIPPED frames, not 2*SKIPPED as I said earlier. It was an artifact of my automated line counting.

So we are just missing callbacks for the SKIPPED frames.

Sounds like there might be a bug in the parsing code. Can you point me to a stream that uses skipped frames ?

I found the problem: pictures that were too small were being discarded by the parser. I already have a fix for this, but this particular stream also exposed another problem that only repros on Vista (actually only on Vista+DXVA2, Vista+DXVA1 works fine, which is very puzzling since nvcuvid is essentially sending the same exact data).

I'll keep you updated.

Oh cool, thank you!

Then DGVC1DecNV will soon be a reality. I have the indexer component done and just need this fix to finish off the Avisynth source filter component.


[attached file: nvcuvid_081124.zip]

I had a big computer crash due to a stupid trojan. So haven't had time to test thoroughly. Will do today. Preliminary test looks OK.

I have seen some other anomalous behavior with 700 errors, FAILURE_TO_LAUNCH. May send test stream for you to look at.

I'll have to double-check, but I think the decoder will return error 700 if it encountered a bitstream error (not as tolerant of bitstream errors as MPEG-2).

I always get them at the beginning or end of stream, so I'm suspecting bad cuts. I'll advise if it starts looking like a parser or decoder bug on your side.

Btw, the attached binaries fix some problems decoding some VC1 interlaced content.

[attached file: nvcuvid_081201.zip]

I released the first version of my VC-1 CUDA enabled tools. A user has sent me a stream that decodes with a lot of macroblocking. It plays fine using a DirectShow decoder. Would you be able to have a look at it please?

user: guest@neuron2.net
pwd: xxxxxx
dir: xxxxxx
file: MC.vc1

Our emails must have crossed - this stream uses field pictures, so the problem is most likely fixed with the 12/01 dll I just sent you (I'll double check)

I'm still seeing macroblocking with the latest DLL.

Hmm. That's odd, I'm not seeing any problems here using cudecode - can you post a screenshot ?

Also, do you see the corruption with the cudecode test app?

I have attached a BMP but I cannot try the cudecode until I get home this evening.

I'm seeing this too. I assumed it was artifacts due to high quantization in dark scenes, but maybe that's not the case -> I'll double check with the reference decoder.

OK, thank you. I have Sonic Cinemaster DirectShow decoder and it plays it without the blocking.

Looks to be a bug when computing the deblocking filter strengths - I'm pretty sure I know what the problem is, I'll keep you updated.

The problem turned out to be something else & much more difficult to track, but it's finally fixed. Latest binaries attached.

[attached file: nvcuvid_081124.zip]

Seems to be working fine, thank you very much indeed.

Now I'm feeling guilty for making you work so hard for that one, but I take solace in at least helping you to improve the quality of the decoder in some small way. :-)

Don't feel bad: you're the best QA ever for nvcuvid.

Btw, I noticed that the client/server project files link to cudart.lib -> this is most likely unnecessary (only need cuda.lib). CudaRT is the cuda runtime library, only used if you're writing cuda kernels that require this functionality, and may introduce an unnecessary dependency on cuda toolkit components.

Basically, the only thing needed to run the app should be a CUDA 2.0 driver (178.xx or later), and nvcuvid.dll (can be in the same directory as the app) -> no need for other cuda components, except for compiling (I was reading the DGVC1IndexNV thread at doom9, so I just thought of that). Also, the beta 180.xx drivers have some known cuda issues under Vista64 and may also have a memory leak on XP that could cause the decoder creation to fail (will be resolved in the official driver).

Well, I was having a nice week-end, and figured I'd check over the DGVC1IndexNV thread over at doom9, until I saw that screenshot from 300 with the corruption, which I immediately recognized as something that went wrong in the prediction of chroma DC coefficients, so I knew that was definitely a bug in nvcuvid.

My previous fix for MC.vc1 introduced this problem as an unfortunate side effect. The good news is that the fix was trivial.

The attached version of nvcuvid has the proper fix (hopefully should be the last problem with VC1 content).

[attached file: nvcuvid_081207.zip]

Thanks very much for that. I just got back from my birthday dinner and your fix was waiting there!

I've updated the ZIPs but couldn't post about it as Doom9 is down. I'll post about it when it comes up again.

There may still be an issue with the VC1. If you could have a look at the clip mentioned here, it would be great. I have duplicated the effect.


I'll take a look. Sounds like a similar problem than the previous issue (maybe another corner case - vc1 is a mess :)

Turned out to be an 'interesting' problem - I'd be curious to know what encoder was used to produce this clip (I suspect it is a side effect of the encoder using a brute search selecting poor motion vectors, though it is still technically compliant with the VC1 spec, the problematic motion vectors are very inefficient from a RDO point of view).

Anyway, I have a workaround for this problem (latest nvcuvid attached).

[attached file: nvcuvid_081213.zip]

Thanks a bunch. Seems to do the trick.

I'll try to find out what encoder was used.

I'm told it came from "The Dark Knight" blu-ray disc authored by Warner.

Interesting. I thought for sure the studios would have used something more fancy for their encoding, but I guess at these bitrates, they probably don't care too much about the compression efficiency (the only explanation for these particular motion vectors is that they're using a simple brute-force pure SAD-based search like back in the good old MPEG-2 days).

I guess they're mostly on pre-processing and psy-based optimizations. Not necessarily a bad thing, but an interesting observation.

User reports and I confirm there's still one pink square in the top left corner of frame 168.

I know you're a perfectionist so I report that. :-)

Crap - I should have done more testing - I was only analyzing the first 100 frames. (I'm pretty sure I know what the remaining problem is too, I fixed the problem for 16x16 motion vectors, but completely forgot about other modes)

Yup. Looks like something else is going on with this clip - I'll need to run more in-depth tests on Monday.

Ok. I now have a much better understanding of the problem. The pink block thing is really something that should be addressed in the driver, but can be worked around in nvcuvid. Unfortunately, my previous workaround caused more problems that it solved :)

The attached version should resolve the pink blocks without the unfortunate side effects.

[attached file: nvcuvid_081214.zip]

You're allowed to rest on the weekends. :-)

Thank you for your latest version. I couldn't see any issues but I have submitted it to the Doom9 guys, who usually turn up issues pretty quickly if they are lurking there.

New blocking issue reported with beta driver:


I'm not seeing this here. This looks like it is using the wrong version of nvcuvid.dll

Btw, we fixed the root cause of the original issue in the driver, but it won't make it for the 180.xx release. Since I now have a much better understanding of the problem, I modified nvcuvid (attached) to fully get around any corner cases (in theory, there might have been some remaining issues even with the previous version).

[attached file: nvcuvid_081217.zip]

Thank you for the latest binaries. I've updated the distributions.

As always, your support is greatly appreciated!

Been a while. Hope you are doing well.

I have a special need to support Matroska files.

When I push a NALU I have the timecode for it. I need that timecode at picture display time. Ideally, there would be some extra data that would be carried along with the pushed packet that I could read in the decode picture callback and store for use in the display picture callback. I don't see a way to do that. Maybe there's a way that I have missed?

I tried parsing the POC and storing the timecode by POC when I push the NALU. It almost works. The problem is that for short GOPs the POC restarts at 0 during the decode ahead (MAX_FRM_COUNT is 16), and overwrites my timecode.

I hope I have explained the issue clearly. I'd appreciate any idea you may have.

The packet timestamp should ideally work for this purpose, though the parser will also interpolate timestamp values, in case a frame doesn't have a timestamp.

(Make sure to use the proper units in CUVIDPARSERINITPARAMS.ulClockrate, for example use 1000 if your timecode is in milliseconds)

Thank you for your response.

I don't understand what you mean by saying a frame doesn't have a time stamp. How does it have a timestamp in the first place? I don't know of any AVC syntax for it.

I also don't understand what happens to the timestamp that is sent in the packet. For example, for one frame I send SPS, PPS, IDR NALUs separately. Which call is the timecode taken from? As an experiment I sent different codes for the SPS and the IDR NALU pushes; it seems to have taken the timecode from the SPS push.

Your clarification would be greatly appreciated.

It should pick the last timestamp that was received before the NAL start code of the first slice of the picture (note that the packet timestamp is consireded valid only if the CUVID_PKT_TIMESTAMP flag is set in the packet flags)

I think there was a bug in previous nvcuvid versions, where if there was multiple timestamps to choose from, it should have picked the first one instead of the timestamp associated with the frame (in your example, it incorrectly picked the timestamp of the SPS NALU instead of the timestamp of the IDR) -> I can send you an updated nvcuvid, but if you only set the timestamp for the first NALU of a picture, it should pick the correct one)

Thanks, it all works great now that I pass a timestamp only on the picture NALUs.

I had implemented a kludge to do this without using the timestamps (a circular buffer for timecodes with the NALU pusher as the writer and the decode picture callback the consumer). It was tres yucky but worked.

I added a printf to print the result of both ways and they are always the same now. So I will delete my kludge and use the timestamp way as it is much cleaner.

Am I good at finding your obscure bugs, or what? :)

Cool! Nothing gets past you :)

On a side note, the next official display drivers should now include nvcuvid.dll as part of the standard driver package.

Super. Then all I have to say is:

"Buy an Nvidia card and install the latest drivers."

You're making my life too easy.

I may have found another bug related to timestamps.

First, my notation. A line like this is a NALU push:

DG: 65 -> 50550000000 1

The 65 is the NALU type byte so that's an IDR. The long number is the time code (in nanoseconds). The last number is 1 if timestamps are enabled for this push. The ulClockRate is 1000000.

A line like this is what is seen at picture display callback:

DG: 50550000000 50550000000

The first timecode is determined using my circular buffer hack which I think is working perfectly. The second is read from pPicParams->timestamp. They should always be equal.

Now, here is the trace causing an error. At the first line I have just seeked to an IDR and both timestamp methods agree. Then I start a play operation. At the end the pPicParams->timestamp suddenly gives a weird value: 51151000033. There's no reason for it. It comes just after a string of P frames when B frames start appearing again.

Of course the NALU pushes are in coding order and the display lines in display order.

Do you know why it happens?

DG: 50550000000 50550000000
DG: 65 -> 50550000000 1
DG: 41 -> 50584000000 1
DG: 41 -> 50617000000 1
DG: 41 -> 50650000000 1
DG: 41 -> 50684000000 1
DG: 41 -> 50717000000 1
DG: 41 -> 50750000000 1
DG: 41 -> 50784000000 1
DG: 41 -> 50817000000 1
DG: 41 -> 50851000000 1
DG: 41 -> 50884000000 1
DG: 41 -> 50917000000 1
DG: 41 -> 50951000000 1
DG: 41 -> 50984000000 1
DG: 41 -> 51084000000 1
DG: 41 -> 51051000000 1
DG: 1 -> 51017000000 1
DG: 41 -> 51218000000 1
DG: 50550000000 50550000000
DG: 41 -> 51151000000 1
DG: 50584000000 50584000000
DG: 1 -> 51117000000 1
DG: 50617000000 50617000000
DG: 1 -> 51184000000 1
DG: 41 -> 51351000000 1
DG: 50650000000 50650000000
DG: 50684000000 50684000000
DG: 41 -> 51284000000 1
DG: 50717000000 50717000000
DG: 1 -> 51251000000 1
DG: 50750000000 50750000000
DG: 1 -> 51318000000 1
DG: 41 -> 51485000000 1
DG: 50784000000 50784000000
DG: 50817000000 50817000000
DG: 41 -> 51418000000 1
DG: 50851000000 50851000000
DG: 1 -> 51384000000 1
DG: 50884000000 50884000000
DG: 1 -> 51451000000 1
DG: 50917000000 50917000000
DG: 41 -> 51618000000 1
DG: 50951000000 50951000000
DG: 41 -> 51551000000 1
DG: 50984000000 50984000000
DG: 1 -> 51518000000 1
DG: 51017000000 51017000000
DG: 1 -> 51585000000 1
DG: 51051000000 51051000000
DG: 41 -> 51651000000 1
DG: 51084000000 51084000000
DG: 41 -> 51685000000 1
DG: 51117000000 51117000000
DG: 65 -> 51718000000 1
DG: 51151000000 51151000000
DG: 41 -> 51785000000 1
DG: 51184000000 51151000033

Are you sure that the ulClockRate value was 1000000?

The 33 looks like it comes from having ulClockRate set to 1000 (33ms). For nanoseconds, it should be 1000000000.

However, this means that the decoder had to interpolate the timestamp for this particular frame, which means that the original timestamp got thrown away by the parser.

Was this an open GOP by any chance? This is most likely related to the same bug that cause the SPS timestamp to be picked.

Um, yeah, why did I have ulClockRate set to 1000? :-)

So I changed it to 1000000000 and now the discrepancy comes out as 51184366666 when I expected 51184000000.

It's a closed GOP:


The error comes with the last B shown above.

Should I try your most recent DLL?

I attached the official nvcuvid.dll - hopefully this will solve the problem.

Note that I think the 51184366666 is actually a correct timestamp interpolation if the frame rate of the stream is 29.97Hz: I suspect the timestamps you're getting from the container are actually in millisecond precision (thus the last 6 digits always being zero when converted to nanoseconds).

At 29.97Hz, the delta between two consecutive timestamps should be 33366666ns

The new DLL didn't change anything.

Yes, the container has only ms resolution. But why is nvcuvid interpolating when I sent the timestamp for that frame?

I suppose I can round to the nearest ms multiple, but then it's almost as yucky as my first solution.

It shouldn't - I thought this was due to the timestamp association bug in the old DLL (I guess not). I'll need to debug this internally and figure out why the parser dropped the timestamp.

Btw, could you send me a log of the packet sizes (along with the timestamp flag) ?

I *think* I may have found the problem, but I'm not entirely sure -> can you run with the attached nvcuvid.dll, it should display a message box with some debug info if it encounters the bug (and let me know what it says, if anything shows up).

When I get it I get debug popups like this:

dropped pts (new@870, cur_pts_valid=1@0)!
dropped pts (new@5574, cur_pts_valid=1@0)!

Here's the length dump. I only printed the ones with the timestamp enabled. Do you need the rest too?

DG: type=65 len= 27704 tc=50550000000 enable_tc=1
DG: type=41 len= 19528 tc=50584000000 enable_tc=1
DG: type=41 len= 16259 tc=50617000000 enable_tc=1
DG: type=41 len= 14992 tc=50650000000 enable_tc=1
DG: type=41 len= 18308 tc=50684000000 enable_tc=1
DG: type=41 len= 17741 tc=50717000000 enable_tc=1
DG: type=41 len= 14890 tc=50750000000 enable_tc=1
DG: type=41 len= 20643 tc=50784000000 enable_tc=1
DG: type=41 len= 17350 tc=50817000000 enable_tc=1
DG: type=41 len= 18665 tc=50851000000 enable_tc=1
DG: type=41 len= 18601 tc=50884000000 enable_tc=1
DG: type=41 len= 18212 tc=50917000000 enable_tc=1
DG: type=41 len= 15517 tc=50951000000 enable_tc=1
DG: type=41 len= 19373 tc=50984000000 enable_tc=1
DG: type=41 len= 7277 tc=51084000000 enable_tc=1
DG: type=41 len= 3051 tc=51051000000 enable_tc=1
DG: type=1 len= 610 tc=51017000000 enable_tc=1
DG: type=41 len= 4658 tc=51218000000 enable_tc=1
DG: 50550000000 50550000000
DG: type=41 len= 961 tc=51151000000 enable_tc=1
DG: 50584000000 50584000000
DG: type=1 len= 412 tc=51117000000 enable_tc=1
DG: 50617000000 50617000000
DG: type=1 len= 207 tc=51184000000 enable_tc=1
DG: type=41 len= 2704 tc=51351000000 enable_tc=1
DG: 50650000000 50650000000
DG: 50684000000 50684000000
DG: type=41 len= 686 tc=51284000000 enable_tc=1
DG: 50717000000 50717000000
DG: type=1 len= 329 tc=51251000000 enable_tc=1
DG: 50750000000 50750000000
DG: type=1 len= 157 tc=51318000000 enable_tc=1
DG: type=41 len= 4661 tc=51485000000 enable_tc=1
DG: 50784000000 50784000000
DG: 50817000000 50817000000
DG: type=41 len= 918 tc=51418000000 enable_tc=1
DG: 50851000000 50851000000
DG: type=1 len= 554 tc=51384000000 enable_tc=1
DG: 50884000000 50884000000
DG: type=1 len= 330 tc=51451000000 enable_tc=1
DG: 50917000000 50917000000
DG: type=41 len= 3911 tc=51618000000 enable_tc=1
DG: 50951000000 50951000000
DG: type=41 len= 446 tc=51551000000 enable_tc=1
DG: 50984000000 50984000000
DG: type=1 len= 489 tc=51518000000 enable_tc=1
DG: 51017000000 51017000000
DG: type=1 len= 317 tc=51585000000 enable_tc=1
DG: 51051000000 51051000000
DG: type=41 len= 2276 tc=51651000000 enable_tc=1
DG: 51084000000 51084000000
DG: type=41 len= 1859 tc=51685000000 enable_tc=1
DG: 51117000000 51117000000
DG: type=65 len= 21126 tc=51718000000 enable_tc=1
DG: 51151000000 51151000000
DG: type=41 len= 8983 tc=51785000000 enable_tc=1
DG: 51184000000 51151000033

Can you try the attached version ?

[attached file: nvcuvid_090128.zip]

That one runs clean!

Tried on a few files and couldn't create the problem.


Cool! Hopefully I can make the change in time for the next official driver :)

This one is a bit serious.

For random access, I have to reset the decoder this way:

int decoder_reset(void)
    CUresult result;
	int i;

	if (Session.cuParser != NULL)
		Session.cuParser = NULL;
	// Create video parser
	memset(&parserInitParams, 0, sizeof(parserInitParams));
	parserInitParams.CodecType = Session.cuCodec;
	parserInitParams.ulMaxNumDecodeSurfaces = MAX_FRM_CNT;
	parserInitParams.ulErrorThreshold = 100;
	parserInitParams.pUserData = &Session;
	parserInitParams.pfnSequenceCallback = HandleVideoSequence;
	parserInitParams.pfnDecodePicture = HandlePictureDecode;
	parserInitParams.pfnDisplayPicture = HandlePictureDisplay;
	result = cuvidCreateVideoParser(&Session.cuParser, &parserInitParams);
	if (result != CUDA_SUCCESS)
		return -1;
	// Flush display queue
    for (i = 0; i < DISPLAY_DELAY; i++)
        Session.DisplayQueue[i].picture_index = -1;
    Session.display_pos = 0;

	return 0;

I have found that every call to this leaks some memory. If I use reverse() in my Avisynth script to force every frame to be a random access, cuvidCreateVideoParser() quickly fails its malloc and the CUVID server crashes. Before that happens I can see the memory for the CUVID server increasing.

Can you have a look at this please? The leaks will accumulate over time and kill the server.

I'm pretty sure I found the problem - let me know if the attached version fixes the issue (I haven't been able to verify it locally)

Btw, I'm not sure this fix will make it for the initial driver update, but there might a simple workaround which would be to send a dummy EOS packet to flush the parser (it should *in theory at least* fully re-initialize the parser, though doing so may cause a few more calls to Decode/Display picture callbacks that can be ignored during the flush)

[attached file: nvcuvid_090208.zip]

That fixes my issue. Thank you!