2 Comments

Neat! I haven't done modern GPU programming, but I wonder if you can do something analogous to double-buffering, which is a standard technique for display, especially when I/O plays a role.

The basic idea is to do the calculations on one set of data while doing I/O (or display) on the previous iteration's results. You then swap roles (once you're done with the I/O, or ready to display the new frame).

In this case you'd be using, e.g., half of the GPU units for calculating, while the other half (holding older, finished calculations) would be dumping its results back to the CPU. The trick is getting the two phases to take roughly the same amount of time, otherwise it's not a big benefit.

I guess it's really just another view on pipelining.
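Something like this, in rough Python terms (`compute` and `write_out` here are just placeholders, not anything from the article), where the I/O of the previous iteration overlaps with the next computation:

```
from concurrent.futures import ThreadPoolExecutor

def compute(frame):
    # placeholder for this iteration's calculations
    return frame * 2

def write_out(result):
    # placeholder for the I/O / display phase
    print(result)

with ThreadPoolExecutor(max_workers=1) as io_pool:
    pending_io = None
    for frame in range(8):
        result = compute(frame)            # work on the current data
        if pending_io is not None:
            pending_io.result()            # previous I/O must be done before handing off again
        pending_io = io_pool.submit(write_out, result)  # I/O overlaps with the next compute
    if pending_io is not None:
        pending_io.result()                # drain the last write
```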

Anyway, thank you for an interesting and well-written article!


Glad you liked it, Ben!

Right now the engine I/O looks something like this, expressed in Python pseudocode:

```
results = []

for mini_batch in inputs:
    session.set_input(mini_batch)         # load this mini-batch into the session
    session.run()                         # run inference on the GPU
    results.append(session.get_output())  # synchronous copy of results back to the host
```

The main I/O cost at the moment is the get_output() call, which is synchronous and very slow. One way to make it faster is to use pinned memory. I've done some simple tests and didn't get much of an improvement though - could be user error.
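A minimal sketch of the pinned-memory idea, using PyTorch tensors as a stand-in for the engine's buffers (not the actual engine code): allocate the host destination as page-locked memory so the device-to-host copy can be issued asynchronously.

```
import torch

# Stand-in for the session output living on the GPU.
device_output = torch.randn(64, 1000, device="cuda")

# Page-locked (pinned) host buffer: D2H copies into it can run asynchronously.
host_buffer = torch.empty(device_output.shape, dtype=device_output.dtype, pin_memory=True)

host_buffer.copy_(device_output, non_blocking=True)  # async device-to-host copy
torch.cuda.synchronize()                             # wait before reading host_buffer
```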

What you suggest is interesting. Here's how I understand it (rough sketch after the list):

* Initiate a transfer from GPU to CPU buffer

* Make the transfer non-blocking so that processing of the next input can start

* The only real dependency is that the previous output transfer has to finish before the next output transfer is initiated.

* Once a transfer is complete, have another process/thread do the post-processing, i.e. print to the terminal or return results to the user via the network.
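Put together, a rough PyTorch-flavoured sketch of that pipeline (`run_model` and `post_process` are hypothetical stand-ins for the engine and the consumer, `inputs` is the same iterable as in the loop above; not the actual engine code):

```
import torch
from concurrent.futures import ThreadPoolExecutor

copy_stream = torch.cuda.Stream()       # side stream for device-to-host copies
post_pool = ThreadPoolExecutor(max_workers=1)
pending = None                          # (event, host buffer) of the previous transfer

def run_model(mini_batch):
    # hypothetical stand-in for session.set_input(...) + session.run()
    return mini_batch.to("cuda") * 2

def post_process(host_buf):
    # hypothetical stand-in for printing / returning results over the network
    pass

for mini_batch in inputs:
    device_out = run_model(mini_batch)              # compute on the default stream

    if pending is not None:                         # previous transfer has to finish first
        event, prev_buf = pending
        event.synchronize()
        post_pool.submit(post_process, prev_buf)    # post-process on another thread

    host_buf = torch.empty(device_out.shape, dtype=device_out.dtype, pin_memory=True)
    copy_stream.wait_stream(torch.cuda.current_stream())  # copy only after compute has finished
    with torch.cuda.stream(copy_stream):
        host_buf.copy_(device_out, non_blocking=True)      # async D2H into pinned memory
    device_out.record_stream(copy_stream)           # the copy stream still uses this tensor
    event = torch.cuda.Event()
    event.record(copy_stream)
    pending = (event, host_buf)

if pending is not None:                             # drain the last transfer
    event, prev_buf = pending
    event.synchronize()
    post_pool.submit(post_process, prev_buf)
post_pool.shutdown(wait=True)
```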

Thanks for the discussion! Always super cool to see people interested in this type of stuff!
