As someone else pointed out, there is a good example of this on NVIDIA's site, where Shirley's code was ported to CUDA. You can use that website and its many links/examples as a reference, but it sounds like you want to learn CUDA as well. If that is the case, then I'd recommend https://www.amazon.com/CUDA-Example-Introduction-General-Purpose-Programming/dp/0131387685
I wrote it while I was at NVIDIA and have a chapter in it on ray tracing. The target audience for that book was "people who know C and want to learn how to leverage CUDA", which I think is what you are looking for. The book was translated into 8-ish languages and used in universities around the world, so I like to think it helped a lot of people learn CUDA.
(I'm not recommending it because I get any money from it; I don't. I'm recommending it because I wrote it for people like you, so I hope it helps!)
I was almost in the same position a few months ago. It sounds like I have a lot more programming experience (Delphi at school, Java for university, Perl/Bash/Python for fun, PHP :-( and Python for work, etc), but I never really got into C or C++.
I started working on a C implementation of my project right away and spent far too little time reading books and learning the basics. Although I only worked on it 2-3 days a week, everything went quite well; I made huge progress and am almost done.
So, here is what I recommend:
learn proper C - my biggest problems with CUDA were memory management and optimization. Really knowing pointers, and pointer arithmetic in particular, helps enormously
start learning CUDA right after that - the parallel part is quite easy to grasp once you've worked through a few basic examples
I read 'The C Programming Language' by Kernighan & Ritchie for the C part and really worked through it. Some things in it are no longer 100% up to date, but apart from that it's highly recommended by almost everyone who uses C. I then read 'CUDA by Example' by Sanders & Kandrot. It's not up to date at all, and I'd do a lot of things differently (the ugly HANDLE_ERROR macro, for example), but it gives you a good start, and if you play around with the code examples you get a good feeling for how CUDA works and what you can and should do and use.
I hope that helps. I'm by no means an expert, but if you have any more questions, feel free to PM me.
The Udacity course is a great way to learn. It consists of multiple lessons on the fundamentals, with more advanced topics later on. At the end of each lesson is a parallel-programming problem set that puts what you learnt to the test. Usually this is some form of image manipulation, which is a lot more exciting than drilling algorithms. Good luck, and feel free to get in touch if you need any help.
Well, that is the question. I thought I had to be wrong: as these bigger, faster, more powerful cards were released, they seemed to ship with crippled FP64. They can push around a dumb game or 32-bit pixel math just fine, but the engineering and scientific FP64 side of life is broken to hell. I notice this because one of the big Amazon EC2 compute options is the EC2 P2 instance, with a pack of the old, old K80 cards:
https://aws.amazon.com/ec2/instance-types/p2/
However, the latest NVIDIA drivers are dropping support for these older cards, so customers can move to the much more expensive EC2 P3 instances with V100 Tensor Core GPUs. Those do NOT have crippled FP64. Everything else does, all the way back to the old, old Kepler cards that the drivers are now dropping.
Just to clarify again: you don't need an NVIDIA GPU to compile CUDA code. So you can set it up as: 1) students compile code on their laptops, 2) they upload their binaries to a host with an NVIDIA GPU, 3) you collect the results.
However, if you want, you can set up one machine to do all the compilation too. Many modern IDEs (and even text editors) have the ability to browse, edit, and run commands remotely. Pick your poison and Google something like "remote development". E.g., VS Code has a collection of plugins for that: https://code.visualstudio.com/docs/remote/remote-overview
Coursera's offering is comparable as well. Mine isn't meant to be a full treatment, just a quick start guide with some interesting discussion (it is a blog, after all).
Well, with PyTorch you can just inspect your model to see whether the layers you're training are float32 or half precision (float16); that's easy.
Then there are a number of flags to set within PyTorch that can have an impact on your CUDA performance and behavior. I would check the available cuDNN flags (https://pytorch.org/docs/stable/backends.html?highlight=cudnn#torch.backends.cudnn.benchmark), as these can affect the performance and behavior of the CUDA backend. I would also look at the PyTorch forum; a number of members of the PyTorch dev team are active there and are extremely helpful.
If you want to learn CUDA for performance then you need to understand the fundamentals. For that by far the easiest resource is the Udacity Introduction to Parallel Programming course, mentioned already. I don't think OpenMP is the way to go if you want to learn CUDA, I would learn parallel programming for CUDA with CUDA. I found the resources mentioned here useful.
I've successfully called a function in a C++ DLL from C# with this tutorial.
Is there more to do if I now want to get a C++ function running that calls a CUDA function internally?
Edit: btw, why do you use "__cdecl" in your code?
I think you need the P3 instances if you want to get access to the V100 cards.
They get expensive when you start using multiple cards, but if you're using a single card, it's not too bad ($3/hr on-demand). For 8 cards with NVLink it's $24/hr on-demand, but that's over $80k worth of hardware, so it's still a pretty good deal for single runs.
I use Gnuplot for all my plotting and from C++ call it via gnuplot-iostream. It does contour plots no problem.
The API isn't amazing, but it does a ton of different plot types and you can call it from absolutely anywhere so you're not having to learn a new plotting suite and install a ton of dependencies for every language. Simple plots are simple and complex plots are possible.
In your environment can you run
conda list
and then share the output for the pytorch line?
I'm on Linux, and my line looks like
pytorch 1.7.1 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
You're looking for cuda or cudnn in that line, not cpu. Can you confirm the line does not contain cpu?
For Windows, pytorch.org/get-started/locally/ recommends something like the following; do you have conda-forge enabled?
conda create -n ENVIRONMENT_NAME pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
I cloned the cupy repo and tried installing with Git Bash. It first gave me an error because the /cub subfolder in the cloned repo was empty, so I copied in the contents of the cub folder from the link you provided, which I believe contains the required header files. After doing that, the setup starts, but now it's giving me this error:
building 'cupy_backends.cuda.api.driver' extension error: Microsoft Visual C++ 14.1 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
This doesn't make sense because I already have Visual C++ 2015-2019 Redistributable 14.27.29016 installed.
EDIT:
I don't know what I did differently, but instead of installing from the source package, I installed the prebuilt binary package for my CUDA version with:
pip install cupy-cuda110
I tried this before and got an error but it worked now, weird.