Writing an efficient (and parallel) matrix-matrix multiplication is absolutely non-trivial and not beginner friendly. It's a research topic that has taken several full PhDs to get right.
What you are running into is that the threads access memory all over the place, which thrashes the cache: each thread evicts the data the other threads just loaded and prepared, ruining their runtime. They are essentially competing for a shared resource.
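To make that concrete, here is roughly what a naive parallel multiply looks like (a minimal sketch, not your code, assuming square row-major matrices in std::vector<double>):

    #include <vector>

    // Naive OpenMP matmul: C = A * B for n x n row-major matrices.
    void naive_matmul(const std::vector<double>& A,
                      const std::vector<double>& B,
                      std::vector<double>& C, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)                  // each thread owns a block of rows of A and C
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k)
                    sum += A[i*n + k] * B[k*n + j];  // but every thread streams *all* of B,
                C[i*n + j] = sum;                    // column-wise, evicting each other's cache lines
            }
    }

Every thread walks the entire B matrix with a large stride, so the shared cache is constantly being refilled. Fixing that means blocking/tiling, which is exactly the research-grade work mentioned above.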
I suggest you not write your own matrix-matrix multiplication but instead use one of the established BLAS-level implementations, e.g. Intel MKL, Eigen or Blaze.
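With Eigen, for example (just a sketch; MKL and Blaze are similar in spirit), the whole thing reduces to a single expression and the library supplies a cache-blocked, vectorized kernel for you:

    #include <Eigen/Dense>

    int main()
    {
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(1000, 1000);
        Eigen::MatrixXd B = Eigen::MatrixXd::Random(1000, 1000);
        Eigen::MatrixXd C = A * B;   // Eigen dispatches a blocked, vectorized GEMM here
        return 0;
    }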
Before you start learning OpenMP it's important to know how the computer actually works, otherwise it's easy to write algorithms that end up slower in parallel. I suggest you read some books on the topic, e.g. High Performance Computing for Scientists and Engineers or similar. A fundamental understanding of the machine also helps, which you can get from Computer Systems: A Programmer's Perspective.
If you do want to do some introductory OpenMP exercises, I suggest you try to parallelize algorithms where the threads do not access the same regions of memory, e.g.:
    std::vector<double> list_of_stuff = ...;
    double integral = 0.0;

    #pragma omp parallel for reduction(+:integral)
    for (int i = 0; i < n; ++i) {
        integral += some_time_consuming_function(list_of_stuff[i]);
    }
I found this book an interesting and useful read on HPC - "Introduction to High Performance Computing for Scientists and Engineers" https://smile.amazon.com/gp/product/143981192X
>In my one CS class, to test the performance of an application, we were generally instructed to set a variable with a #define and compile the program with various -D inputs. Normally this would be the array size or the number of threads. Is this the accepted best practice or the only way to do this with OMP or CUDA?
That's a rather silly way to do scaling tests. In particular, the number of threads in OpenMP can be controlled via an environment variable or adjusted at runtime, with no recompilation. For CUDA, it depends on what you are trying to test, but it's very easy to change the grid and block dimensions at runtime: just call the kernel with different launch configurations.
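For example, a tiny scaling-test driver might look like this (a sketch; the program and variable names are just for illustration). For CUDA, the analogue is simply passing different <<<grid, block>>> configurations to the kernel launch.

    #include <omp.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv)
    {
        // Thread count comes from the command line (or OMP_NUM_THREADS),
        // so one binary covers the whole scaling study.
        int nthreads = (argc > 1) ? std::atoi(argv[1]) : omp_get_max_threads();
        omp_set_num_threads(nthreads);

        double t0 = omp_get_wtime();
        // ... the computation you want to time goes here ...
        double t1 = omp_get_wtime();

        std::printf("%d threads: %.3f s\n", nthreads, t1 - t0);
        return 0;
    }

Run it as ./bench 1, ./bench 2, and so on, or set OMP_NUM_THREADS in the environment; either way you never recompile.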
On a more general note, there are many different metrics for "performance" in HPC. What you are describing here is more of a scaling test: you are measuring how much speedup parallelism actually buys your code. Another important aspect of performance testing is profiling your code, which is easily done with gprof. Of course, fixing the hotspots once you find them is a discussion in and of itself.

Because you are in mechanical engineering, I assume you will be doing a lot of finite element work. If that is the case, I highly recommend taking a look at Victor Eijkhout's excellent *Introduction to High Performance Scientific Computing*; there is a free, online PDF version of the book. If you only purchase one book, let it be *Introduction to High Performance Computing for Scientists and Engineers*.

If you have other questions along the way, post them here or over at /r/HPC. Scientific computing is a rich field, and to do it well you need an even richer set of skills from computer architecture, software engineering, performance analysis, software optimization, and large-scale computing systems, not to mention the domain-specific knowledge of your problem. Don't be afraid to ask lots of questions!
jacobolus' recs are very good. You can also google some HPC resources:
c++ is going to take, um, a lot of time; read the sidebar in /r/cpp. For C there are lots of books: Zed Shaw's book, Head First C, 21st Century C, etc.
languages: http://snir.cs.illinois.edu/PDF/Programming%20Languages%20for%20HPC%20short.pdf
a good primer: https://bitbucket.org/VictorEijkhout/hpc-book-and-course/src. Texts by Hager/Wellein and Levesque also received good reviews: http://www.amazon.com/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X
convex optimization: http://stanford.edu/~boyd/cvxbook/
the important parallel libs: MPI, OpenMP, CUDA, OpenCL. The Wrox Press book "Professional CUDA C Programming" is very good; I'm not familiar with books on the others.