I don't personally know of any kind of MPI certification.
There are plenty of courses online about parallel programming, and MPI is usually a part of that.
For example, the Coursera Intel Parallel Programming course and the Udacity HPC course have MPI sections, among others.
There are also latency issues: for example, Intel's FMA takes fewer cycles than AMD's. Or at least it did at the time of this talk - Native Code Performance on Modern CPUs: A Changing Landscape.
You don't. Laptops are not for HPC. They're great for writing code on; not so good for running it. You want to know how well your code scales with the number of CPU cores, which you can't measure with only two cores.
I recommend something online, e.g. Sabalcore or Amazon EC2. That way you've got access to a grunty computer anywhere you have Wi-Fi.
You'll need to read up on hyperthreading. If you're measuring how well your code scales per core, I recommend disabling it to avoid weird performance results.
Oh, to answer your question directly: hyperthreading will give you anywhere from -10% (i.e. less performance) to +20%, depending on how badly you abuse the cache and how often the CPU stalls on memory access.
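If you want to try disabling it yourself, reasonably recent Linux kernels (4.17+) expose an SMT toggle in sysfs. A sketch, assuming your kernel has that interface (requires root):

```sh
# check current state: on / off / notsupported
cat /sys/devices/system/cpu/smt/control

# disable SMT until reboot
echo off > /sys/devices/system/cpu/smt/control

# or disable it permanently by adding "nosmt" to the kernel command line
```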
You don't need to rent anything. You can create virtual machines on whatever you currently have. If you have Linux around, this is very easy; if not, use something like VirtualBox. Just create as many VMs as you have memory for and off you go. I'd suggest lots of little Linux VMs because their memory requirements are very small - as low as 256M each - which gives you room for many "nodes".
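Once the VMs are up, pointing MPI at them is just a hostfile. A minimal sketch, assuming Open MPI, passwordless SSH between the VMs, and hypothetical node addresses:

```sh
# hosts.txt -- one line per VM; "slots" = ranks to place on that node
# 192.168.56.101 slots=1
# 192.168.56.102 slots=1
# 192.168.56.103 slots=1

# then launch from the head node:
mpirun -np 3 --hostfile hosts.txt ./my_mpi_app
```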
The Kubernetes job-submission method is just referred to as "Jobs": https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
The problem is there's not really much of an HPC-style scheduler or queue system to go with it. You can use node or pod affinity, and taints and tolerations, to manipulate where a job ends up, but in Kubernetes the scheduler is there to determine which nodes the pods land on. There is a concept of bring-your-own-scheduler in Kubernetes, but I haven't seen many examples, or whether that could turn into something similar to Slurm or Hadoop's YARN.
My other issue is even if I were to have my researchers use k8s jobs, the learning curve is quite high compared to current HPC solutions.
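For reference, a minimal Kubernetes Job manifest looks something like this (the image name and command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  completions: 1        # run the pod to completion once
  backoffLimit: 3       # retry up to 3 times on failure
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-registry/my-solver:latest   # placeholder image
        command: ["./solve", "--input", "/data/case1"]
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
```

Simple enough for an admin, but you can see why researchers used to `sbatch script.sh` find the learning curve steep.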
I can't speak to number 2 because even Xeon machines can be sketchy. But for number 1: http://i.imgur.com/xfIx6F6.jpg That is a http://www.amazon.com/gp/product/B000234VZK - you can twist-tie it to the Phi, and it has enough static pressure to effectively turn it into a blower-cooled card.
Where did you install the PMIx RPM from? ELRepo? The rpmbuild should find PMIx automagically if you did. Have you reviewed the output of rpmbuild, looking for where it finds the PMIx library? Feel free to post it here for review: https://hastebin.com/
AWS ParallelCluster costs only as much as the total number of compute instances, attached storage, etc., so for a very rough estimate of compute costs alone you can check the price of the EC2 instances. I make it around $10 per day per node for a 16-core, 32 GB machine. Add the additional cost of storage and you'll have a good idea of the minimum.
1 week per month feels on the edge of being worth it but you can cost it yourself and find out. Do let us know what you decide on!
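As a back-of-the-envelope check, here's that estimate in a few lines of Python. The $10/day rate and the node count are assumptions from above; plug in your own numbers:

```python
# rough compute-only cost estimate for AWS ParallelCluster usage
rate_per_node_day = 10.0   # assumed: ~$10/day for a 16-core, 32 GB instance
nodes = 8                  # assumed cluster size
days_per_month = 7         # "1 week per month" of usage

monthly_compute = rate_per_node_day * nodes * days_per_month
print(f"~${monthly_compute:.0f}/month before storage")  # ~$560/month before storage
```

Storage, data transfer, and the head node would go on top of this.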
Also, just in case you haven't yet considered it, most HPC centres (in the UK at least) offer HPC services to industry. Might be worth checking locally to see if that's a cost-effective route?
Although the "raw" resources spread across one computer lab may seem significant (totalling maybe hundreds of cores, perhaps a TB of RAM), those resources are heavily constrained by limited connection speed and latency between machines, poor cooling/energy efficiency, and limited access to shared storage. Importantly, a code would also have to support running on multiple, independent machines (and its performance would have to scale efficiently to tens or hundreds of machines).
In comparison, a single, high-memory node in a cluster can have 80 cores, 1 TB of RAM and very fast storage, so a code can use close to the same resources as a whole idle computer lab without even beginning to think about multi-machine performance. Given the difficulty of developing an efficient MPI code, it seems more trouble than it's worth.
HOWEVER, for problems where you can run a hundred independent little problems (e.g. Monte Carlo simulations, ray tracing, some large linear problems) it could work well. AFAIK this is what Folding@home does. I'd still be worried about energy efficiency though (perhaps someone else can comment on this?), and about my simulation being killed because someone needed access to the machines...
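To make the "independent little problems" point concrete, here's a toy Monte Carlo pi estimate in Python where each chunk is fully self-contained (seeded, so reproducible) and could in principle run on a separate, unreliable machine with no inter-node communication; a lost chunk is just re-run with the same seed:

```python
import random

def pi_chunk(n_samples: int, seed: int) -> int:
    """One independent work unit: count random points inside the quarter circle."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

# Each call could be farmed out to a different idle/volunteer machine.
n, chunks = 10_000, 20
hits = sum(pi_chunk(n, seed) for seed in range(chunks))
pi_estimate = 4 * hits / (n * chunks)
print(pi_estimate)  # close to 3.14
```

Contrast this with a tightly coupled solver, where every step needs data from neighbouring processes and one slow or dead node stalls everyone.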
Cases like boost-mpi and boost-python are use cases that we haven't run into in Guix yet. We would need to add new packages for those. Once that is done, you could use command-line package rewriting to substitute MPI implementations. For more complicated package customization, Guix makes it easy to define your own local packages and treat them as first-class packages.
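For example, swapping MPI implementations from the command line uses Guix's package-transformation options; the package names below are illustrative (check `guix search` for what's actually available):

```sh
# rebuild a package and its dependents with MPICH substituted for Open MPI
guix install scalapack --with-input=openmpi=mpich
```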
If you have more use-cases that you'd like to see addressed, just let me know.
Lots of layers here. Since you're concerned with system health, the first thing to do is research the actual hardware you have to see what the vendor or manufacturer may have supplied: IPMI interfaces, out-of-band management controllers, BIOS settings that can log hardware events, etc. If you have spendy servers from a Tier 1 manufacturer, then that company probably has some monitoring tooling that can be used.
No matter what method you end up using, you should start by understanding what hardware you have and what you can use to extract info and state/health/status data from it.
After that you can get into Linux land and into tools that report on the health of disk drives and the like. Linux system logs will have useful things in there as well, although you'll have to do a lot of filtering. There are specific Linux utilities that will check the health of hard drives, etc.
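A few of the usual suspects (availability varies by distro, and these generally need root):

```sh
# SMART health summary and full attribute dump for a drive (smartmontools)
smartctl -H /dev/sda
smartctl -a /dev/sda

# hardware event log from the BMC (ipmitool, in-band)
ipmitool sel list

# kernel messages filtered down to errors
dmesg --level=err,crit
```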
This is not really a "I don't know Linux well" thing though! Especially if "prediction" was mandated.
You may want to double-check with management to get them to more clearly define what they want you to monitor. For instance, it would be more typical to give a "novice" the task of standing up a monitoring system for the HPC grid that is not strictly hardware-focused: more along the lines of a Ganglia or Nagios install that sets up a dashboard showing the full HPC grid and lots of pretty graphs of CPU load, uptime, network traffic over NICs, etc.
Actually, https://www.nagios.org/ (the free, open-source version) could be a good starting point for something that can monitor the HPC grid as a whole but also collect and monitor the hardware events and logs that would surface hardware issues. Nagios is only a suggestion, as there are tons of monitoring solutions out there.
The problem described here sounds like defrag on allocation. There are flags in sysfs to control THP defrag effort (see https://www.kernel.org/doc/Documentation/vm/transhuge.txt); in particular, you can (and probably should?) set defrag to "defer". It could be that your workload eventually fragments memory and is then slowed down by the time taken to compact in-use pages before new page allocations can be satisfied.
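Concretely, per the kernel THP documentation, deferring defrag is a one-line sysfs change (requires root); "defer" hands compaction to the background kswapd/kcompactd threads instead of stalling the allocating thread:

```sh
# check the current setting; the bracketed value is active
cat /sys/kernel/mm/transparent_hugepage/defrag

# defer compaction rather than stalling allocations
echo defer > /sys/kernel/mm/transparent_hugepage/defrag
```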
That makes sense. I am mostly interested in academic/science-related HPC jobs, less so commercial HPC jobs. The takeaway for me is a bit mixed. I am trying to see if a volunteer computing platform like BOINC (https://boinc.berkeley.edu/) would be helpful for reducing the national backlog of academic/science-related HPC jobs. It is unclear to me whether this is an issue that needs solving, though, as more than half of the comments on this thread seem to indicate that the backlog is usually no more than 5 days or so (and even a wait of 40+ days on a top supercomputer might be acceptable to most scientists).
I think running an HPC job on a volunteer computing platform might be possible, but it would definitely take longer than it would on a supercomputer to produce the results, so I think it would only be valuable if the time to produce the results through a volunteer computing platform would be less than the wait time to run the same job on a supercomputer.
I see, so if the demand is higher than the available time by a factor of 2 or 3, and the top waiting job time is 5 days, that doesn't seem to be a very big issue or time lag, even for the most demanding of HPC jobs.
In terms of running the codes more slowly, I was thinking of the hypothetical case where you run the backlogged HPC job through a volunteer computing program like BOINC (https://boinc.berkeley.edu/) (not currently possible). I think it would only be worthwhile if you could get your results back in less time than it would take for the actual supercomputer's backlog to clear. If the wait time is only 5 days, then perhaps this is not an actual issue that a volunteer-computing HPC platform could hypothetically solve.
I am trying to see if there is an actual need for something like BOINC (https://boinc.berkeley.edu) to be able to run HPC workloads rather than only high-throughput workloads, or if this is not a problem that needs solutions beyond requesting supercomputer time, which sounds like it might be readily available.
There's certainly middleware out there that can help with CPU scavenging, but be aware that whatever you run on the nodes needs to be resilient to node failure. As already suggested, HTCondor is one option. Another is BOINC, which was originally created for the SETI@home project.
It's probably not so useful for reporting, but I would like to add the following Prometheus exporter: the Slurm exporter.
A Grafana dashboard is also available on the main Grafana web site: slurm dashboard
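Wiring the exporter into Prometheus is one scrape job; the target below is an assumption (use whatever host and port you configured the exporter to listen on):

```yaml
scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    static_configs:
      - targets: ["slurm-head-node:8080"]  # assumed exporter host:port
```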
What are you trying to do with it? Programming Massively Parallel Processors was useful to me, but without more info, it's hard to make recommendations.
If you had to pick one thing, what do you recommend for a beginner to learn? CUDA, OpenCL, OpenACC, OpenMP, MPI?
I've been working through Numerical Methods for Engineers and Scientists by Hoffman. I skipped a bit ahead to things that interest me more, but haven't come across anything yet that can be parallelised. I've been implementing stuff in C++.