stability+speed for Opteron/NUMA?

PattiM

stability+speed for Opteron/NUMA?

Post by PattiM »

I'm looking at some different distros and am surprised that they have different stability on a SuperMicro Opteron 6386SE box when Turbo Mode (in BIOS) is enabled. I'm wondering if there is a general consensus of distros for HPC that someone could help me find. I see Mint is tops (and has been for a long time) on DistroWatch. I also noted that OpenSuSE 12.3 crashed with no warning after running fine for an hour or so on this box. OTOH: CentOS 7 LIVE distro never crashed. (But they both give kernel panic when Turbo Mode is enabled.) A little snooping on google showed that different distros have waaaay different kernel support, and I'm looking for some good advice. Can a wizard please help? I really like Mint for some reason, especially LMDE (it seems like Debian is about as pure Linux as you can get, and that's sort of been a lifelong dream for me).

Many Thanks!!
Patricia
PatH57

Re: stability+speed for Opteron/NUMA?

Post by PatH57 »

Hi Patricia
Could you expand a bit more on "HPC"? Do you mean High Performance Computing?
PattiM

Re: stability+speed for Opteron/NUMA?

Post by PattiM »

Thank you very much for the reply :-) Yes, sorry - computational fluid dynamics, actually. Using OpenMPI on multiple cores. For at least a couple of different codes, including the Weather Research and Forecasting Model (WRF) and the NASA/GISS ModelE2 general circulation model. These are big FORTRAN codes. So, although this box is billed by SuperMicro as a "server" - I'm more interested in FLOPS and memory bandwidth. I think a lot of people use "servers" like this - for HPC. So it's not about handling multiple users, but about OpenMPI running many processes, and interprocessor communication. The Full Load Turbo on the 6386SE's is 2.8GHz -> 3.2GHz, which is not insubstantial. I tried running 64 threads of Primex64 on this machine (presumably at 2.8GHz) and it didn't heat up at all, so I'm pretty sure even at 3.2GHz it would be within TDP/temperature limits. So I guess that's what I'm after, but also not "committing" to a system (i.e., distro) that's going to generate headaches. In playing around with CentOS7 today, I realized how heavily dependent I am on YaST (in OpenSuSE - I've been using OpenSuSE for years) for setting up the system - I couldn't figure out how to turn on the wifi! I guess any time you change distros, there is a learning curve... but there is also the kernel/motherboard/CPU support to consider - and that is usually inadequately documented, it would seem.

One of the reasons for choosing AMD is that these types of codes don't do well with extra "threads" - they need extra physical cores. If I turn on 4 OpenMPI threads on an i5 (=2 physical cores), these codes run slower than using only 2 threads. AMD has problems, but they do have cores!

Patricia
PatH57

Re: stability+speed for Opteron/NUMA?

Post by PatH57 »

gosh I love you already :lol:

I used to build big HPC and farms for major companies around the world.
Also, I never really had to deal with a GUI doing this, or wifi (latency between units would be so high that even the best NUMA or OpenMPI setup would just not work).
So you are just trying to extend parts of your code to the message passing interface on a single machine?
Use Open MPI and bind to NUMA (NUMA is better with threads).
Also, this is more of an end-user forum; you should post on http://www.open-mpi.org/
On a local machine I would probably get more juice using the Fortran OpenMP extension, as I would want to enforce the use of NUMA-local memory, with threads permanently bound to the same CPUs and using the CPU cache as much as possible.
Do you have access to the Fortran code, or do you have some kind of influence on the coding part?
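
A rough sketch of what "bind to NUMA" could look like on the command line, assuming Open MPI 1.8 or newer (older releases spell these options differently, so check mpirun --help first). The rank count and the executable name ./wrf.exe are placeholders, not taken from any particular setup. The helper only prints the command so it can be inspected before running it.

```shell
#!/bin/sh
# Build (but do not run) an mpirun command line that maps ranks by NUMA node
# and binds each rank to the node it was mapped to.
# Assumes Open MPI >= 1.8 option names; numa_mpirun_cmd is a hypothetical helper.
numa_mpirun_cmd() {
    np="$1"      # number of MPI ranks
    prog="$2"    # MPI executable (placeholder path)
    echo "mpirun -np $np --map-by numa --bind-to numa --report-bindings $prog"
}

# Print the command for 64 ranks; --report-bindings makes mpirun list
# where each rank ended up when the command is actually run.
numa_mpirun_cmd 64 ./wrf.exe
```

The bindings reported at startup are the quickest way to confirm that ranks really landed where intended.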
PattiM

Re: stability+speed for Opteron/NUMA?

Post by PattiM »

Hi Pat: Thanks for the reply! :D I have to compile the codes, and I've been using gfortran+OpenMPI+netcdf363. I've tried the openmpi mailing list - not so useful. Some of the codes have aspects of OpenMP in them as options, but the heavy lifting is always done by domain-decomposition parallelism.

Just one standalone box (so the wifi thing was just me sobbing about not having a Yast-look-alike anywhere except OpenSuSE...). The main reason for this post was that I was seeing different stability for different distros on this SuperMicro 64-core box. I was really surprised that different distros would be different that way. And I was hoping someone would have wise words about things like the BIOS: NUMA support, HPC support, CPB (core performance boost) support, etc. (and possibly how they relate to the Kernel Panic I was seeing). The OPTERON 63xx chips have been out for a long time, so I would expect all the distros to support them fully by now.

EDIT: I do remember coming across numactl (and one related command I forget) to "bind" processes to specific CPU's and their local memory. I think OpenMPI also supports binding. But there's more to it which I haven't yet learnt - basically, you want adjacent pieces of the computational domain on the same processor, insofar as possible, so they only have to pass data within the socket, rather than going out across a limited HT bus. I'm not sure how up-to-date this image is, but notice 2 "jumps" are needed to get from, say, CPU1 to CPU4.

http://www.google.com/imgres?imgurl=htt ... g&tbm=isch

I have not figured out how to identify which threads contain which parts of the solution domain in order to profit from numactl or from the MPI "binding." But I've been told that the modern linux kernels are pretty good at doing such things on the fly. But I don't know how or where to actually check if that is true or not... :)
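
One way to check what a process is actually allowed to use, as a hedged sketch: on Linux, /proc/<pid>/status lists the CPUs and memory nodes available to that process. An unbound process shows every CPU, while one started under numactl or with MPI binding should show only its own subset. show_binding is a made-up helper name, and the numactl --hardware call is skipped if numactl is not installed.

```shell
#!/bin/sh
# Print the CPUs and NUMA memory nodes a given process is allowed to use.
# Reads /proc/<pid>/status (Linux only); show_binding is a hypothetical helper.
show_binding() {
    pid="$1"
    grep -E '^(Cpus|Mems)_allowed_list' "/proc/$pid/status"
}

# Show the node/CPU/memory layout and node distances if numactl is available.
if command -v numactl >/dev/null 2>&1; then
    numactl --hardware || true
fi

# Example: inspect the current shell; replace $$ with the PID of an MPI rank.
show_binding $$
```

This only shows where a process may run, not which part of the solution domain it holds; that mapping still has to come from the model's own decomposition.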
Last edited by PattiM on Sun Aug 24, 2014 4:18 pm, edited 2 times in total.
PatH57

Re: stability+speed for Opteron/NUMA?

Post by PatH57 »

I do know a couple of people at supermicro, do you have the exact model?
also GUI and other stuff are not really meant to run on a "server" box, graphics are bare minimum. You may want to go for a "workstation" model.
That's what my engineers love to remind me of, so they can also "play" on these units on top of crunching big numbers or preparing HPC rollouts.... :lol:
PattiM

Re: stability+speed for Opteron/NUMA?

Post by PattiM »

PatH57 wrote:I do know a couple of people at supermicro, do you have the exact model?
also GUI and other stuff are not really meant to run on a "server" box, graphics are bare minimum. You may want to go for a "workstation" model.
That's what my engineers love to remind me of, so they can also "play" on these units on top of crunching big numbers or preparing HPC rollouts.... :lol:
(edit: attached the wrong pic - here's the right one)
http://elnexus.com/mail/hyper-transport.gif
...I just assumed it would have fast HT, fast memory access, and solid performance.

Thanks again - I'm getting ready to send this unit back to SuperMicro (I'm working with "Jerry" on their tech support via email) and it's clear that the kernel panic and lack of sync of CPU nodes during boot are pointing to a hardware defect. Prime, Sandra, and Memtest 5.1 all run OK so it's not memory. I always install a separate graphics card, and just use it to set up a problem then log off and let it "crunch" - so I'm not *stressing* the system with KDE (effects turned off) - no games - that's done on my home Phenom II black box... ;-) (Work is insane busy all the time...)

Interestingly, opensuse 12.3 always crashes after about an hour (sudden reboot, as if I hit the "reset" button) but CentOS7 ran prime95 overnight, using 64 CPU's and 490GB memory. Why would these two distros be so different in stability??? Both get kernel panic if I turn on the BIOS Core Performance Boost (to try to get up to 3.2GHz). And both distros run cool as a cucumber (according to SuperMicro's temperature monitor utility) so I know it's not overheating. Very weird!
PatH57

Re: stability+speed for Opteron/NUMA?

Post by PatH57 »

wild guess but they may all use a different kernel version?

uname -r
also boosting cpu is fine under normal use, but running all the cores at higher speed does also put a lot more stress on the memory bus.
I will check tomorrow what version of the kernel we use (it's a vanilla with some mods and we compile all drivers for it)
Time for the old man to go to bed. see ya soon.
PattiM

Re: stability+speed for Opteron/NUMA?

Post by PattiM »

PatH57 wrote:wild guess but they may all use a different kernel version?

uname -r
also boosting cpu is fine under normal use, but running all the cores at higher speed does also put a lot more stress on the memory bus.
I will check tomorrow what version of the kernel we use (it's a vanilla with some mods and we compile all drivers for it)
Time for the old man to go to bed. see ya soon.
I know there are different kernel version update histories. I think CentOS7 just switched to version 3.x of the kernel but Ubuntu has had 3.x for several releases. The SuperMicro tech support indicated they think it's a mobo or CPU problem, so I'm starting by swapping motherboards. If it is, it will be the first time a motherboard wasn't working quite right. Hopefully, that is a fix :-) ...but it worries me that about 20% of the time the CPU nodes (sockets, really) reported not "syncing"... scary!

Does Linux Mint always have the latest kernel (rolling distro), or is that just LMDE? Do you think LMDE would be more or less stable in an application like this than regular Mint?
PatH57

Re: stability+speed for Opteron/NUMA?

Post by PatH57 »

LMDE is using 3.11 Mint/Ubuntu uses 3.13
I'm using 3.15.10 and find it quite stable so far, will try 3.16.2 this week.
PattiM

Re: stability+speed for Opteron/NUMA?

Post by PattiM »

PatH57 wrote:LMDE is using 3.11 Mint/Ubuntu uses 3.13
I'm using 3.15.10 and find it quite stable so far, will try 3.16.2 this week.
Hi Pat
That's surprising - I thought Debian was a "rolling" distro and always had the latest and greatest. :oops:
Is there any consensus you might know of concerning the stability of LMDE vs Mint?
Apparently, Ubuntu 14.10 uses 3.16...
http://en.wikipedia.org/wiki/List_of_Ubuntu_releases
PatH57

Re: stability+speed for Opteron/NUMA?

Post by PatH57 »

I think it's more a matter of personal taste, and latest is not always greatest (it also has to be stable).
On production units I usually don't go for the latest but wait for 1 or 2 revisions. On my own ones, of course, I can't resist trying a new kernel, but I can always boot back to the old one if something doesn't work as expected.
PattiM

Re: stability+speed for Opteron/NUMA?

Post by PattiM »

Hi Pat: I hope all is well with you! The saga continues... I just tried booting the latest LMDE, and the System Monitor in Cinnamon claims I only have 32 CPU's. Hmmm.... Other distros show (correctly) 64.
PatH57

Re: stability+speed for Opteron/NUMA?

Post by PatH57 »

Hi Patty,

from memory LMDE has kernel 3.11 so that could be it.
let's have a look on the unit.

sudo inxi -U

inxi -Fxz
you may have to install inxi
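
Independent of inxi and System Monitor, the kernel's own CPU count can be read from sysfs. A minimal sketch, assuming a Linux system that exposes /sys/devices/system/cpu; count_cpu_list is a made-up helper that turns a range list such as 0-63 into a number.

```shell
#!/bin/sh
# Count the CPUs in a kernel-style list such as "0-63" or "0,2-3".
# count_cpu_list is a hypothetical helper name.
count_cpu_list() {
    list="$1"
    total=0
    old_ifs="$IFS"
    IFS=','
    for part in $list; do
        case "$part" in
            *-*) lo="${part%-*}"; hi="${part#*-}"
                 total=$((total + hi - lo + 1)) ;;
            *)   total=$((total + 1)) ;;
        esac
    done
    IFS="$old_ifs"
    echo "$total"
}

# Compare the CPUs the firmware reported (present) with those brought up (online).
for f in present online; do
    path="/sys/devices/system/cpu/$f"
    if [ -r "$path" ]; then
        cpus=$(cat "$path")
        echo "$f: $cpus ($(count_cpu_list "$cpus") CPUs)"
    fi
done
```

If online comes out lower than present, the kernel did not bring up every CPU it was told about, which would point at the kernel rather than at the monitor applet.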
PattiM

Re: stability+speed for Opteron/NUMA?

Post by PattiM »

Hi Pat: I've been booting between live distros - and right now I'm in Mint KDE live - it seems pretty stable and it's showing all processors. I'm running two instances of Prime95 - one with 19 CPU's and one with 9 for a total of 420GB memory usage. It's been running for an hour, so it seems pretty stable. I think Mint KDE had the latest kernel version of any of the distros. I'm going to have to let it run for a while to verify memory and CPU's before I ship the box back to the vendor (it won't run stably, or even boot most distros, with core performance boost enabled). Also, no matter how many threads I run, the system fans never speed up and the motherboard monitor software shows the CPU's don't warm up. This is unreasonable - so something is really wrong. The TDP per CPU is the same as another machine and its fans speed up whenever I crunch numbers on more than ~half the CPU's. So it *must* be a motherboard or chipset problem...

I am leaning toward Mint KDE from CentOS. It seems like CentOS's user forum is not heavily used, and there's no Scientific Linux user forum at all!

PS: I just discovered something. If you hover the mouse over the CPU utilization graph in the System Monitor, it will pop up a stacked list of CPU's and their %ages. This would be REALLY handy to have all the time on the desktop (GKrellm is the only other one I've seen which will do this - but it's difficult to configure) - I wonder if there is a way to get that to be the preferred display mode for System Monitor?
PatH57

Re: stability+speed for Opteron/NUMA?

Post by PatH57 »

the system fans never speed up and the motherboard monitor software shows the CPU's don't warm up
NUMA is mostly about memory usage, so if your testing is memory intensive but not CPU intensive, that could be normal (assuming ACPI and the sensors report it correctly). Does the case feel hot?
You could use htop (from the package manager) to monitor, or pipe the output to a text file; with the number of threads you have, a graphical display could take up a lot of screen space.
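
For logging to a text file without extra packages, per-CPU load can also be worked out from two snapshots of /proc/stat. A minimal sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); busy_pct is a made-up helper name.

```shell
#!/bin/sh
# Percentage of time a CPU was busy between two /proc/stat lines for that CPU.
# busy_pct is a hypothetical helper; idle and iowait both count as not busy.
busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9 && i <= NF; i++) total += $i
          idle = $5 + $6 }
        NR == 1 { t0 = total; i0 = idle }
        NR == 2 { dt = total - t0; di = idle - i0
                  if (dt > 0) printf "%d\n", 100 * (dt - di) / dt
                  else print 0 }'
}

# Sample cpu0 twice, one second apart, and print a timestamped line.
if [ -r /proc/stat ]; then
    a=$(grep '^cpu0 ' /proc/stat)
    sleep 1
    b=$(grep '^cpu0 ' /proc/stat)
    echo "$(date) cpu0 busy: $(busy_pct "$a" "$b")%"
fi
```

Redirecting the output with >> to a file and looping over cpu0 through cpu63 would give a plain-text record of which cores were actually loaded during a run.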
PattiM

Off Topic...

Post by PattiM »

I've been getting way paranoid lately about the cybercrime world. I was even looking at "secure" linux distros. So here's my question: how is the *nix community protecting and policing itself? I'm pretty sure that upstream developers have a tacit method for preventing intrusion by cybercriminals (e.g., installing backdoors and trojans in kernel elements), but I've never really verified that. Maybe this general subject is not talked about openly. Like the T*ue*r*pt thing.

How would you know if a cybercriminal were trying to infiltrate a distro? A decade ago I helped build packages for OpenSuSE 11.x (bacula, before it went Pro, mainly) with PGP signing and all, but am I just paranoid or has the cyberworld changed? Back then we didn't hear about organized cybercrime corporations in Russia, or things like "McD**pals".

I guess I need to read Krebs more...
Patricia

EDIT: After hitting "post" I sort of realized that focus these days seems to be switching toward physical security - card readers/skimmers and such. But also "d*mps" - I already stopped using ATM's. What really has me freaked out is t*uecr*pt - if it was indeed to prevent an enforced backdoor installation, where does that leave Windows? I bought a new laptop that has Windows8 installed on it (and you can't downgrade the Windows because no drivers) and I'm literally scared to use it. I would estimate a double-digit percentage possibility that there are hardware or software skimmers already installed per gov't request... NONE of us have actually read the P*tr*ot act - there is a lot of stuff in there.

EDIT2: (do I sound paranoid enough? :lol: )

EDIT3: OK, LM17KDE just finished installing on my thumb drive... time to reboot :)
PatH57

Re: stability+speed for Opteron/NUMA?

Post by PatH57 »

EDIT2: (do I sound paranoid enough? :lol: )

EDIT3: OK, LM17KDE just finished installing on my thumb drive... time to reboot :)
definitely 8)
I'm actually more worried about internet security than plain security on my unit...