Turns out I was wrong and right at the same time.
I thought that the problem Thomson Reuters had discovered (detailed in the last post) was that Hoard’s spinlock implementation for Sparc was somehow broken, possibly by a newer version of the Sparc architecture (e.g., one using a horribly relaxed memory model). See this info from Sun about such memory models, including TSO and (yikes) RMO.
Suffice it to say that relaxed memory ordering breaks things, in addition to being absolutely awful to reason about. But luckily, saner minds have apparently prevailed, and that memory ordering (while supported by the Sparc architecture) is never enabled. Phew.
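To make the “awful to reason about” part concrete, here is the classic message-passing litmus test, written as a sketch in C++11 atomics (not code from Hoard or from Sun’s docs). Under TSO, once the reader sees the flag it is guaranteed to see the data; under a relaxed model like RMO, the two stores can become visible out of order and the assert can fire.

```cpp
// Message-passing litmus test: illustrative only, not Hoard code.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int>  data  { 0 };
std::atomic<bool> ready { false };

void writer() {
  data.store(42, std::memory_order_relaxed);
  ready.store(true, std::memory_order_relaxed);  // under RMO, may become visible first
}

void reader() {
  while (!ready.load(std::memory_order_relaxed)) {
    // spin until the flag is visible
  }
  // Allowed to fail under relaxed ordering; impossible under TSO or with
  // memory_order_release / memory_order_acquire on the flag.
  assert(data.load(std::memory_order_relaxed) == 42);
}

int main() {
  std::thread t1(writer), t2(reader);
  t1.join();
  t2.join();
}
```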
Anyway, while chasing down the bug I discovered an “impossible” sequence of events (a race, but under the protection of a lock), and switching from spinlocks to POSIX locks (slower, but safe) solved the problem. Aha! Plainly, something was wrong with the spinlock! But, it turns out, the spinlock code is perfectly fine. It’s practically identical to what Hans Boehm does in his atomics library.
Next time I find that Hans and I have done things the same way, I will assume that Hans probably got it right, so I’m OK. 🙂 The real source of the problem was elsewhere, but it points up the perils of supporting cross-platform software (and legacy systems).
First, a little background.
Hoard has an optimization specifically for single-threaded programs. Unless you spawn a thread, Hoard does not enable any locking code or atomic operations (locks become simple loads and stores). IIRC, this increases performance by several percentage points for some programs, so it’s nothing to sneeze at.
It’s simple enough: Hoard has a Boolean value called anyThreadCreated, which is initially false. The Hoard library intercepts the pthread_create call (on Windows, it does something else that has the same effect). Among other things, any call to create a thread immediately sets anyThreadCreated to true. The spinlock implementation then enables real locking.
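Here is a minimal sketch of how such a fast path can work, written with C++11 atomics and illustrative names; the real Hoard code is organized differently and predates C++11, but the idea is the same.

```cpp
#include <atomic>

volatile bool anyThreadCreated = false;  // flipped by the thread-creation wrapper

class MallocLock {
public:
  void lock() {
    if (!anyThreadCreated) {
      _held = true;  // single-threaded: a plain store, no atomic operation or barrier
      return;
    }
    // Multithreaded: a real spinlock (atomic exchange with acquire semantics).
    while (_spin.exchange(true, std::memory_order_acquire)) {
      // spin until the lock is released
    }
  }

  void unlock() {
    if (!anyThreadCreated) {
      _held = false;  // single-threaded: a plain store
      return;
    }
    _spin.store(false, std::memory_order_release);
  }

private:
  bool _held { false };               // used only on the single-threaded path
  std::atomic<bool> _spin { false };  // used once any thread has been created
};
```

(This glosses over the hand-off at the moment the first thread is created, but it shows why the flag matters: if anyThreadCreated never gets set, every “lock” is just a plain store.)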
As you can imagine, if somehow a program were to run multiple threads with locks disabled, that would be bad.
Enter Solaris threads.
It turns out that Solaris has not one but two thread APIs: the now-familiar POSIX threads API and the older Solaris threads API that preceded it. Some code still uses this old, non-portable API; notably, the code running at Thomson Reuters.
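For reference, here is what spawning the same worker looks like under each API; the snippet is mine, not from the post, and the Solaris-threads half only compiles on Solaris.

```cpp
#include <pthread.h>  // POSIX threads (portable)
#include <thread.h>   // Solaris threads (Solaris only; historically linked with -lthread)

void* worker(void*) { return nullptr; }

void spawnBoth() {
  // Old Solaris threads API: this is the call Hoard was not intercepting.
  thread_t tid;
  thr_create(nullptr, 0, worker, nullptr, 0, &tid);

  // POSIX threads API: this one Hoard already intercepted.
  pthread_t ptid;
  pthread_create(&ptid, nullptr, worker, nullptr);
}
```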
Yeah, I knew about Solaris threads, since I programmed with them “back in the day” (in the mid to late 90’s), but I overlooked them.
What was happening was that Hoard was not intercepting thr_create (Solaris threads). It thus assumed that no threads had been created, even though they had. So Thomson Reuters’ multithreaded code was running with the spinlocks disabled.
No wonder it crashed. It’s surprising it worked at all.
So Hoard now properly intercepts thr_create. Bug fixed. Life is good.
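For the curious, interception on Solaris can be done with symbol interposition: the allocator library defines its own thr_create, flips the flag, and forwards to the real one. The sketch below is illustrative (my names and structure, not Hoard’s actual code), and assumes the library is linked or LD_PRELOADed ahead of libthread.

```cpp
#include <dlfcn.h>   // dlsym, RTLD_NEXT
#include <thread.h>  // Solaris threads: thr_create, thread_t

extern volatile bool anyThreadCreated;  // the allocator's flag, defined elsewhere

extern "C" int thr_create(void* stack, size_t stackSize,
                          void* (*startRoutine)(void*), void* arg,
                          long flags, thread_t* newThread) {
  // From this point on, the allocator's locks must do real atomic locking.
  anyThreadCreated = true;

  // Forward to the next definition of thr_create in link order (libthread's).
  using thr_create_fn = int (*)(void*, size_t, void* (*)(void*), void*,
                                long, thread_t*);
  static thr_create_fn realThrCreate =
      reinterpret_cast<thr_create_fn>(dlsym(RTLD_NEXT, "thr_create"));
  return realThrCreate(stack, stackSize, startRoutine, arg, flags, newThread);
}
```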
That said, I still think that exposing programmers to the vagaries of hardware memory models should be a felony offense.
Daniel Jimenez observes that “there are advantages to relaxed memory ordering e.g. simpler cache coherence. If you know you’re going to be running a lot of message-passing-style programs that might just be OK.”
Sure. And it would also be fine if you never ran multithreaded programs. Otherwise, it’s hell.
Anyway, this is the tail wagging the dog. Pushing the complexity up to the programmer is almost always the wrong call.
If we’re eventually going to have “many-core” (i.e., a lot more than eight cores; think hundreds), we’re going to have to learn to live with more permissive consistency models, at least as an option. I don’t see how else to scale cache coherency. Pushing the complexity to the programmer is how the sophisticated HPC programmer gets performance. (And bugs, of course.) Of course, I wouldn’t mind if they could just stop at eight cores and give us ILP guys another chance.
I don’t know – is it that hard to make it scale? Niagara somehow makes it work with 64 cores…