First up, Ridiculous Fish's article on shared-memory multithreading makes a good case for why testing is unlikely to find multithreading (MT) bugs.

This is one of those 'how deep does the rabbit hole go?' kind of posts, and is well worth a read. The characterisation of modern CPUs as heavily optimised for single-threaded code is one I hadn't really considered, and I certainly hadn't realised how much stricter memory read ordering is on x86 than on other CPU architectures. This means that MT code that runs fine on multiprocessor Intel boxes may misbehave on other architectures.

Then a follow-up that walks newcomers through the complexities of using memory barriers for synchronisation.
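
To make the barrier idea a bit more concrete, here is a minimal C++11 sketch of the classic "publish some data, then set a flag" pattern. It is not taken from the linked articles, and the names payload, ready and publish_and_read are made up for illustration. Without the two fences, nothing in the language guarantees that the consumer sees the payload once it sees the flag, and this is the kind of code that can appear to work on x86 and then fail on a more weakly ordered CPU.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: one thread publishes a value, another reads it.
// Returns the value the consumer observed.
int publish_and_read() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that announces the data
    int seen = 0;

    std::thread producer([&] {
        payload = 42;
        // Release fence: the write to payload above cannot be
        // reordered past the flag store below.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&] {
        // Spin until the flag is set.
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence: the read of payload below cannot be
        // reordered before the flag load above.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(ready.load());
    return seen;
}
```

The release fence pairs with the acquire fence, so a consumer that observes ready == true is guaranteed to read 42 from payload. In everyday code the same effect is usually had more simply by storing the flag with memory_order_release and loading it with memory_order_acquire; the explicit fences are used here only to make the barriers visible.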

Finally, a short one quantifying the speed impact of cache line bounces.
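
For anyone who wants to see a cache line bounce for themselves, here is a rough C++ sketch, again my own rather than the linked author's. Two threads each increment their own counter; in PackedCounters the counters most likely share a cache line, so the line ping-pongs between cores, while PaddedCounters pushes them onto separate lines. The 64-byte line size is an assumption (common, but not universal), and the names hammer, PackedCounters and PaddedCounters are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>

// Assumption: 64-byte cache lines. Check your target CPU.
constexpr std::size_t kAssumedLineSize = 64;

// Both counters sit next to each other, very likely on one cache line.
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Each counter is aligned to its own (assumed) cache line.
struct PaddedCounters {
    alignas(kAssumedLineSize) std::atomic<long> a{0};
    alignas(kAssumedLineSize) std::atomic<long> b{0};
};

// Runs two threads, one incrementing c.a and one incrementing c.b,
// and returns the elapsed wall-clock time in seconds.
template <typename Counters>
double hammer(Counters& c, long iterations) {
    assert(iterations >= 0);
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] {
        for (long i = 0; i < iterations; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (long i = 0; i < iterations; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling hammer on a PackedCounters and then on a PaddedCounters with a few tens of millions of iterations, and comparing the two returned times, should show the effect on a multi-core machine. How big the gap is depends heavily on the hardware, so treat any single number with suspicion.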

To me, these all add up to good reasons to think carefully before making shared-state multithreading a central theme of your next architecture.