If you only want to synchronize, as in a barrier, there may be hardware facilities, but not in the standard x86 processors we have today (also not in their server variants, and also not on the Xeon Phi). Hence, on these processors the barrier has to be implemented by hand using some shared variable. AFAIK the processor in the Fujitsu K computer has hardware support for a barrier.
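To make that concrete, here is a minimal sketch of such a hand-rolled barrier built on a shared counter (C11 atomics; the names my_barrier_* are made up for illustration, and a production barrier would add back-off or yielding instead of pure busy-waiting):

```c
/* Minimal sketch of a sense-reversing software barrier built from a
 * shared atomic counter (C11 atomics; names are illustrative only). */
#include <stdatomic.h>

typedef struct {
    atomic_int count;   /* threads still to arrive in this round */
    atomic_int sense;   /* flips each round (sense-reversing barrier) */
    int nthreads;
} my_barrier_t;

void my_barrier_init(my_barrier_t *b, int nthreads) {
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, 0);
    b->nthreads = nthreads;
}

void my_barrier_wait(my_barrier_t *b) {
    int my_sense = atomic_load(&b->sense);
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* last thread to arrive: reset the counter and release the others */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, 1 - my_sense);
    } else {
        /* spin until the last arrival flips the sense flag */
        while (atomic_load(&b->sense) == my_sense)
            ;  /* busy-wait; a real barrier might pause or yield here */
    }
}
```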
Besides writing to a shared memory segment (as mentioned by Kamran), the other low-level methods for communication are pipes and message queues. Each of these three methods has its advantages and pitfalls. These would be the lowest forms of communication on the same computer. If you want to communicate with processes on other computers (connected by TCP/IP), then socket programming would be the lowest form.
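As a small illustration of one of these, here is a rough sketch of pipe-based communication between a parent and a forked child process (POSIX; error handling kept to a minimum):

```c
/* Sketch of parent/child communication over a pipe (POSIX). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    if (pipe(fd) == -1) return 1;       /* fd[0]: read end, fd[1]: write end */

    if (fork() == 0) {                  /* child: write a message and exit */
        close(fd[0]);
        const char *msg = "hello from child";
        write(fd[1], msg, strlen(msg) + 1);
        close(fd[1]);
        return 0;
    }

    close(fd[1]);                       /* parent: read what the child sent */
    char buf[64];
    read(fd[0], buf, sizeof buf);
    printf("parent received: %s\n", buf);
    close(fd[0]);
    wait(NULL);
    return 0;
}
```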
Personally, I would not program with these low-level methods unless there was a very specific justification for doing so. MPI is the 'gold standard' in parallel programming and is already optimized, easy to install, and easy to learn.
For low-level, the previous posts cover the options well - basically, on a single shared-memory machine one can use POSIX threads and message queues quite effectively (you're welcome to look at code from my OS class for examples - http://www.cse.uaa.alaska.edu/~ssiewert/a320.html). But as pointed out, using Open MPI and more advanced tools will save you time and allow you to scale better. That is, you may also want to scale to a cluster or distributed-system solution using networking (Ethernet or InfiniBand), which requires message passing; another tool like MPI that you might want to look at is PVM - http://www.csm.ornl.gov/pvm/ When you consider distributed and/or cluster solutions (a network of uniform nodes), there are many options, but PVM might be a good place to start.
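For the message-queue option mentioned above, a rough POSIX sketch looks like this (the queue name "/demo_mq" and the sizes are arbitrary choices for illustration; on Linux, link with -lrt):

```c
/* Sketch of a POSIX message queue: create, send one message, receive it. */
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 64 };
    mqd_t mq = mq_open("/demo_mq", O_CREAT | O_RDWR, 0600, &attr);
    if (mq == (mqd_t)-1) return 1;

    const char *msg = "hello via message queue";
    mq_send(mq, msg, strlen(msg) + 1, 0);      /* priority 0 */

    char buf[64];                              /* must hold mq_msgsize bytes */
    mq_receive(mq, buf, sizeof buf, NULL);
    printf("received: %s\n", buf);

    mq_close(mq);
    mq_unlink("/demo_mq");
    return 0;
}
```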
Evan and Sam, MPI is specifically for communication between processes, not threads. If you have the chance to use threads (because you are on a single computer with shared memory), then you should use them. They provide better performance in general and they consume a lot less memory. With processes, every process will have a copy of the same initial data (unless you specifically create shared memory). With threads, on the other hand, data is shared implicitly and you have to explicitly allocate memory per thread. If each process loads 1 GB of data, it makes a big difference on a 32-core machine whether you are using 32 processes or 1 process with 32 threads. The easiest way to use threads is through OpenMP.
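A minimal OpenMP sketch of that implicit sharing: the array is allocated once and read by all threads, with only the reduction variable handled per thread (compile with -fopenmp):

```c
/* Minimal OpenMP example: 'data' exists once and is shared implicitly
 * by all threads; 'sum' is combined per thread via the reduction clause. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double data[N];
    for (int i = 0; i < N; i++) data[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];                  /* one copy of data, many threads */

    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```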
Coming back to the original question, which I think is about the underlying hardware implementation used for synchronization of shared memory, there are differences between single-core and multi-core systems. As mentioned before, on x86 systems communication is synchronized by using atomic operations. These operate on machine words, i.e. on a single integer value (32-bit or 64-bit depending on the system). Based on this we can implement locks/mutexes, semaphores, or lock-free data structures. Atomic operations guarantee that nobody else can access the same memory location at the same time. On a single-core system running multiple threads this is easy: atomic operations cannot be interrupted and thus finish before a task switch to a different thread occurs.

On multi-core systems this is more complicated: doing atomic operations on the RAM itself would be too slow. First, you would have to load the value, do a small computation, and then write back the value. During this time no other core would be allowed to access the memory, which would take several hundred cycles. This is not feasible. Instead, cores are still allowed to use their caches. Whenever the cache is accessed, the corresponding memory address is communicated over a special bus using the MESI protocol. This ensures that the value you are reading from the cache has not been changed by a different core (and stored only in its cache), and you can request exclusive access to this address during atomic operations. I am not a real expert on this, so you had better read about the MESI protocol yourself, e.g. on Wikipedia.
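To see how locks are built on top of these word-sized atomic operations, here is a sketch of a simple test-and-set spinlock in C11 (initialize with spinlock_t l = { ATOMIC_FLAG_INIT };):

```c
/* Sketch of a lock built directly on an atomic machine-word operation
 * (C11 atomic_flag; a plain test-and-set spinlock). */
#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock_t;

void spin_lock(spinlock_t *l) {
    /* atomic test-and-set: only one core can win; the cache-coherence
     * protocol (e.g. MESI) gives that core exclusive access to the line */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;  /* spin until the holder releases the lock */
}

void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```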
IMO, programmers (all real programmers, not web programmers or app scripters) should understand how the hardware really works. Fundamentally, memory is the *only* mechanism for thread communication - the other techniques mentioned here (say, message queues) are composed from memory operations. (Communication is also possible using device I/O: network or disk interfaces; but these are always slower and are at least partly composed of memory operations again.)
Programmers often think memory operates on variables, but in fact memory only operates on cache lines (usually 64 B) that compose pages (usually 4 KB). Pages are the granularity at which memory can be shared, but within a page, cache hardware synchronizes operations on lines (using usually-proprietary protocols, including the MESI protocol mentioned above). Memory operations are usually just load/store, but must include some special semantics to provide deterministic synchronization, such as atomic or locked operations. (Interestingly, CPUs mostly don't implement threads, but rather cores - so that x86's monitor/mwait puts a core into a lower-power mode, instead of doing some sort of thread switch.)
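One practical consequence of this cache-line granularity is false sharing. A small sketch (the 64-byte line size is a common value, not a guarantee):

```c
/* Memory traffic happens at cache-line granularity: padding per-thread
 * counters onto separate 64-byte lines avoids false sharing. */
#include <stdatomic.h>

#define CACHE_LINE 64
#define NTHREADS 8

/* Bad: all eight counters fit in a single cache line, so every increment
 * bounces that line between cores even though the data is "independent". */
atomic_long counters_packed[NTHREADS];

/* Better: each counter sits on its own cache line. */
struct padded_counter {
    atomic_long value;
    char pad[CACHE_LINE - sizeof(atomic_long)];
};
struct padded_counter counters_padded[NTHREADS];
```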
In other words, any programming language will provide some features (possibly intrinsics or library code) for memory-based atomic operations, mutexes, semaphores, monitors, queues, etc. The performance of these varies, and it's up to the programmer to choose well. Uncontended synchronization is usually cheap (on the order of 100 cycles), but anything contended is much more expensive (1-3 orders of magnitude higher).
I think that using MPI on a single machine (with a multi-core processor) is really a bad idea. It is much slower than a direct thread implementation (POSIX threads). Using OpenMP is an acceptable solution if you don't have any notion of thread programming, or if you don't want to spend the time to learn it. However, if you have time to learn thread programming with explicit synchronization, then you will get a much higher speedup than with OpenMP. The problem with OpenMP is that it generates synchronization (like mutual exclusion) where it is not necessary. The simple reason for that is that OpenMP does not understand the semantics of the code, but only the directives inserted by the programmer.
Now if you want to use direct thread programming, you will have to learn a little bit about synchronization schemes like mutual exclusion or barriers. That is absolutely necessary, as the thread programming model is a shared-memory model. Since all threads can access the same memory and thus the same data, you must deal with concurrent accesses, and with potential coherency and consistency problems, which must be controlled using synchronization.
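A minimal pthreads sketch of the kind of explicit synchronization meant here: several threads update one shared counter, and a mutex serializes the concurrent accesses (compile with -pthread):

```c
/* Several threads increment a shared counter; a mutex provides
 * the mutual exclusion needed for a correct result. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS 100000

long counter = 0;                                  /* shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);                 /* mutual exclusion */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * ITERS);
    return 0;
}
```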
@Xavier: MPI on a single machine is not that much slower (depending on how you implement communication). First of all, on Linux the operating system does not distinguish between threads and processes in how they are handled; the internal implementation is the same. Furthermore, I think that all MPI libraries detect when MPI processes are running on the same machine and immediately fall back to using shared memory for communication (the major MPI implementations do this). Still, this approach is slower than most direct threading implementations. But with the MPI 3.0 standard, MPI supports shared memory directly. There are functions to query which processes are running on the same machine and then to allocate shared memory for these processes.
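A rough sketch of that MPI 3.0 path, under the assumption that at least two processes run on the same node: split the communicator by node, then allocate a window whose memory the node-local processes can load/store directly:

```c
/* Sketch of MPI 3.0 shared memory: group node-local processes and
 * allocate a shared window (one double per process). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* group together the processes that share a node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* allocate one double per process in a shared window */
    double *base;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node_comm, &base, &win);

    base[0] = node_rank;                 /* write my own slot */
    MPI_Win_fence(0, win);               /* simple synchronization */

    /* rank 0 on the node can read its neighbor's slot directly */
    if (node_rank == 0 && node_size > 1) {
        MPI_Aint size;
        int disp;
        double *neighbor;
        MPI_Win_shared_query(win, 1, &size, &disp, &neighbor);
        printf("rank 0 sees neighbor value %f\n", neighbor[0]);
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```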
Don't get me wrong: I am not voting for using MPI, but it might not be as slow as you think. And it will allow you to use more computers in the future.