As Peter says - the terms tend to be used loosely:)
However, strictly speaking, A running in parallel with B means that A and B execute at the same time - thus requiring two execution units. A running concurrently with B, on the other hand, means that as a programmer you must ensure that your code is correct in all three cases: A runs in parallel with B, A runs first and then B, or B runs first and then A. (A and B may also be interleaved through preemptive scheduling, but this is fine if you are OK with the three other scenarios.)
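To make the three-case requirement concrete, here is a minimal sketch in C with POSIX threads (the task names and counts are mine, purely illustrative): two tasks update a shared total under a mutex, so the final result is the same whichever of the orderings above actually occurs.

    /* Minimal sketch: the program is correct whether A and B run in
     * parallel, A before B, B before A, or interleaved by the scheduler. */
    #include <pthread.h>
    #include <stdio.h>

    static long total = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *task(void *arg)
    {
        long increments = *(long *)arg;
        for (long i = 0; i < increments; i++) {
            pthread_mutex_lock(&lock);   /* serialise the shared update */
            total += 1;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        long na = 100000, nb = 200000;
        pthread_create(&a, NULL, task, &na);  /* task A */
        pthread_create(&b, NULL, task, &nb);  /* task B */
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("total = %ld\n", total);       /* always 300000 */
        return 0;
    }

Without the mutex the interleaved case could lose updates, which is exactly the kind of error this correctness obligation is about.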
A rule of thumb is that parallelism is used for high-performance computing while concurrency is used for utilisation. (Though concurrency as a design pattern is also a powerful tool for ensuring correctness. Yes: a correctly designed concurrent program is in fact simpler to get right than a sequential program in many scenarios.)
We still teach CSP for correctness: http://en.wikipedia.org/wiki/Communicating_sequential_processes
You may think of concurrency as having the impression that tasks run in parallel. For that to happen flawlessly, you should exercise the techniques known as parallel programming. When actually running, the tasks may be using multiple cores in a single machine, using different cores in different machines, or sharing the same core. All three kinds of execution are "concurrent", but to differentiate them we may reserve that term for the third type, and call the first type "parallel" and the second "distributed".
As Peter said, the terms may be used in different ways, so whenever you are reading something, make sure you understand what the terms mean in that context.
Back in the '70s, "concurrent" was used for single-CPU time-slice systems, like the IBM mainframe computers: one CPU running multiple programs and switching by either a "yield" function or a timer interrupt. That eventually evolved into the Windows and Linux systems that still run by time-slicing on single-CPU systems.
I write parallel HPC algorithms for multi-core or multi-node operation, and although I am lazy and use "parallel" for most things, if pressed I would adhere to the notion that to call something a "parallel" algorithm it has to be theoretically capable of managing and exploiting simultaneous execution on multiple hardware units, whether multiple cores or multiple nodes, to some performance advantage.
I think I can understand a little more about the difference. However, my concern is actually in the low-level domain of program execution on multi-core and multiprocessor systems. It appears that a perfect utilisation of multiple cores would mean designing algorithms and developing programs that allow truly parallel execution rather than just concurrent execution. I raised this question to increase our awareness of how software design has actually lagged behind the hardware design. Whether working as computer engineers or scientists, we should be able to design programs that utilise the parallelism already provided in computer hardware. Multithreading in Java is actually not parallelism in program execution. It seems a high-level programming language that exhibits parallel properties is yet to be developed.
>> I raised this question to increase our awareness of how software design has actually lagged behind the hardware design.
I doubt we needed to be made aware of that; you sound like you think we are idiots.
I write a lot of HPC for a living, and finding a good means of automatically discovering parallelism (by which I mean finding most of the possible parallelism in solving a problem) is a ridiculously difficult and probably impossible problem; it will likely take an AI on the level of a conscious person that can actually understand the problem to be solved and invent a method of doing it in parallel.
I know professors that have devoted most of their careers to developing automatically parallelizing compilers and to research into parallelization. It is kind of like the tens of thousands of physicists searching for a Theory of Everything: despite a massive effort and billions of dollars spent, they haven't found one, and all promising leads are petering out.
So I don't think there is a good "language" for parallelism; if there were, we'd have seen hints of it by now. To me, the best language for parallelism is probably C with pthreads, and the easiest to program is probably OpenMP. CUDA is not bad for some very specific kinds of problems that do not branch much.
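For example, here is a minimal OpenMP sketch in C (illustrative only, built with something like gcc -fopenmp) of why I call it the easiest: the programmer states through a pragma that the loop iterations are independent, and the runtime spreads them across whatever cores are available.

    /* Minimal sketch: the pragma carries the programmer's knowledge that
     * the iterations are independent; OpenMP schedules them across cores. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double x[N], y[N];
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            x[i] = i * 0.5;
            y[i] = x[i] * x[i];
            sum += y[i];
        }

        printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

The compiler does none of the discovery here; it only exploits the independence the programmer has already asserted.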
Hmm - either I don't understand your question or you don't understand my answer - most likely a bit of each:)
I don't believe it makes sense to talk about programming for a multicore without more context. The choice between concurrency and parallelism is all about context. If you have a strictly deterministic algorithm running on a (mostly) deterministic system, then parallelism will always be both easier and faster. If, however, your algorithm is highly nondeterministic, or your system is not isolated from external effects (such as running on a desktop operating system), then concurrency is probably the way to go!
So the choice is, as always in computer science, a question of how many boundary conditions you can specify for the problem at hand. In fact, I have just proposed a workshop entitled 'The role of Concurrency in the HPC Center' at this year's Communicating Process Architectures conference (I still don't know if it will be accepted), where the focus will be that there are indeed problems left where concurrency is still the answer.
As for languages.... I personally don't hire people who have a favourite programming language - different tools for different tasks - though I have yet to come across a problem in my domain where Java is the correct tool:)
Peter proposed Occam - it still exists and you should try it out. If you cannot be bothered to learn a whole new language just to try concurrency, I could (shamelessly) propose that you look into one of my own projects, PyCSP, which mixes CSP with Python for a sleeker learning curve. If you want to try strict synchronous parallelism you can stay in Python as well and express your parallelism as vectorization in NumPy.
Although strictly not necessary, parallel programming in high-performance computing almost always uses the Message Passing Interface (MPI) API to distribute a single job across many distributed resources. Typically there is a master and several slaves, each running as a separate process. The processes may or may not be executing the same piece of code at the same time, and they may independently work on different sections of the data. They may exchange the contents of their local memory periodically during the computation, as and when required. Jobs whose processes need not communicate at all during the computation are called "embarrassingly parallel". All components of a parallel job end at the same time when successful, although some processes may finish their part a little early when there is a load-balancing issue. In general there are two kinds of parallel computation: Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD). SIMT (Single Instruction, Multiple Threads) is what Graphics Processing Units (GPUs) normally use.
Concurrent jobs, on the other hand, need not exchange memory contents between the various independent threads at the same time. They may execute at the same time, and one process does not control the other during the computation. One can have an independent workflow program (a scientific workflow) which controls whether the jobs are executed sequentially or concurrently, depending on the algorithm.
One can say concurrent jobs are loosely coupled, whereas parallel MPI-like jobs are tightly coupled.
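To make that concrete, here is a minimal sketch of a typical MPI job in C (illustrative only): every process runs the same program, works on its own slice of the data, and the partial results are combined with a collective reduction on the master (rank 0).

    /* Minimal SPMD sketch: each process sums its own slice of 0..n-1,
     * then the partial sums are combined on rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        long n = 1000000, local = 0;
        for (long i = rank; i < n; i += size)   /* this process's share */
            local += i;

        /* the tightly coupled step: exchange/combine partial results */
        long global = 0;
        MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %ld from %d processes\n", global, size);

        MPI_Finalize();
        return 0;
    }

If the MPI_Reduce (and any other communication) were removed, the job would be embarrassingly parallel in the sense described above.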
Prakashan: I think you are mistaking your personal experience for a general one. MPI has been less than 5% of my focus in the last 10 years dedicated to HPC; most of my work is with C and pthreads, and I see a lot of work dedicated to multi-core and GPU applications. These typically do not communicate with any particular protocol, but by sharing common memory STRUCT or object definitions.
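For example, a minimal sketch of what I mean by sharing a common STRUCT (the names are made up for illustration): each worker thread gets a pointer to the same work descriptor and writes a disjoint slice of the output array, so no message-passing protocol is involved at all.

    /* Minimal sketch: threads communicate only through a shared struct
     * and a shared output array; writes are disjoint, so no lock is needed. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    #define NTHREADS 4

    struct work {               /* the common definition all threads share */
        const double *in;
        double *out;
        int begin, end;         /* this worker's slice */
    };

    static void *worker(void *arg)
    {
        struct work *w = arg;
        for (int i = w->begin; i < w->end; i++)
            w->out[i] = 2.0 * w->in[i];
        return NULL;
    }

    int main(void)
    {
        double in[N] = {1, 2, 3, 4, 5, 6, 7, 8}, out[N];
        pthread_t tid[NTHREADS];
        struct work w[NTHREADS];

        for (int t = 0; t < NTHREADS; t++) {
            w[t].in = in;
            w[t].out = out;
            w[t].begin = t * (N / NTHREADS);
            w[t].end = (t + 1) * (N / NTHREADS);
            pthread_create(&tid[t], NULL, worker, &w[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        for (int i = 0; i < N; i++)
            printf("%.1f ", out[i]);
        printf("\n");
        return 0;
    }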
In my experience MPI is mostly used for communications between nodes in a network of computers, and is not really "high performance" in the sense of getting close to peak machine performance on individual machines. It is more used for getting massively parallel performance on large but highly divisible problems where the latency of communications can be somewhat hidden.
Peter wrote "It seems a high-level programming language that exhibits parallel properties is yet to be developed." That is certainly not true. Fortran is probably the most obvious counterexample.
Fortran includes a DO CONCURRENT ... END DO construct that provides a loop structure with the added semantics that every iteration of the loop can be executed independently and the collection of iterations can be executed in any order. This allows the iterations to be spread across cores of a shared-memory domain. It is up to the compiler to decide the best scheme for doing this. In some ways it is similar to OpenMP, but avoids the problems associated with the fact that the OpenMP spec has lagged behind changes in the Fortran language.
Distributed memory parallelism is covered by the coarray syntax and semantics in Fortran. Each processor of a collection executes a copy of the whole program (each copy is called an "image"), with the processors/images executing at the same time. The language syntax allows you to reference or define variables on other images with simple assignment statements. Synchronization statements are also included to force barriers between images to get correct program ordering. This model is similar to MPI (both SPMD), but a lot easier to use. Also, unlike MPI, the compiler understands all of the Fortran typing system, so cross-image transfers of parts of an array, or complicated structures are much simpler. [MPI is still widely used with Fortran programs, and the MPI spec defines a Fortran interface. That said, the MPI model of library calls is really more similar to the philosophy of C, and I see its usage with Fortran probably declining over time.]
Given its target audience, Fortran is mainly used on large, parallel systems. Hence, it is not as well known in the general programming community. However, with multiple levels of hardware parallelism moving to smaller and smaller systems, the motivation to learn Fortran might spread.
Bill: I have been a Fortran programmer for 39 years. I am fond of it. I still use it for engineering and mathematical applications. I hate C++. But I do not think Fortran will ever catch back up to the productivity standards of C++, in terms of fast code production for business and other applications. That race is lost!
And Fortran does not offer anything fundamentally better than pthreads or OpenMP or MPI, all of which I have used professionally in the last few years for one project or another. They all still require a human programmer to formulate a problem in a parallelizable way, or tell the compiler the specific areas of code that can be safely parallelized.
In that sense, OpenMP is the best for flexibility IMO. I think pthreads is best when I want precise control for high performance. And MPI is needed if I am using multiple nodes on a communications network, instead of a multi-core system with shared memory.
I have yet to see a language that does a good job of discovering parallelism on its own. Students working on their first class in parallel programming can do a better job than compilers in this regard, because unlike the compilers they understand the problem to be solved. That kind of understanding is difficult to convey in a way that computers can understand, and until we figure that out I don't think we will have a good "language" for parallel programming, meaning one that lets the computer figure out the best way to exploit parallelism instead of relying upon the programmer to tell it what to do in parallel.
Anthony: I agree that the hard part of parallel programming is formulating the problem so it can execute in parallel. Practice and examples help, but it is not that easy to teach (or so experience would suggest). I do not agree that "MPI is needed...". Fortran is now natively SPMD parallel, and distributed-memory programming is built in. I write parallel code for distributed-memory parallel machines, and never use MPI.