Erlang is very good for massively parallel applications. It has a proven track record in telecommunications systems and is also used by large service providers such as Amazon, Yahoo, Facebook, Motorola, and Ericsson.
What kind of requests should those threads respond to? Without any details, one might answer: just use any Turing-complete language with a socket communication library. Guile Scheme, for instance. It is a beautiful language.
Timing is language-independent. Some languages may provide "faster" responses than others, and if you're looking for "the fastest one", that will be (at least somewhat) dependent on your application code (the algorithms you use in the threads, Bubble sort versus Quicksort, for example) and the hardware environment (cache-sharing behavior of your algorithm, number of hardware cores, etc.).
To find the fastest, you'll likely have to benchmark your actual code in several languages (I don't know whether that's your goal or not).
Also, it is not clear from your description whether you want to time each thread individually (giving 400 results) or just in aggregate (how long from the start of the first thread to the completion of the 400th).
Whatever language, you'll have to do something like the following:
For "aggregate" time (start of 1st thread to completion of 400th):
1. Create all 400 threads BUT DON'T YET START them (unless you want to include thread creation overhead in the results)
2. get current time (finest resolution available - nanoseconds would be best, unless your threads will run for several seconds each)
3. start all 400 threads
4. wait for all 400 threads to complete
5. Get the current time and subtract the start time; the difference is the aggregate time
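The five steps above can be sketched as follows. This is a minimal sketch in Python (the question doesn't name a language), with a `time.sleep` placeholder standing in for the real per-thread request logic:

```python
import threading
import time

def worker():
    # Placeholder workload; substitute the real per-thread request logic.
    time.sleep(0.01)

NUM_THREADS = 400

# Step 1: create all the threads, but don't start them yet.
threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]

# Step 2: take the start timestamp at the finest resolution available.
start_ns = time.perf_counter_ns()

# Step 3: start all threads.
for t in threads:
    t.start()

# Step 4: wait for all of them to complete.
for t in threads:
    t.join()

# Step 5: current time minus start time = aggregate elapsed time.
elapsed_ns = time.perf_counter_ns() - start_ns
print(f"aggregate: {elapsed_ns / 1e6:.1f} ms")
```

The same structure carries over to any language with a thread API and a high-resolution monotonic clock.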
For individual times (each thread has its own time):
1. Create a timing array of 400 elements, each element will contain a place for both a start and a stop time.
2. Create all 400 threads BUT DON'T start them yet (unless creation overhead needs to be included)
3. Loop through all 400 threads; for each one, get the current time (fine resolution, as above), save it in the "start" field of that thread's timing-array element, and start the thread
4. As each thread completes, get the current time and store it in the "stop" field of the array element for that thread
After all 400 threads complete, your 400-element array will be populated with the start and stop times for each thread. You can loop through the array, subtract start time from stop time for each thread, and perform whatever statistical analysis you desire on the times (average, standard deviation, median, whatever).
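A sketch of the individual-timing procedure in Python. One assumption here: to implement step 4 without the main thread having to observe completions, each thread records its own stop time as its last action (the start time is still recorded by the main thread, per step 3):

```python
import statistics
import threading
import time

NUM_THREADS = 400

# Step 1: timing array with a start and stop slot per thread.
times = [{"start": None, "stop": None} for _ in range(NUM_THREADS)]

def worker(i):
    time.sleep(0.005)  # placeholder workload
    # Step 4: record the stop time the moment this thread's work is done.
    times[i]["stop"] = time.perf_counter_ns()

# Step 2: create all threads, but don't start them yet.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]

# Step 3: record each start time immediately before starting that thread.
for i, t in enumerate(threads):
    times[i]["start"] = time.perf_counter_ns()
    t.start()

for t in threads:
    t.join()

# Post-processing: per-thread durations plus whatever statistics you want.
durations_ms = [(e["stop"] - e["start"]) / 1e6 for e in times]
print(f"mean={statistics.mean(durations_ms):.2f} ms "
      f"median={statistics.median(durations_ms):.2f} ms "
      f"stdev={statistics.stdev(durations_ms):.2f} ms")
```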
Note that the "main" thread (the one that creates, starts, and waits for completion of the threads) will be a 401st thread...
Note that you could also pass each thread a pointer to its own timing-array element and let each thread store its own start and stop times. The advantage is that thread scheduling/teardown overhead is eliminated from your numbers (making them more representative of actual thread run time). However, it may cause more cache contention in a multi-CPU environment, which could affect your results if thread duration is short; and if your goal is to find the fastest, thread-scheduling overhead may itself be something you want to capture.
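That variant, where each thread timestamps itself, might look like this in Python (again a sketch with a placeholder workload; each thread gets a reference to its own slot instead of a C-style pointer):

```python
import threading
import time

def worker(slot):
    slot["start"] = time.perf_counter_ns()  # thread records its own start...
    time.sleep(0.005)                       # placeholder workload
    slot["stop"] = time.perf_counter_ns()   # ...and its own stop

NUM_THREADS = 400
slots = [{"start": None, "stop": None} for _ in range(NUM_THREADS)]
threads = [threading.Thread(target=worker, args=(slots[i],))
           for i in range(NUM_THREADS)]

for t in threads:
    t.start()
for t in threads:
    t.join()

# Scheduling/teardown overhead is excluded: only the code between the
# two timestamps taken inside each thread is measured.
run_times_ms = [(s["stop"] - s["start"]) / 1e6 for s in slots]
print(f"min={min(run_times_ms):.2f} ms  max={max(run_times_ms):.2f} ms")
```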
I have 2 groups of services (flight and hotel reservation) to integrate; each group consists of 4 web services with similar functionality.
I need to record the response time of 400 threads during parallel execution in a distributed environment. I want to send 400 requests at the same time and record them individually for each state, and each state has its own pattern (if t5 sec delay, Symbol B; not replied for too long, C).
From the above, it appears that you want responsiveness (fast transaction completion times)?
Keep in mind that there is a performance penalty for task switching between threads ("context switching"), and cache-management performance issues arise if the same chunk of data (cache line) is required by different physical cores within short periods of time.
In addition, there is increased memory load: stack space is required for each thread, and 400 threads times a 1 MB stack each = 400 MB just for the stacks (although you can change the per-thread stack size).
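In Python, for example, the per-thread stack size can be lowered with `threading.stack_size` before the threads are created. The 256 KiB value below is purely illustrative; platform minimums vary, and some systems reject values below roughly 32 KiB:

```python
import threading

# Request a smaller per-thread stack (256 KiB instead of the platform
# default, often 1-8 MB) for threads created after this call.
threading.stack_size(256 * 1024)

results = []
t = threading.Thread(target=lambda: results.append("ok"))
t.start()
t.join()
print(results)
```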
If you run 400 threads on, say, 32 cores, you will have a lot of context-switching overhead (in this case, 12 or 13 threads per core, each contending for access to a CPU).