Is there any known analytical result or (very) fast algorithm (with proof of correctness) giving the maximum likelihood sample when drawing from the Multinomial MULT(N ; P1 ... PM), with N
My previous answer assumed you meant 'maximum likelihood estimator'. I hope that is what you meant. Or did you mean 'take the most likely sample'? I'm not sure that makes sense, but it is easy enough to take a random sample. However, each simulated observation needs up to M-1 comparisons, and if M is large this could be time-consuming. For efficiency you could order the pi from greatest to smallest.
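For what it's worth, here is a minimal Python sketch of that simulation approach, visiting the classes in descending order of probability so each draw needs at most M-1 comparisons (the function name and interface are mine, not anything standard):

```python
import random

def sample_multinomial(p, n, rng=random.random):
    """Draw one sample from Mult(n; p) as n categorical draws.

    Classes are visited in descending order of probability, so each
    draw needs at most M-1 comparisons and usually far fewer.
    """
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    counts = [0] * len(p)
    for _ in range(n):
        u = rng()
        cum = 0.0
        for i in order[:-1]:      # at most M-1 comparisons
            cum += p[i]
            if u < cum:
                counts[i] += 1
                break
        else:                     # none matched: the last class gets it
            counts[order[-1]] += 1
    return counts
```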
On further thought, perhaps there is a way to choose the most likely sample. If N = 1, obviously we should choose the class with the greatest pi. As in some apportionment methods used for proportional-representation elections or the US House of Representatives, allocate the sample members one by one, maintaining the maximal property at each stage. Can that be done? If n observations have been allocated so far and class i already has xi of them, then allocating the next observation to that class multiplies the probability by (n+1) pi/(xi+1). The factor (n+1) is the same whichever class receives the observation, so allocate to whichever class has the greatest value of pi/(xi+1).
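As a rough illustration (the function name is mine), the one-at-a-time rule is just a loop; each step scans the M classes, so the whole allocation costs O(NM):

```python
def most_probable_sample(p, n):
    """Greedy sketch: give each of the n observations to the class
    with the currently largest p_i / (x_i + 1)."""
    x = [0] * len(p)
    for _ in range(n):
        i = max(range(len(p)), key=lambda j: p[j] / (x[j] + 1))
        x[i] += 1
    return x
```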
This method gives the maximum increase at each stage, but it is still necessary to show that after the allocation it is not possible to move an observation between classes without reducing the probability. Edward Huntington investigated several criteria for rearrangements in this way. In your case, transferring an observation from class i to class j multiplies the probability by a factor of pj xi/(pi (xj+1)).
Because of the way we did the allocation, this ratio is at most 1. When class i received its most recent observation, the rule preferred it to class j, so pi/xi ≥ pj/(x'j+1), where x'j was class j's count at that time; counts never decrease, so x'j ≤ xj, and hence pj xi/(pi (xj+1)) ≤ 1. The same argument applies whether or not class i was the latest class to be selected.
So I think we have proved the allocation method correct. I think this coincides with Jefferson's method, which has been used in the past for the US Congress.
Reduce the problem to a binomial one (argmax(P) versus all the others), allocate optimally between the two groups, then recursively treat the remaining groups.
But I was quite unsure that this greedy allocation was indeed optimal (extensive simulations do indicate that it is...); thanks for the proof!
The way you describe your method, it is not quite the same as my suggestion, but maybe it is equivalent (or maybe not). My suggestion was to add one observation at a time so that the sample size increases by 1 each time.
All methods of apportionment have paradoxes. The grouping paradox is that you can get a different allocation from the direct one by dividing the classes into groups, allocating the sample to these groups, and then allocating within the groups.
When N < M, notice that the most probable sample gives no observations to at least M-N classes, whereas a random sample would give all classes a chance of selection. This demonstrates the grouping paradox, because one group could contain all the classes that would have no observations under the most probable allocation; in many cases grouping would then give some of them observations.
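To make that concrete, here is a small Python sketch comparing the direct allocation with a grouped one; the probabilities are made up purely for illustration, and the helper just restates the greedy rule from above so the snippet is self-contained:

```python
def greedy_alloc(p, n):
    """Allocate n observations by the largest p_i/(x_i+1) rule."""
    x = [0] * len(p)
    for _ in range(n):
        i = max(range(len(p)), key=lambda j: p[j] / (x[j] + 1))
        x[i] += 1
    return x

p = [0.3, 0.3, 0.1, 0.1, 0.2]                # illustrative probabilities, M = 5
direct = greedy_alloc(p, 2)                  # e.g. [1, 1, 0, 0, 0]

# Grouped: first split N = 2 between {1,2} and {3,4,5}, then allocate within groups.
n_a, n_b = greedy_alloc([0.6, 0.4], 2)       # one observation per group
grouped = greedy_alloc(p[:2], n_a) + greedy_alloc(p[2:], n_b)   # e.g. [1, 0, 0, 0, 1]
```

With these made-up numbers the grouped allocation gives class 5 an observation that the direct allocation withholds, which is exactly the paradox described above.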
I am not sure that grouping into the largest pi and the rest would necessarily avoid the grouping paradox, although perhaps it would. But it is faster than my one-by-one approach.
My rule was: allocate to the largest pi/(xi+1). So at each stage we divide each pi (or the expected value Npi) by one more than the number already allocated to that class, and the class with the greatest value gets the next observation. This is d'Hondt's method, which is known to be equivalent to Jefferson's method.
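When M is large, one way to keep the one-by-one rule cheap is a priority queue on the quotients pi/(xi+1), so each step costs O(log M) rather than O(M); a sketch (naming is mine):

```python
import heapq

def dhondt_alloc(p, n):
    """d'Hondt-style allocation: repeatedly give an observation to the
    class with the largest quotient p_i / (x_i + 1), using a heap."""
    x = [0] * len(p)
    # heapq is a min-heap, so store negated quotients.
    heap = [(-pi, i) for i, pi in enumerate(p)]   # initial quotient p_i / 1
    heapq.heapify(heap)
    for _ in range(n):
        _, i = heapq.heappop(heap)
        x[i] += 1
        heapq.heappush(heap, (-p[i] / (x[i] + 1), i))
    return x
```

Only the popped class's quotient changes at each step, so reinserting just that one entry keeps the heap correct.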
Jefferson's method should be faster. It is an iterative approach which allocates the whole sample all at once, but needs further adjustment. First calculate the expected values (which are not integers) and round them down; this is the first trial allocation. The resulting sample size will be too small, so scale the expected numbers up and allocate again. Repeat, scaling up if the sample size is still too small and down if it is too large. A binary search should succeed in about 3 or 4 iterations. The trick is to automate the choice of scale factors.
The worst case is when max(Npi) < 1 so the first iteration gives a total sample size of 0.
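One way to automate the scale factors is a binary search on a single multiplier s, taking xi = floor(s*pi); the bounds below are my own choice: s = 0 gives a total of 0 (covering the worst case just mentioned), and s = N + M gives a total of at least N because floor(s*pi) > s*pi - 1. Exact ties among the s*pi would need a tie-breaking rule, which this sketch omits:

```python
def jefferson_alloc(p, n, iters=100):
    """Jefferson/divisor-method sketch: tune a multiplier s so that
    the rounded-down allocations x_i = floor(s * p_i) sum to n."""
    lo, hi = 0.0, float(n + len(p))      # total is 0 at lo and at least n at hi
    for _ in range(iters):
        s = (lo + hi) / 2.0
        x = [int(s * pi) for pi in p]    # round each scaled value down
        total = sum(x)
        if total == n:
            return x
        lo, hi = (s, hi) if total < n else (lo, s)
    raise ValueError("hit a tie in s * p_i; an explicit tie-break is needed")
```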