It's been quite some time since I dug into FPGA softcore implementations and related topics... Anyway, not much has changed since then:
Softcores (not only ARM) are notoriously slow performers. The rule "the simpler, the faster" applies, as more complex architectures require more logic elements, and every element on a path adds delay. The longest delay path defines the maximum usable clock speed.
Softcores are quite resource-hungry: even simple architectures require a lot of logic elements to implement. It seems the Cortex-M1 has been optimized for softcore applications.
So much for the limitations and performance of softcores. OTOH, the number of hardcores is limited to what the manufacturer has decided to design in. Although this shouldn't matter much, considering the resource consumption of the softcores...
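To put a number on the "longest delay path" point, here is a minimal Python sketch. The function name and the path delays are made up for illustration (they are not from any real timing report), and it ignores setup time, clock skew and jitter, so treat it as a first-order estimate only.

```python
def max_clock_mhz(path_delays_ns):
    """Estimate the maximum clock in MHz from register-to-register path delays in ns.

    First-order model: the clock period cannot be shorter than the slowest path.
    """
    critical_ns = max(path_delays_ns)  # the longest (critical) path sets the limit
    return 1000.0 / critical_ns        # period in ns -> frequency in MHz


# Hypothetical path delays for a simple and a more complex core.
simple_core = [4.0, 5.0, 3.2]
complex_core = [4.0, 5.0, 12.5]  # one long path through many logic elements

print(max_clock_mhz(simple_core))   # 200.0
print(max_clock_mhz(complex_core))  # 80.0
```

Note that a single slow path is enough to drag the whole design down, no matter how fast the other paths are.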
Generally speaking, an FPGA implementation of a given design (in this case an ARM processor) will always be slower and more power-hungry than an ASIC implementation, as is the case for a softcore vs. a hardcore implementation of a processor in a SoC.
The reason is that the flexibility of an FPGA device (you can define a custom circuit after the device has been manufactured, and change it as often as you need) comes from a quite complex system of programmable interconnections between general-purpose basic cells. That means longer paths and larger parasitics, which lead to higher power consumption and lower operating frequencies.
In a hardcore version of a processor, you have only the logic cells you need and the shortest paths between them.
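As a rough illustration of why larger parasitics cost power, the usual first-order estimate for dynamic power is P = alpha * C * V^2 * f (activity factor, switched capacitance, supply voltage, clock frequency). The sketch below uses that formula with purely hypothetical numbers; the 10x capacitance ratio is an assumption chosen for the example, not a measured FPGA vs. ASIC figure.

```python
def dynamic_power_w(activity, capacitance_f, vdd_v, freq_hz):
    """First-order dynamic power estimate: P = alpha * C * V^2 * f (in watts)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical values: same activity, voltage and clock, different switched capacitance.
hard = dynamic_power_w(activity=0.5, capacitance_f=2e-9, vdd_v=1.0, freq_hz=100e6)
soft = dynamic_power_w(activity=0.5, capacitance_f=20e-9, vdd_v=1.0, freq_hz=100e6)

print(f"hardcore-like: {hard:.2f} W")
print(f"softcore-like: {soft:.2f} W")
```

With everything else held equal, the estimate scales linearly with the switched capacitance, which is where the programmable interconnect hurts.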
U. Dreher explained it quite well, but I would add one more simple point.
If you compare the area on the fabric that is covered by a hardcore ARM processor and by a softcore ARM processor, the latter takes up far more than the former.
That is because the FPGA building blocks (Slices, CLBs, LUTs, BRAMs) are so generic that they are not ideally suited to any one implementation.
Not always, but in general, the more area an implementation takes, the more power-hungry and slower it will be on the FPGA's fabric.