"A statistic is sufficient for a family of probability distributions if the sample from which it is calculated gives no additional information than does the statistic, as to which of those probability distributions is that of the population from which the sample was taken". (Wikipedia)
The sufficient statistic T(X) is a function of the sample X=(x_1, x_2, ..., x_n), where the x_j are iid, and it relates to an unknown parameter Theta (scalar or vector-valued) that appears in the distribution. Whether T(X) is a sufficient statistic for Theta can be confirmed, for example, with the Fisher-Neyman factorization theorem, which is easy to find online.
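For reference, here is the factorization criterion written out, followed by a small worked case. The symbols f, g and h and the choice of the Bernoulli model are mine, picked only for illustration; \theta stands for the parameter called Theta above.

```latex
% Fisher-Neyman factorization: T(X) is sufficient for \theta if and only if
% the joint density (or pmf) of the sample can be written as
\[
  f(x_1, \dots, x_n; \theta) \;=\; g\bigl(T(x_1, \dots, x_n), \theta\bigr)\, h(x_1, \dots, x_n),
\]
% where h does not depend on \theta and g depends on the data only through T.

% Worked case (assumed model): x_j iid Bernoulli(\theta), x_j \in \{0, 1\}.
\[
  f(x_1, \dots, x_n; \theta)
  \;=\; \prod_{j=1}^{n} \theta^{x_j} (1 - \theta)^{1 - x_j}
  \;=\; \theta^{\sum_j x_j} (1 - \theta)^{\,n - \sum_j x_j}.
\]
% Taking T(X) = \sum_{j=1}^{n} x_j, g(t, \theta) = \theta^{t} (1 - \theta)^{n - t}
% and h \equiv 1 gives the required factorization, so the number of successes
% is sufficient for \theta.
```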
"Fitting a softmax function can be done using the iteratively reweighted least squares (IRLS) algorithm. We use the implementation from netlab. Note that since the softmax distribution is not in the exponential family, it does not have finite sufficient statistics, and hence we must store all the training data in uncompressed form. If this takes too much space, one should use online (stochastic) gradient descent (not implemented in BNT)."
So, maybe it is a technical question about fitting and not a mathematical one.
You should probably follow the guidance in the last link...
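To make the online alternative mentioned in the quoted passage a bit more concrete, here is a minimal sketch of stochastic gradient descent for a softmax (multinomial logistic) model in plain Python. This is not the BNT or netlab code; the function names, the toy data and the learning rate are all assumptions chosen for illustration. The point is only that each update uses a single example, so the full training set never has to be held in memory at once.

```python
import math

def softmax(scores):
    """Numerically stable softmax of a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, x):
    """Linear score of feature vector x under each class's weight vector."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]

def sgd_step(weights, x, label, lr):
    """One online update from a single (x, label) pair.

    Gradient of the cross-entropy loss w.r.t. the weights of class k
    is (p_k - 1[k == label]) * x.
    """
    probs = softmax(class_scores(weights, x))
    for k, w in enumerate(weights):
        err = probs[k] - (1.0 if k == label else 0.0)
        for i in range(len(w)):
            w[i] -= lr * err * x[i]

def predict(weights, x):
    """Index of the highest-scoring class."""
    scores = class_scores(weights, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical toy stream: two features plus a constant bias term, three classes.
stream = [
    ([1.0, 0.0, 1.0], 0),
    ([0.0, 1.0, 1.0], 1),
    ([-1.0, -1.0, 1.0], 2),
]

weights = [[0.0, 0.0, 0.0] for _ in range(3)]
# The tiny stream is replayed so the example is reproducible; in a real
# online setting each example would arrive once and could then be discarded.
for _ in range(200):
    for x, y in stream:
        sgd_step(weights, x, y, lr=0.5)

predictions = [predict(weights, x) for x, _ in stream]
print(predictions)
```

Only the weight vectors persist between updates, which is the trade-off the quote hints at: no compact sufficient statistic exists, so you either keep the data (IRLS) or stream through it (SGD).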
A sufficient statistic is a way of condensing the information provided by the data set into a smaller amount of data without losing any information about the unknown parameter. For details please see the attached file.
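A small numerical check may help make "no information lost" concrete. The sketch below assumes an iid Bernoulli model (my choice, not something stated above) and uses made-up samples: two different samples with the same number of successes give the same likelihood for every value of the parameter, and that likelihood can be recomputed from the pair (number of successes, sample size) alone.

```python
import math

def bernoulli_log_likelihood(sample, p):
    """Log-likelihood of an iid Bernoulli(p) sample of 0/1 values."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in sample)

def sufficient_stat(sample):
    """Condensed summary: (number of successes, sample size)."""
    return sum(sample), len(sample)

def log_likelihood_from_stat(t, n, p):
    """Same log-likelihood, computed from the summary only."""
    return t * math.log(p) + (n - t) * math.log(1.0 - p)

# Two hypothetical samples: different orderings, same number of successes.
sample_a = [1, 0, 1, 1, 0, 0, 1, 0]
sample_b = [0, 0, 0, 0, 1, 1, 1, 1]

results = []
for p in (0.2, 0.5, 0.7):
    ll_a = bernoulli_log_likelihood(sample_a, p)
    ll_b = bernoulli_log_likelihood(sample_b, p)
    t, n = sufficient_stat(sample_a)
    ll_stat = log_likelihood_from_stat(t, n, p)
    results.append((p, ll_a, ll_b, ll_stat))
    print(p, ll_a, ll_b, ll_stat)
```

For each p the three printed log-likelihoods agree up to floating-point rounding, so anything the full sample says about p is already contained in the summary.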