ReLU is usually used for hidden layers; it avoids the vanishing gradient problem. Try it. For the output layer, use softmax to get probabilities over the possible outputs (see the sketch below).
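A minimal sketch of that setup, assuming tf.keras is available; the layer sizes, optimizer, and number of epochs are illustrative choices, not part of the answer above.

```python
# ReLU hidden layer + softmax output on MNIST (illustrative hyperparameters).
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),                        # 28x28 image -> 784-vector
    tf.keras.layers.Dense(128, activation="relu"),    # ReLU hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # softmax over the 10 digits
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```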
Why do you think that the activation function is the cause of the poor results? There may be dozens of much more likely causes: a poor image representation, too many or too few variables, too many or too few hidden neurons, a poor training algorithm, a poor model (e.g. one-vs-all instead of one-vs-one), etc. There is a huge literature on the MNIST database. Use the tanh activation function for the hidden layer and for the output layer, and think about more important things.
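For comparison, a hypothetical sketch of the tanh-for-both-layers setup described above, again assuming tf.keras. With a tanh output layer, the targets are one-hot vectors and the loss is squared error rather than softmax plus cross-entropy; this choice and all hyperparameters are my assumptions, not part of the answer.

```python
# tanh hidden and output layers on MNIST, trained on one-hot targets with MSE.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train_1h = tf.keras.utils.to_categorical(y_train, 10)
y_test_1h = tf.keras.utils.to_categorical(y_test, 10)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="tanh"),  # tanh hidden layer
    tf.keras.layers.Dense(10, activation="tanh"),   # tanh output layer
])

model.compile(optimizer="adam", loss="mse",
              metrics=["categorical_accuracy"])
model.fit(x_train, y_train_1h, epochs=5,
          validation_data=(x_test, y_test_1h))
```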
I agree with Gerard Dreyfus; you should follow his advice. One thing is sure: you have to read a lot, work a lot, and try many things. Then you will develop a sense of which solution fits which kind of problem.