I am trying to understand Q-learning, so I tried my hand at a 3 by 3 grid world in Python. The program runs, but Q-learning does not converge even after several episodes. Could someone please go through my implementation and let me know what I am not implementing correctly? I have attached the code. All that is required to run it is Python 3 and NumPy. I have also commented each line of the code to make it easier to follow.