Recently, some work has been done on planning and learning in Non-Markovian Decision Processes, that is, decision-making with temporally extended rewards. In these settings, a reward is received only when a particular temporal logic formula (for example, an LTL or CTL formula) is satisfied. However, I cannot find any work on learning which rewards correspond to which temporally extended behaviors.
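To make the setting concrete, here is a minimal sketch of a temporally extended reward. It is a hypothetical toy example, not taken from any of the work mentioned above: the labels `"a"` and `"b"` and the function name `temporally_extended_rewards` are assumptions made for illustration. The reward is paid the first time `b` is observed after `a` has been observed earlier, roughly the behavior "eventually a, and after that eventually b".

```python
from typing import Iterable, List


def temporally_extended_rewards(trace: Iterable[str]) -> List[float]:
    """Return one reward per step of a trace of observation labels.

    Hypothetical toy reward: 1.0 at the first step where "b" is observed
    after "a" has already been observed at an earlier step, else 0.0.
    """
    seen_a = False   # memory: has "a" occurred so far?
    paid = False     # memory: has the reward already been given?
    rewards = []
    for label in trace:
        reward = 0.0
        if not paid:
            if seen_a and label == "b":
                reward = 1.0
                paid = True
            elif label == "a":
                seen_a = True
        rewards.append(reward)
    return rewards


# The label "b" appears three times but is rewarded only once,
# so the reward depends on the history and not on the current label alone.
print(temporally_extended_rewards(["b", "a", "c", "b", "b"]))
```

The two booleans act as the extra memory that a purely Markovian reward function over the current observation would not have.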
In my searches, I came across k-order MDPs (which are non-Markovian), but I did not find any RL research on them.