Current approaches to empathetic response generation typically encode the
entire dialogue history directly and put the output into a decoder to generate
friendly feedback. These methods focus on modelling contextual information but
neglect capturing the direct intention of the speaker. We argue that the last
utterance in the dialogue empirically conveys the intention of the speaker.
Consequently, we propose a novel model named InferEM for empathetic response
generation. We separately encode the last utterance and fuse it with the entire
dialogue through the multi-head attention based intention fusion module to
capture the speaker's intention. Besides, we utilize previous utterances to
predict the last utterance, which simulates human's psychology to guess what
the interlocutor may speak in advance. To balance the optimizing rates of the
utterance prediction and response generation, a multi-task learning strategy is
designed for InferEM. Experimental results demonstrate the plausibility and
validity of InferEM in improving empathetic expression.