Speech emotion recognition is a critical component for achieving natural human–robot interaction. The modulation-filtered cochleagram is a feature based on auditory modulation perception, which contains multi-dimensional spectral–temporal modulation representation. In this study, we propose an emotion recognition framework that utilizes a multi-level attention network to extract high-level emotional feature representations from the modulation-filtered cochleagram. Our approach utilizes channel-level attention and spatial-level attention modules to generate emotional saliency maps of channel and spatial feature representations, capturing significant emotional channel and feature space from the 3D convolution feature maps, respectively. Furthermore, we employ a temporal-level attention module to capture significant emotional regions from the concatenated feature sequence of the emotional saliency maps. Our experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset demonstrate that the modulation-filtered cochleagram significantly improves the prediction performance of categorical emotion compared to other evaluated features. Moreover, our emotion recognition framework achieves comparable unweighted accuracy of 71% in categorical emotion recognition by comparing with several existing approaches. In summary, our study demonstrates the effectiveness of the modulation-filtered cochleagram in speech emotion recognition, and our proposed multi-level attention framework provides a promising direction for future research in this field.