Abstract

Since childhood, we experience speech as a combination of auditory and visual signals, with visual cues particularly beneficial in difficult listening conditions. This study investigates an alternative multisensory context for speech, namely audio-tactile, which could prove beneficial for rehabilitation in the hearing-impaired population. We show improved understanding of distorted speech in background noise when it is combined with low-frequency, speech-extracted vibrotactile stimulation delivered to the fingertips. This rapid effect may be related to the fact that the auditory and tactile signals carry the same type of information. Changes in functional connectivity following audio-tactile speech training are observed primarily in the visual system, including early visual regions, the lateral occipital cortex, the middle temporal motion area, and the extrastriate body area. Despite the lack of visual input during the task, these effects possibly reflect the automatic involvement of areas supporting lip-reading and spatial aspects of language, such as gesture observation, in difficult acoustic conditions. For audio-tactile integration, we show increased connectivity of a sensorimotor hub representing the entire body with the parietal system for motor planning based on multisensory inputs, along with several visual areas. After training, sensorimotor connectivity increases with higher-order and language-related frontal and temporal regions. Overall, the results suggest that the new audio-tactile speech task engages regions that partially overlap with the established brain network for audio-visual speech processing. This further indicates that neuronal plasticity related to perceptual learning is first built upon an existing structural and functional blueprint for connectivity. Further effects reflect task-specific behaviour related to body and spatial perception, as well as tactile signal processing.
A longer training regime may be required to strengthen direct pathways between the auditory and sensorimotor brain regions during audio-tactile speech processing.