Abstract

Successful lip reading requires a mapping from visual to phonological information [1]. Recently, visual and motor cortices have been implicated in tracking lip movements (e.g. [2]). It remains unclear, however, whether visuo-phonological mapping already occurs at the level of the visual cortex, that is, whether this structure tracks the acoustic signal in a functionally relevant manner. To elucidate this, we investigated how the cortex tracks (i.e. entrains to) absent acoustic speech signals carried by silent lip movements. Crucially, we contrasted the entrainment to unheard forward (intelligible) and backward (unintelligible) acoustic speech. We observed that the visual cortex exhibited stronger entrainment to the unheard forward acoustic speech envelope than to the unheard backward acoustic speech envelope. Supporting the notion of a visuo-phonological mapping process, this forward-backward difference in occipital entrainment was not present for the actually observed lip movements. Importantly, the respective occipital region received more top-down input during the forward display of lip movements, especially from left premotor, primary motor, and somatosensory regions and, to a lesser extent, from posterior temporal cortex. Strikingly, across participants, the extent of top-down modulation of the visual cortex stemming from these regions partially correlated with the strength of entrainment to the absent forward acoustic speech envelope, but not to the present forward lip movements. Our findings demonstrate that a distributed cortical network, including key dorsal stream auditory regions [3–5], influences the sensitivity of the visual cortex to the intelligibility of speech while it tracks silent lip movements.
Highlights

Visual cortex tracks the unheard forward acoustic speech envelope better than the backward one
Effects are not “trivially” caused by correlation of the visual with the acoustic signal
Stronger top-down control of visual cortex during forward display of lip movements
Top-down influence correlates with the visual cortical entrainment effect
Results seem to reflect visuo-phonological mapping processes