Visual neurons respond selectively to specific features that become increasingly complex in their form and dynamics from the eyes to the cortex. Retinal neurons prefer localized flashing spots of light, primary visual cortical (V1) neurons moving bars, and those in higher cortical areas, such as middle temporal (MT) cortex, favor complex features like moving textures. Whether there are general computational principles behind this diversity of response properties remains unclear. To date, no single normative model has been able to account for the hierarchy of tuning to dynamic inputs along the visual pathway. Here we show that hierarchical application of temporal prediction - representing features that efficiently predict future sensory input from past sensory input - can explain how neuronal tuning properties, particularly those relating to motion, change from retina to higher visual cortex. This suggests that the brain may not have evolved to efficiently represent all incoming information, as implied by some leading theories. Instead, the selective representation of sensory inputs that help in predicting the future may be a general neural coding principle, which when applied hierarchically extracts temporally-structured features that depend on increasingly high-level statistics of the sensory input.