Image fusion integrates a series of images acquired from different sensors, e.g., infrared and visible, to produce an image with richer information than any single input. Both traditional and recent deep learning-based methods struggle to preserve prominent structures and recover vital textural details in practical applications. In this article, we propose a deep network for infrared and visible image fusion that cascades a feature learning module with a fusion learning mechanism. First, we apply a coarse-to-fine deep architecture to learn multi-scale features of multi-modal images, which enables the discovery of prominent common structures for subsequent fusion operations. The proposed feature learning module requires no well-aligned image pairs for training. Compared with existing learning-based methods, it can therefore exploit numerous examples from each modality for training, strengthening its feature representation ability. Second, we design an edge-guided attention mechanism on top of the multi-scale features that steers the fusion toward common structures, thereby recovering details while attenuating noise. Moreover, we provide a new aligned infrared and visible image fusion dataset, RealStreet, collected in various practical scenarios for comprehensive evaluation. Extensive experiments on two benchmarks, TNO and RealStreet, demonstrate the superiority of the proposed method over the state-of-the-art in terms of both visual inspection and objective analysis on six evaluation metrics. We also conduct experiments on the FLIR and NIR datasets, which contain foggy weather and poor lighting conditions, to verify the generalization and robustness of the proposed method.
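To make the edge-guided fusion idea concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: it assumes PyTorch, uses fixed Sobel filters as a stand-in for the learned edge guidance, and fuses single-scale features with a per-pixel softmax over the two modalities' edge responses.

```python
import torch
import torch.nn.functional as F

def sobel_edges(x):
    # x: (B, 1, H, W) grayscale tensor; returns gradient magnitude as a rough edge map.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def edge_guided_fusion(feat_ir, feat_vis, ir, vis):
    # feat_*: (B, C, H, W) features from each modality; ir/vis: (B, 1, H, W) source images.
    # Edge maps act as spatial attention, so fusion favors regions with strong common structure.
    edge_ir = sobel_edges(ir)
    edge_vis = sobel_edges(vis)
    # Resize edge maps to the feature resolution (multi-scale features may be downsampled).
    edge_ir = F.interpolate(edge_ir, size=feat_ir.shape[-2:], mode='bilinear', align_corners=False)
    edge_vis = F.interpolate(edge_vis, size=feat_vis.shape[-2:], mode='bilinear', align_corners=False)
    # Softmax over the two modalities yields per-pixel fusion weights that sum to one.
    weights = torch.softmax(torch.cat([edge_ir, edge_vis], dim=1), dim=1)
    w_ir, w_vis = weights[:, :1], weights[:, 1:]
    return w_ir * feat_ir + w_vis * feat_vis

# Example usage with random tensors (shapes are hypothetical).
ir = torch.rand(1, 1, 256, 256)
vis = torch.rand(1, 1, 256, 256)
feat_ir = torch.rand(1, 64, 64, 64)
feat_vis = torch.rand(1, 64, 64, 64)
fused = edge_guided_fusion(feat_ir, feat_vis, ir, vis)
print(fused.shape)  # torch.Size([1, 64, 64, 64])
```

In the proposed method, the edge guidance and attention weights are learned within the network rather than computed from fixed filters; the sketch only conveys how edge responses can weight multi-scale features so that common structures dominate the fused result.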