Volume electron microscopy (vEM) is a state-of-the-art technique for visualizing the 3D structure of biological systems such as cells, tissues and organs at nanometer resolution. However, the conflicting requirements of large scale, high throughput and isotropic resolution remain a bottleneck for vEM. In this work, we develop EMformer, a method that leverages the video transformer architecture to boost axial resolution and achieve isotropic reconstruction. EMformer adopts a self-supervised strategy that requires no ground truth for training. Instead, it uses high-resolution lateral (in-plane) information to guide the recovery of missing axial information. More importantly, EMformer performs isotropic reconstruction and section inpainting simultaneously when sections are missing or unavoidably damaged. Unlike existing deep learning methods based on 2D models, EMformer makes full use of the three-dimensional spatial continuity of biological structures, and thus achieves higher resolution (improved by ~50% as measured by Fourier shell correlation, FSC) and more continuous and reliable ultrastructure reconstructions than existing methods. Moreover, EMformer supports arbitrary-scale isotropic reconstruction, including fractional anisotropy factors. This makes EMformer robust and transferable across 3D EM images of different modalities and different anisotropy factors, indicating its potential as a universal pre-trained isotropic reconstruction model for vEM. Experiments on simulated data constructed from the isotropic FIB-SEM dataset (EPFL) and on the real anisotropic ssTEM dataset (CREMI) demonstrate that EMformer achieves the best reconstruction, with higher performance metrics and lower uncertainty than competing methods, and that it improves segmentation efficiency and the accuracy of statistical analysis for various structures such as neurons, mitochondria, vesicles and bilayers.
EMformer will substantially improve the efficiency and throughput of isotropic reconstruction in vEM, and extend vEM to larger biological systems at higher resolution.
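To illustrate the self-supervision idea described above, the following minimal NumPy sketch builds a training pair by subsampling a high-resolution lateral axis so that it mimics the coarse axial sampling. It is not the EMformer implementation. The function names `make_training_pair` and `as_axial_input`, the (Z, Y, X) array layout, the integer anisotropy factor, and the random stand-in volume are all assumptions made for this sketch. A restoration model, not shown, would be trained to map `degraded` back to `target` and then applied along the axial direction.

```python
import numpy as np


def make_training_pair(volume, factor):
    """Build a (degraded, target) pair from an anisotropic volume.

    volume: array of shape (Z, Y, X) with fine sampling in Y and X.
    factor: assumed integer ratio between axial and lateral voxel size.
    The lateral Y axis is subsampled by `factor` to imitate the coarse
    axial sampling, so no isotropic ground truth is needed.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    # Crop Y so that its length is a multiple of the factor.
    usable = (volume.shape[1] // factor) * factor
    target = volume[:, :usable, :]
    degraded = target[:, ::factor, :]
    return degraded, target


def as_axial_input(volume):
    """Swap Z and Y so the coarse axial axis takes the place of the
    degraded axis that the model saw during training."""
    return np.swapaxes(volume, 0, 1)


# Stand-in data: 4 sections of 32 x 32 pixels, anisotropy factor 8.
rng = np.random.default_rng(0)
volume = rng.random((4, 32, 32))
degraded, target = make_training_pair(volume, factor=8)
axial_input = as_axial_input(volume)
```

In this sketch the degraded axis at training time and the axial axis at inference time both sit at position 1, which is what lets a single model trained on lateral data be reused for axial recovery. Fractional factors would require interpolation rather than strided slicing and are not covered here.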