In ecological studies, machine learning models are increasingly used for the automatic processing of camera trap images. Although this automation facilitates and accelerates the identification step, the results of these models may lack interpretability, and their immediate applicability to downstream ecological tasks (e.g., occupancy estimation) remains questionable. In particular, little is known about their calibration, the property that guarantees that confidence scores can be reliably interpreted as probabilities that a model's predictions are correct. Using a large and diverse European camera trap dataset, we investigate whether deep learning models for species classification in camera trap images are well calibrated or, on the contrary, over- or under-confident. Additionally, as camera traps are often configured to take multiple photos of the same event, we also explore the calibration of predictions at the sequence level. Finally, we study the effect and practicality of a post-hoc calibration method, namely temperature scaling, for predictions made at the image and sequence levels. Based on five established models and three independent test sets, our findings show that, with the right methodology, it is possible to enhance the interpretability of confidence scores, with clear implications, for instance, for the calculation of error rates or the selection of confidence-score thresholds in ecological studies that make use of artificial intelligence models.
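To make the post-hoc calibration step concrete, the sketch below shows a minimal implementation of temperature scaling: a single scalar T > 0 is fitted on held-out validation logits by minimising the negative log-likelihood, and test-time logits are then divided by T before the softmax. This is an illustrative sketch, not the paper's implementation; the variable names (`val_logits`, `val_labels`, `test_logits`) and the optimiser choice are assumptions, and the aggregation used for sequence-level predictions may differ from the per-image case shown here.

```python
# Minimal sketch of temperature scaling (post-hoc calibration).
# Assumed inputs: `val_logits` (N x K pre-softmax scores from a held-out
# validation set) and `val_labels` (N integer class indices) -- both
# hypothetical placeholders, not artifacts of the paper.
import numpy as np
from scipy.special import logsumexp, softmax
from scipy.optimize import minimize_scalar


def fit_temperature(val_logits: np.ndarray, val_labels: np.ndarray) -> float:
    """Find the scalar T > 0 minimising the NLL of softmax(logits / T)."""

    def nll(temperature: float) -> float:
        scaled = val_logits / temperature
        # Numerically stable log-softmax: log p_k = z_k - logsumexp(z)
        log_probs = scaled - logsumexp(scaled, axis=1, keepdims=True)
        # Mean negative log-probability assigned to the true classes
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()

    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return result.x


def calibrate(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Rescale logits by the fitted temperature and return probabilities.

    T > 1 softens over-confident scores; T < 1 sharpens under-confident ones.
    """
    return softmax(logits / temperature, axis=1)


# Usage with random stand-in data (10 classes):
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(500, 10))
val_labels = rng.integers(0, 10, size=500)
T = fit_temperature(val_logits, val_labels)
test_logits = rng.normal(size=(5, 10))
calibrated_probs = calibrate(test_logits, T)
```

Because dividing all logits by the same positive constant preserves their ranking, temperature scaling leaves the predicted class (and hence accuracy) unchanged; only the confidence scores attached to those predictions are adjusted.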