Grading tea leaves efficiently in a natural environment is a crucial technological foundation for the automation of tea-picking robots. In this study, to solve the problems of dense distribution, limited feature-extraction ability, and false detection in the field of tea grading recognition, an improved YOLOv8n model for tea grading and counting recognition was proposed. Firstly, the SPD-Conv module was embedded into the backbone of the network model to enhance the deep feature-extraction ability of the target. Secondly, the Super-Token Vision Transformer was integrated to reduce the model’s attention to redundant information, thus improving its perception ability for tea. Subsequently, the loss function was improved to MPDIoU, which accelerated the convergence speed and optimized the performance. Finally, a classification-positioning counting function was added to achieve the purpose of classification counting. The experimental results showed that, compared to the original model, the precision, recall and average precision improved by 17.6%, 19.3%, and 18.7%, respectively. The average precision of single bud, one bud with one leaf, and one bud with two leaves were 88.5%, 89.5% and 89.1%. In this study, the improved model demonstrated strong robustness and proved suitable for tea grading and edge-picking equipment, laying a solid foundation for the mechanization of the tea industry.