Recent advances in large language models (LLMs) have brought significant changes to education, particularly in the generation and evaluation of feedback. By streamlining tasks such as content creation, feedback generation, and assessment, LLMs can reduce teachers' workload and improve the efficiency of online education. This study examined the consistency and reliability of LLMs as evaluators by conducting automated evaluations with several LLMs against five educational evaluation criteria. The analysis revealed that while LLMs produced consistent evaluations under certain conditions, consistency broke down both among evaluators and across models for other criteria. Notably, low agreement among human evaluators correlated with reduced reliability in the corresponding LLM evaluations. Furthermore, evaluation results varied with factors such as prompting strategy and model architecture, highlighting the complexity of achieving reliable assessment with LLMs. These findings suggest that while LLMs have the potential to transform educational systems, careful selection and combination of models are essential to improve their consistency and align their performance with human evaluators in educational settings.
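
To make the notion of evaluator agreement concrete, the sketch below (not taken from the study; a minimal example under assumed data) shows one common way to quantify pairwise agreement between human and LLM evaluators on a single criterion, using quadratic-weighted Cohen's kappa from scikit-learn. The rating values, evaluator names, and 1–5 scale are hypothetical.

```python
# Minimal sketch (not from the study): quantifying inter-evaluator agreement
# on one evaluation criterion with quadratic-weighted Cohen's kappa.
# Ratings, evaluator names, and the 1-5 scale are hypothetical assumptions.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings (1-5 scale) for ten feedback samples on one criterion.
ratings = {
    "human_1": [4, 3, 5, 2, 4, 3, 5, 4, 2, 3],
    "human_2": [4, 2, 5, 3, 4, 3, 4, 4, 2, 3],
    "llm_a":   [5, 3, 5, 2, 4, 4, 5, 4, 3, 3],
    "llm_b":   [3, 3, 4, 2, 5, 3, 5, 3, 2, 4],
}

# Pairwise quadratic-weighted kappa: penalizes large disagreements more,
# which suits ordinal rating scales.
for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
    kappa = cohen_kappa_score(a, b, weights="quadratic")
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")
```

In a setting like the one described above, a criterion with low human–human kappa would be flagged as one where LLM evaluation reliability is also likely to suffer.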