Purpose: Compounds that act on the central nervous system (CNS) are crucial tools in drug discovery and neuroscience. To discover compounds with novel mechanisms of action, researchers have developed behavioral screens in larval zebrafish, along with various methods to identify and classify hit compounds. However, these methods typically do not yield intuitive numerical scores of screen performance. This study describes methods that both classify compounds in zebrafish and quantify screen performance. Methods: We collected randomized, highly replicated data for two sets of compounds: 16 quality control (QC) compounds and a reference set of 648 known CNS ligands. Machine learning models were trained to discriminate between compound-induced phenotypes, compare performance between protocols, and detect hit compounds. Results: Classification accuracy on the QC set was 94.3%. In addition, 106 of the 648 CNS ligands were identified as phenotypically active, and hits were enriched for dopaminergic and serotonergic targets. The raw data are included to facilitate replication and data mining. Significance: This study describes methods to evaluate behavioral phenotyping assays, which can be used to facilitate comparison and standardization of data within the zebrafish phenotyping community.