Recent studies suggest that cardiac amyloidosis (CA) is significantly underdiagnosed. For rare diseases like CA, the optimal selection of cases and controls for artificial intelligence model training is unknown and can significantly impact model performance. This study evaluates the performance of electrocardiogram (ECG) waveform-based artificial intelligence models for CA screening and assesses the impact of different criteria for defining cases and controls. Using a primary cohort of ∼1.3 million ECGs from 341,989 patients, models were trained using different case and control definitions. Case definitions included ECGs from patients with an amyloidosis diagnosis by International Classification of Diseases (ICD)-9/10 code, patients with CA, and patients seen in a CA clinic. Models were then tested on test cohorts with identical selection criteria as well as on a Cedars-Sinai general patient population cohort. In matched held-out test data sets, model AUCs ranged from 0.660 (95% CI: 0.642-0.736) to 0.898 (95% CI: 0.868-0.924). However, algorithms exhibited variable generalizability when tested on the Cedars-Sinai general patient population cohort, with AUCs ranging from 0.467 (95% CI: 0.443-0.491) to 0.898 (95% CI: 0.870-0.923). Models trained on better-curated patient cases achieved higher AUCs on similarly constructed test cohorts. However, all models performed similarly in the overall Cedars-Sinai general patient population cohort. A model trained with ICD-9/10 cases and population controls matched for age and sex resulted in the best screening performance. Models performed similarly in population screening regardless of the stringency of the case definition used during training, showing that institutions without dedicated amyloid clinics can train meaningful models on less curated CA cases. Additionally, AUC or other metrics alone are insufficient for evaluating deep learning algorithm performance. Instead, evaluation in the most clinically meaningful population is key.
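The AUCs above are reported with 95% confidence intervals, but the abstract does not say how those intervals were obtained. As a minimal sketch, assuming a percentile bootstrap over ECG-level predictions, an AUC and its interval could be computed as follows. The function names `auc` and `bootstrap_auc_ci` and all input values are hypothetical illustrations, not the study's code or data.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Uses the rank-sum (Mann-Whitney) formulation; tied scores share an average rank.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one control")

    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method, see lead-in)."""
    if not 0 < sum(labels) < len(labels):
        raise ValueError("AUC needs at least one case and one control")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy values for illustration only; these are not study data.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
point_estimate = auc(toy_labels, toy_scores)  # 0.75
ci_lower, ci_upper = bootstrap_auc_ci(toy_labels, toy_scores, n_boot=200)
```

This sketch resamples individual predictions. When a cohort contains several ECGs per patient, as the primary cohort here does, resampling whole patients rather than single ECGs would respect the within-patient correlation. Running the same functions on each test cohort separately is one way to compare a matched test set with a general-population cohort.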