Optimizing the harvesting of commonly cultivated crops is important for agricultural industrialization. Machine vision now enables automated crop identification, which improves harvesting efficiency, but challenges remain. This study presents a new framework that combines two separate Convolutional Neural Network (CNN) architectures to simultaneously perform crop detection and harvesting (robotic manipulation) in a simulated environment. To generate the dataset, crop images from the simulated environment are augmented with random rotations, cropping, and brightness and contrast adjustments. The You Only Look Once (YOLO) framework, with traditional Rectangular Bounding Boxes (R-Bbox), is used for crop localization. The proposed method then passes the acquired image data to a Visual Geometry Group (VGG) model to determine the grasping positions for robotic manipulation.
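
The augmentation step described above can be illustrated with a minimal Python sketch. It is not the study's pipeline: it assumes grayscale images stored as nested lists, restricts rotations to multiples of 90 degrees, and uses hypothetical parameter ranges and function names (`rotate90`, `random_crop`, `adjust_brightness_contrast`, `augment`), none of which come from the source.

```python
import random


def rotate90(img, k):
    """Rotate a 2-D list image clockwise by k quarter turns."""
    for _ in range(k % 4):
        # Reverse the row order, then transpose: one clockwise quarter turn.
        img = [list(row) for row in zip(*img[::-1])]
    return img


def random_crop(img, out_h, out_w, rng):
    """Cut a random out_h x out_w window out of the image."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - out_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - out_w)
    return [row[left:left + out_w] for row in img[top:top + out_h]]


def adjust_brightness_contrast(img, contrast, brightness):
    """Scale pixel values around the image mean, add an offset, clip to 0..255."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [
        [min(255, max(0, round(contrast * (p - mean) + mean + brightness)))
         for p in row]
        for row in img
    ]


def augment(img, crop_size, rng):
    """Apply one random rotation, crop, and brightness/contrast change."""
    out = rotate90(img, rng.randrange(4))
    out = random_crop(out, crop_size, crop_size, rng)
    contrast = rng.uniform(0.8, 1.2)     # assumed range, not from the study
    brightness = rng.uniform(-20, 20)    # assumed range, not from the study
    return adjust_brightness_contrast(out, contrast, brightness)


# Build a small synthetic 8x8 image and generate four augmented variants.
rng = random.Random(0)
image = [[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
augmented = [augment(image, 6, rng) for _ in range(4)]
```

For a detection dataset, the bounding-box labels would need the same rotation and crop applied to them; this sketch transforms the image only.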