In recent years, deep neural networks (DNNs) have achieved high performance on various tasks, such as human activity recognition (HAR), owing to their end-to-end training from input data to output labels. However, the performance of DNNs depends heavily on the availability of large-scale training data. In this paper, we propose a novel dataset for HAR in which the labels specify working environments (WEs). Our proposed dataset, named HARWE, includes multiple signal modalities (visual, audio, inertial-sensor, and biological signals) acquired using four different electronic devices. Furthermore, HARWE is collected from a large number of participants under realistic disturbances that can occur in the wild. HARWE is context-driven: several of its labels are correlated with one another yet differ in context. A conventional deep multi-modal neural network achieves accuracies of 99.06% and 68.60% on the easy and difficult settings of our dataset, respectively, which indicates its applicability to human activity recognition.
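The abstract reports results from a conventional multi-modal network without detailing its architecture. As a minimal illustrative sketch (not the paper's actual model), a common baseline for combining the four modalities is late fusion: each modality is embedded into a feature vector, the vectors are concatenated, and a linear softmax head predicts the working-environment label. All dimensions, the number of classes, and the fusion scheme below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embedding sizes (not specified in the abstract).
dims = {"visual": 128, "audio": 64, "inertial": 32, "biological": 16}
n_classes = 5  # placeholder number of working-environment labels

def fuse_and_classify(features, W, b):
    """Late fusion: concatenate per-modality embeddings, then a linear softmax head."""
    # Sort modality names so the concatenation order is deterministic.
    x = np.concatenate([features[m] for m in sorted(features)])
    logits = W @ x + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

d_total = sum(dims.values())
W = rng.standard_normal((n_classes, d_total)) * 0.01
b = np.zeros(n_classes)

# One synthetic sample: a random embedding per modality.
features = {m: rng.standard_normal(d) for m, d in dims.items()}
probs = fuse_and_classify(features, W, b)
```

In practice each embedding would come from a modality-specific encoder (e.g. a CNN for the visual stream), but the fusion step itself reduces to this concatenate-then-classify pattern.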