One issue is the sampling problem in each learning iteration. If the training data contains a large proportion of noisy instances and the CNN-LSTM learns predominantly from them, the model will over-fit the training data, yielding lower accuracy on unseen test data (predictive accuracy). Conversely, if the CNN-LSTM learns from only a small portion of the training data and neglects the rest, the model will under-fit, yielding lower accuracy on both training and test data. Using a validation set or cross-validation can help manage this issue by balancing over-fitting against under-fitting.
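For instance, a validation split combined with early stopping is one common way to strike this balance. The following is a minimal sketch assuming a Keras-style CNN-LSTM; the architecture, input shape, and synthetic data are illustrative placeholders, not the exact model discussed here.

```python
# Sketch: validation split + early stopping to balance over-/under-fitting.
# The layer sizes and synthetic data below are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

# Synthetic sequence data: 1000 samples, 50 time steps, 8 features.
X = np.random.rand(1000, 50, 8).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

model = models.Sequential([
    layers.Input(shape=(50, 8)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Early stopping monitors validation loss: training halts once the model
# starts memorizing noise (over-fitting), while enough epochs are allowed
# that the model does not under-fit.
stopper = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100,
          batch_size=32, callbacks=[stopper])
```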
Another option for obtaining a model with a smooth fit is for the learning approach to learn from all training instances in each learning iteration until it achieves good accuracy on the training data and, consequently, reasonable accuracy on test data that shares the training data's characteristics. However, to the best of my knowledge, machine learning approaches generally do not include all training instances in each learning iteration; instead, they use bootstrap sampling (as in boosting or bagging) in each iteration for faster runtime.
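To make the contrast concrete, the sketch below compares the two regimes: including every instance in each iteration versus bootstrap sampling with replacement, which on average covers only about 63.2% of the unique instances per iteration. The dataset size and iteration count are illustrative assumptions.

```python
# Sketch: full-batch learning (every instance in every iteration) versus
# bootstrap sampling with replacement as used in bagging/boosting.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))   # 500 training instances, 10 features
n = len(X)

for iteration in range(3):
    # Full-batch regime: the learner sees all n instances this iteration.
    full_batch_idx = np.arange(n)

    # Bootstrap regime: sample n instances with replacement; on average
    # only ~63.2% of the unique instances appear, so the rest are
    # neglected in this iteration (faster per-iteration cost).
    bootstrap_idx = rng.choice(n, size=n, replace=True)
    unique_frac = len(np.unique(bootstrap_idx)) / n
    print(f"iter {iteration}: bootstrap covers {unique_frac:.1%} "
          f"of the training instances")
```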
Combining two or three datasets is even more challenging because the datasets may be gathered with different sensors, or, even when the sensors are the same, the quality of the recordings may differ. As a result, the datasets do not work well together without preprocessing.
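One common mitigation, shown in the sketch below, is to standardize each dataset separately before concatenating them, so that differences in sensor scale and noise level do not dominate learning. This per-dataset z-scoring is a technique I am naming here for illustration, and the two synthetic datasets are assumptions standing in for recordings from different sensors.

```python
# Sketch: harmonize heterogeneous sensor datasets by standardizing each
# one separately (zero mean, unit variance per feature) before combining.
import numpy as np

rng = np.random.default_rng(0)
# Dataset A: low-noise sensor; Dataset B: different scale, noisier.
data_a = rng.normal(loc=0.0, scale=1.0, size=(300, 8))
data_b = rng.normal(loc=50.0, scale=10.0, size=(300, 8))

def standardize(data):
    """Z-score each feature column of one dataset."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

# Harmonize each dataset to a common scale, then concatenate.
combined = np.concatenate([standardize(data_a), standardize(data_b)])
print(combined.mean(axis=0).round(2))  # ~0 per feature
print(combined.std(axis=0).round(2))   # ~1 per feature
```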