Without knowing the nature of the input data, I can only offer these generic suggestions. This behaviour can occur when the training and validation sets are not properly partitioned or not randomized. Several factors could be at play here. My suggestion is first to explore the dataset properly to understand the stratifications that exist in the data. Then, tune the sizes of the training and validation sets when testing the model. Neural networks should be used cautiously, since they can be over-trained or under-trained unintentionally when no attention is paid to the training set size. The other parameters that should be investigated are the momentum and the learning rate: if the learning rate is too high, that alone can degrade the performance of the model.
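To make the "explore the stratifications" advice concrete, here is a minimal sketch of a stratified train/validation split in plain Python (the function name and the toy labels are my own, not from the question); each class contributes the same fraction of its samples to the validation set:

```python
import random
from collections import defaultdict

def stratified_split(X, y, val_frac=0.2, seed=0):
    """Split indices so each class keeps roughly the same proportion in train/val."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train_idx, val_idx = [], []
    for label, idxs in by_class.items():
        rng.shuffle(idxs)
        n_val = max(1, int(round(len(idxs) * val_frac)))
        val_idx.extend(idxs[:n_val])      # held-out validation indices
        train_idx.extend(idxs[n_val:])    # remaining indices for training
    return train_idx, val_idx

# Toy imbalanced labels: 8 samples of class 0, 2 of class 1
y = [0] * 8 + [1] * 2
train_idx, val_idx = stratified_split(list(range(10)), y)
```

In practice you would use a library routine (e.g. scikit-learn's `train_test_split` with its `stratify` argument) rather than rolling your own, but the idea is the same.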
As you said, my target values are not balanced, so I am dealing with somewhat imbalanced data, but K-Fold cross-validation works quite sensibly and well on this dataset. The issue only arises when I use the plain data-splitting method.
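If K-Fold behaves well but a single split does not, a quick diagnostic is to compare the label frequencies on each side of the split; a sketch (the helper name and the toy labels are illustrative):

```python
from collections import Counter

def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in counts}

# Hypothetical split where the validation set is far more balanced than training
y_train = [0] * 70 + [1] * 10
y_val = [0] * 10 + [1] * 10
print(class_proportions(y_train))  # {0: 0.875, 1: 0.125}
print(class_proportions(y_val))    # {0: 0.5, 1: 0.5}
```

A large mismatch like this means the split was not stratified, and validation accuracy will not reflect training performance.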
The result you are getting in the first epochs is not a valid accuracy. Your dataset may not be carefully organized, so the model is getting odd inputs and producing random results.
In my opinion this is because the loss function, which is generally not convex for neural networks, reached a sub-optimal local minimum within a convergence region. Subsequently, after 100 epochs the loss function escaped that region and moved to a region with a lower error, which corresponds to an increase in the training accuracy. However, the aforementioned behaviour depends on the particular optimizer used to update the weights, such as stochastic gradient descent, RMSProp, Adam, and so on. Additionally, it strongly depends on the learning rate value. In fact, it is possible to achieve similar results earlier (in fewer epochs) by slightly increasing the learning rate. Note that selecting too high a value for the learning rate can cause the loss function to overshoot the minimum.
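The effect of the learning rate on convergence and overshoot can be seen on a toy convex problem (minimising f(x) = x², whose gradient is 2x); this is a sketch of my own, not the asker's model:

```python
def gradient_descent(lr, steps=100, x0=5.0):
    """Minimise f(x) = x^2 with fixed-step gradient descent; gradient is 2x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # each step multiplies x by (1 - 2*lr)
    return x

small = gradient_descent(lr=0.01)               # converges, but slowly
large = gradient_descent(lr=0.9)                # oscillates yet still converges
diverging = gradient_descent(lr=1.1, steps=20)  # |1 - 2*lr| > 1: overshoots and diverges
```

With `lr=0.01` the iterate is still far from zero after 100 steps, with `lr=0.9` it has essentially reached the minimum, and with `lr=1.1` each step overshoots so badly that the iterate grows without bound; the same trade-off governs neural-network training, just without the clean closed-form analysis.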
Moreover, I think that accuracy is a misleading metric for classification tasks, especially on imbalanced datasets, because you may run into the accuracy paradox: the accuracy value is high even though the classification model performs poorly. In that case, the model has learned that, in order to minimise the loss function, it should classify every record as belonging to the over-represented class. In such cases, in order to get a clear view of how the model is performing, I suggest that you:
Compute and interpret the confusion matrix;
Plot the precision/recall curve;
Plot the ROC curve;
Compute the F1 score.
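The accuracy paradox and the value of these extra metrics can be demonstrated with a few lines of plain Python (the helper function and the toy data are mine, for illustration); in practice you would use `sklearn.metrics`:

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts plus precision, recall and F1 for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# A "majority class" model on 90/10 imbalanced data: it predicts 0 for everything
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
m = binary_metrics(y_true, y_pred)  # accuracy is 0.9, yet precision/recall/F1 are all 0
```

The 90% accuracy looks respectable, while the zero F1 score immediately reveals that the model never detects the minority class.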
Finally, one way to deal with imbalanced datasets is to remove the imbalance between classes through Random Under-Sampling or Random Over-Sampling techniques, although both have drawbacks.
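As a minimal sketch of Random Over-Sampling (the function name and toy data are mine; the imbalanced-learn library provides production-ready versions of both techniques), minority-class samples are duplicated at random until all classes have the same size:

```python
import random

def random_over_sample(X, y, seed=0):
    """Duplicate minority-class samples at random until all classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())  # size of the largest class
    X_out, y_out = [], []
    for label, samples in by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        for xi in samples + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

X = list(range(10))
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_over_sample(X, y)  # both classes now have 8 samples
```

The drawback mentioned above is visible here: over-sampling only repeats existing minority examples (risking overfitting to them), while under-sampling would instead throw away majority-class information.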