I have already implemented the preprocessing steps for stopwords, punctuation, etc., but the model gives low validation accuracy (around 0.7), so I wonder whether also implementing stemming and lemmatization would increase it.
Of course, I would have to do it manually, since I'm working on the Tunisian dialect and there are no libraries already available for it.
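For context, here is a minimal sketch of the kind of preprocessing described above. It assumes a hand-compiled stopword list, since no standard resource covers the dialect; TUNISIAN_STOPWORDS and its entries are hypothetical placeholders:

```python
import string

# Hypothetical placeholder: no standard stopword list exists for the
# Tunisian dialect, so this set would have to be compiled manually.
TUNISIAN_STOPWORDS = {"w", "fi", "el", "ya"}  # illustrative entries only

def preprocess(text: str) -> list[str]:
    # Strip punctuation (ASCII plus common Arabic punctuation marks).
    text = text.translate(str.maketrans("", "", string.punctuation + "،؛؟"))
    # Lowercase Latin script and split on whitespace.
    tokens = text.lower().split()
    # Drop stopwords.
    return [t for t in tokens if t not in TUNISIAN_STOPWORDS]

print(preprocess("el youm, famma barcha khedma!"))
# ['youm', 'famma', 'barcha', 'khedma']
```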
I don't think it's a good idea to implement a lemmatizer or a stemmer for the Tunisian dialect, or for any other dialect, since most of the text doesn't respect linguistic rules. The Maghrebi Arabic dialects are a mixture of Arabic, French, and other languages. You could try to collect a large amount of data and train a BERT model from scratch.
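To make that concrete, here is a rough sketch of pretraining a BERT masked-language model from scratch with the Hugging Face transformers, tokenizers, and datasets libraries. The corpus path corpus.txt and all hyperparameters are illustrative assumptions, not a tested recipe:

```python
import os
from tokenizers import BertWordPieceTokenizer
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

# 1. Train a WordPiece vocabulary directly on the raw dialect corpus
#    ("corpus.txt" is a hypothetical one-sentence-per-line text file).
os.makedirs("tn-bert", exist_ok=True)
wp = BertWordPieceTokenizer()
wp.train(files=["corpus.txt"], vocab_size=30_000)
wp.save_model("tn-bert")  # writes tn-bert/vocab.txt
tokenizer = BertTokenizerFast("tn-bert/vocab.txt")

# 2. Build a small BERT (illustrative sizes) and tokenize the corpus.
config = BertConfig(vocab_size=30_000, num_hidden_layers=6,
                    hidden_size=512, num_attention_heads=8)
model = BertForMaskedLM(config)

dataset = load_dataset("text", data_files="corpus.txt")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# 3. Masked-language-modeling pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
trainer = Trainer(model=model,
                  args=TrainingArguments("tn-bert", num_train_epochs=3,
                                         per_device_train_batch_size=32),
                  data_collator=collator, train_dataset=dataset)
trainer.train()
```

In practice the corpus size and model size matter far more than these settings; with a small corpus a model this size will not learn useful representations.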
No, training a BERT model will provide you with contextualized word representations, which could lead to good results when it is fine-tuned for other tasks. However, you will need a huge amount of data to build the model.
Semeh Ben Salem They are similar, except that lemmatization keeps word-related information
such as PoS tags. It is difficult to answer the question without experimentation, as the dataset and the specific task to be solved need to be considered. I advise you to try both and then decide. If you cannot afford the computational cost of that, I advise you to use lemmatization, as in my experience it has more impact. Moreover, it is more widely used by NLP researchers. Once again, you should note that everything depends on your data and your specific task.
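To illustrate the difference (in English, since no stemmer or lemmatizer exists for the Tunisian dialect), here is a quick NLTK sketch: the stemmer chops suffixes by rule, while the lemmatizer maps each word to a dictionary form and can exploit its PoS tag:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stem = PorterStemmer().stem
lemma = WordNetLemmatizer().lemmatize

print(stem("studies"), lemma("studies"))           # studi study
print(stem("better"),  lemma("better", pos="a"))   # better good
print(stem("running"), lemma("running", pos="v"))  # run run
```

Note how "studi" is not a real word (stemming only truncates), while the lemmatizer, given the adjective tag, can map "better" all the way to "good".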