I want to build a corpus to test a language identification system. Can you suggest sources where I can collect textual data for these languages written in Arabic script:
Sketch Engine contains corpora for Arabic, Punjabi, Persian, Urdu and Malay. You can build corpora for the remaining languages yourself using the WebBootCAT functionality in Sketch Engine. If you have any questions, just ask the Sketch Engine support team (they could also build the corpora for you, but there would be some costs involved).
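If you build the corpora from web pages yourself, the downloaded text will usually contain lines in other scripts (menus, English boilerplate and so on). A rough sketch of one way to filter them out, assuming your data is a list of plain-text lines: keep a line only if most of its letters are Arabic-script characters. The function names and the 0.8 threshold are my own illustrative choices, not part of Sketch Engine or WebBootCAT.

```python
import unicodedata


def arabic_script_ratio(text):
    """Return the fraction of letters in `text` whose Unicode name starts with ARABIC."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)


def keep_arabic_script(lines, threshold=0.8):
    """Keep only lines that are mostly Arabic script (threshold is an assumed value)."""
    return [line for line in lines if arabic_script_ratio(line) >= threshold]
```

Note that this only checks the script, not the language: Persian, Urdu and Arabic lines all pass, so you still need a reliable source or manual checking to label each line with its language.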
The languages listed are among the few that still use the Arabic script. Formerly, however, the Arabic script was used for writing many other languages, including Turkish, Albanian, Bosnian, Spanish, Swahili and Chinese. Including such diverse languages would likely add to the value of the project.