I am trying to bring corpus linguistic methodologies into the study of language maintenance and/or shift, instead of using traditional methods such as questionnaires, interviews, etc.
As Landweer (2000) maintains, languages, like people, may be born or die. Language shift, the replacement of one language by another, may be primarily caused by the means of communication and/or socialization that bilinguals or multilinguals experience in their everyday lives. According to Fishman (1991), the basic tenet underlying language maintenance is the so-called intergenerational transmission of language, that is, the extent to which parents or the community at large try to preserve the ethnic language. On the other hand, language shift may be brought about by several factors, including national language policies, speakers' educational levels, language use in different domains, attitudes towards the contact language, etc. For a more detailed account of the issue, I refer you to Mufwene (2008).
Studying language shift through documented language use is obviously better than relying on questionnaires etc., but it presupposes some language tools that may not be available for all languages. I am speaking of general, multipurpose language tools here.
Ideally, you need (1) a corpus covering the period you wish to study (from before to after), and (2) a digitised speller, i.e. some kind of full form generator documenting the standard. You can then run the full form generator against different time sections of the corpus, look at what you get, and refine the comparison from there.
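To make the idea concrete, here is a minimal sketch in Python of running a full form list against time sections of a corpus. It is only an illustration under assumptions: the full form list is taken to be a plain set of word forms, the corpus a list of (year, text) pairs, and the names `tokenize` and `compare_by_period`, the naive tokenizer, and the ten-year periods are all hypothetical choices, not an existing tool.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of letters only.
    return re.findall(r"[^\W\d_]+", text.lower())


def compare_by_period(documents, full_forms, period_length=10):
    """Group tokens into periods and check them against the full form list.

    documents  -- iterable of (year, text) pairs
    full_forms -- set of lowercase word forms documenting the standard
    """
    by_period = defaultdict(Counter)
    for year, text in documents:
        start = year - year % period_length  # e.g. 1995 -> 1990
        by_period[start].update(tokenize(text))

    report = {}
    for start, counts in sorted(by_period.items()):
        in_standard = sum(n for w, n in counts.items() if w in full_forms)
        outside = sorted(w for w in counts if w not in full_forms)
        report[start] = {
            "tokens": sum(counts.values()),
            "standard_tokens": in_standard,
            "nonstandard_forms": outside,
        }
    return report


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    forms = {"bok", "boka", "boken", "bøker", "er", "ny"}
    docs = [(1951, "Boken er ny"), (1995, "Boka er ny, boka er kul")]
    for period, row in compare_by_period(docs, forms).items():
        print(period, row)
```

The forms that fall outside the list in each period are the ones worth inspecting by hand; they may be loans, spelling variants, or simply gaps in the generator.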
There are a number of difficulties which will need to be dealt with, the first one being the size of the corpus. For Norwegian we have both the full form generator and a corpus of more than 100 million tokens, covering the period 1866-2015. But the full form generator would be useless on the text from before 1940, because the orthography before 1940 is too variable and the amount of text too small.
Then you should ideally have comparable text selections from the different time periods. But our corpus has virtually no newspaper text from before 1998, while newspaper text dominates after 1998. This is because early newspaper text has to be keyed in manually, and keying costs money.
We have run the sort of comparison I described above on the post-1940 part of the corpus. We were faced with some results which I believe would turn up for most languages, namely:
Half the possible word forms from the full form generator were not found in the corpus. (Norwegian is a moderately inflected, compounding language, with eight possible forms per noun.)
Half of the word forms from the full form generator that did occur in the corpus occurred only once.
Half the distinct word forms (types) in the corpus occurred once only.
Ca. 350 distinct word forms accounted for 50 % of the corpus token mass (prepositions, conjunctions, etc.).
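For anyone who wants to check the same four figures on their own material, the following Python sketch shows one way to compute them. It assumes the corpus is already tokenized into a list of word forms and the full form list is a set; the function name `coverage_stats` and the toy data are hypothetical, and "type" here simply means a distinct word form.

```python
from collections import Counter


def coverage_stats(tokens, full_forms):
    """Compute the four figures for a token list and a full form list."""
    counts = Counter(tokens)
    if not counts or not full_forms:
        raise ValueError("need a non-empty corpus and a non-empty form list")
    total = sum(counts.values())

    # Forms from the generator that occur at least once in the corpus.
    attested = [w for w in full_forms if w in counts]
    unattested_share = 1 - len(attested) / len(full_forms)

    # Of the attested forms, the share that occur exactly once.
    once = sum(1 for w in attested if counts[w] == 1)
    attested_once_share = once / len(attested) if attested else 0.0

    # Share of all corpus types that occur exactly once.
    hapax_share = sum(1 for n in counts.values() if n == 1) / len(counts)

    # How many of the most frequent types are needed to reach half the tokens.
    running = 0
    types_for_half = 0
    for _, n in counts.most_common():
        running += n
        types_for_half += 1
        if running * 2 >= total:
            break

    return {
        "unattested_share": unattested_share,
        "attested_once_share": attested_once_share,
        "hapax_share": hapax_share,
        "types_for_half": types_for_half,
    }


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_tokens = "i og i på i og bok boka".split()
    toy_forms = {"bok", "boka", "boken", "bøker"}
    print(coverage_stats(toy_tokens, toy_forms))
```

On a real corpus the result depends heavily on tokenization and case handling, so the figures are only comparable across periods if the same preprocessing is used throughout.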
After this possibly discouraging comment, I would still encourage collecting actual language as a corrective to questionnaires etc. Ideally, such materials should be formatted and saved as part of a larger text corpus.