I am trying to develop a parser for Hindi, which is a relatively free word order language. Which formalism should I use: TAG, dependency grammar, HPSG, or some other? Kindly suggest.
Hi, you should check out the work of Rajeev Sangal and colleagues; they built several statistical dependency parsers for Hindi. There was also a Hindi (and Urdu) parsing shared task in 2010, at ICON 2010 (not sure of the spelling though). Building a large-coverage hand-written grammar is probably not what I'd recommend (even though working on metagrammars would be quite cool).
I would suggest Constraint Grammar (http://en.wikipedia.org/wiki/Constraint_grammar); it routinely gives results between 95% and 99% for POS tagging, grammatical function assignment, and dependency. Input should be a morphological analysis (all possible analyses of each wordform), which for Hindi is not too hard (and much work has been done already).
Aleksandre Maskharashvili already pointed you to TRALE. Some time ago I worked with Shravan Vasishth on a Hindi grammar. As far as I know, Hindi is not as free as Warlpiri, so you probably do not need discontinuous constituents as are allowed in Michael Daniel's parser and in one of the TRALE parsers. The kind of freeness you need can be achieved easily in HPSG by allowing a head to combine with its arguments in any order. This is described in this intro to HPSG:
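To make the idea concrete, here is a minimal sketch (my own toy illustration, not real HPSG or TRALE code, and not Hindi-specific) of a head carrying a multiset of outstanding argument requirements, any one of which can be discharged at each combination step:

```python
# Toy sketch: a head combines with its arguments in any order by keeping
# a multiset of outstanding argument requirements and discharging any
# matching one at each combination step.
from collections import Counter

def combine(head, arg_category):
    """Return a new head with one matching requirement discharged,
    or None if the argument matches no outstanding requirement."""
    cat, needed = head
    if needed[arg_category] == 0:
        return None  # e.g. a second Nom when only one is expected
    remaining = needed.copy()
    remaining[arg_category] -= 1
    return (cat, remaining)

# A ditransitive verb still needing Nom, Dat and Acc arguments:
verb = ("V", Counter({"Nom": 1, "Dat": 1, "Acc": 1}))

# The arguments can be consumed in any surface order:
for order in (["Nom", "Dat", "Acc"], ["Acc", "Nom", "Dat"]):
    state = verb
    for arg in order:
        state = combine(state, arg)
    assert sum(state[1].values()) == 0  # fully saturated either way
```

The point is only that saturation is order-independent; a real HPSG implementation would of course also handle case assignment, agreement, and semantics.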
I have no idea how you might parse free word order languages. But still.
If you intend to parse it, you must assume that it has a syntax: sentence-formation rules or principles that are describable in some way. So the first thing you need is a formal definition of what that syntax is, of the phenomena to be expected and recognized. You must also have a way of describing the syntactic structure of a sentence, that is, of the result you would expect from the parsing process. And you should do this, as much as possible, without making assumptions about how you may obtain the parsed structure from the sentence: you have to define the what, not the how. Giving a generative view is quite all right.
Once you have all that, you can put it on the table and discuss the how: the most appropriate way to achieve the parsing process, either with some new algorithmic technique or by encoding the syntax into some existing formalism.
As an aside, I would point out that "Early style" parsers do not exist (AFAIK, not with that spelling), and I must assume that the intended name is Earley. Indeed, Earley did his work very early: as the story goes, when he went for his first teaching position (Berkeley, I think), he was too young to rent a car to drive from the airport. But however early he did his research, it did not change his name.
This said, I am not sure what is meant here by "Earley style", or how close it is supposed to be to the actual algorithm designed by Jay Earley to parse context-free languages. Or is the expression intended to cover all algorithms also known as chart parsing, tabular parsing, or dynamic programming parsing (the names vary), of which Earley's algorithm is one example, but which also include the CKY algorithm and many other published algorithms, with variants for formalisms other than context-free grammars?
Though it is a remarkable achievement, Earley's original algorithm is probably more complex than needed for the task intended here.
However, speaking from memory (meaning that this has to be checked, as I have not looked at it for quite a while), I am not sure tabular techniques can be naturally extended to truly free word order. The problem is that free word order can allow exponential combinatorics when trying to group constituents together, which is what tabular parsers attempt to do. These parsers remain polynomial when there is some word order constraint, such as that imposed in mildly context-sensitive formalisms. It is a matter of subsets (exponential) as opposed to substrings (polynomial).
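The subsets-versus-substrings point can be made concrete with a back-of-the-envelope count (a sketch of my own, just comparing the number of possible chart cells under the two indexing schemes):

```python
# A chart indexed by (start, end) substring positions of an n-word
# sentence has O(n^2) cells, while a chart indexed by arbitrary subsets
# of word positions (as needed for fully free word order) has 2^n - 1.
def substring_cells(n):
    # one cell per pair 0 <= i < j <= n
    return n * (n + 1) // 2

def subset_cells(n):
    # one cell per non-empty subset of the n word positions
    return 2 ** n - 1

for n in (5, 10, 20, 40):
    print(n, substring_cells(n), subset_cells(n))
```

Already at 40 words the subset count is in the trillions, which is the exponential blow-up being discussed.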
But a free word order language must have other structures constraining it. Much depends on what they are, on how they are formally expressed.
Tabular techniques can be used, and depending on the way your grammar is set up you can even use the standard tabular methods. For instance, there have been analyses that suggested a lexical treatment of constituent order: instead of having just one lexical item that can combine with Nom, Dat, and Acc, one would have six lexical items, one for each constituent order. Or one could assume a GPSG-like analysis, in which the order of the elements on the right-hand side of a grammar rule is not fixed. The GPSG rules can be translated into context-free rules, and hence standard parsing techniques can be applied. See Uszkoreit (1987) for a treatment of German free constituent order in GPSG, and other papers by him from that era for lexical analyses.
As for CC-by-sa: I like this statement, but I think the terms and conditions of ResearchGate are what matters here. You had better check them.
Our two views may not be at odds. But, as I said, I am pretty far from these issues today, and I am still not fully convinced by your explanation. I do not have access to Uszkoreit's CSLI book of 1987, and it would be long reading anyway. However, what I get from the abstract is that it is pretty much constituent structure with some extensions to "this framework to permit the adequate treatment of partially free word order as it occurs in German and probably to some degree in all natural languages."
This is more or less what you suggest by not fixing "the order of the elements on the right-hand side of a grammar rule". It certainly increases the freedom of word order, but it is by no means fully free word order. I do not know enough about languages with a freer word order than German. I once read a paper (from an ACL annual meeting in the early nineties, I think) that used a very different parsing technique for a language with very free word order ... but I do not recall anything precise.
The tabular techniques we are considering depend on a constituent structure built by associating substrings. I do not think they can be used adequately if the word order is totally free: when there is no word order, you build constituents by associating subconstituents taken from a set rather than from a string, and that points to exponential complexity. But there may be other ways of structurally limiting the possible associations (I think that was one aspect of the paper I have forgotten).
Regarding licensing CC-by-sa ...
I do not see that my CC-by-sa licence is in contradiction with the Terms and Conditions of ResearchGate, and it does make clear that I allow reuse of my texts. The ResearchGate statement that information must not be reused is of course superseded by anything I may say as the exclusive owner of the copyright. And no statement in the Terms and Conditions forbids me from adding such a licence.
This linearization theory allows arbitrary discontinuous constituents and it has been used by Cathryn Donohue and Ivan Sag to analyze Warlpiri, a language with really free constituent order.
In the research report that I cited above, Reape discusses various implementations of parsers. Some are chart parsers. Instead of using the beginning and end positions of a substring, he uses bit vectors to represent the string positions that are filled. It's cool, check it out.
I implemented a bottom-up chart parser using bit vectors in the Babel system:
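To give a flavour of the bit-vector idea, here is a toy sketch (my own reconstruction of the general technique, not code from Babel or from Reape's report): chart items record which word positions they cover as a bit mask, so two items can combine whenever their masks are disjoint, regardless of whether they are adjacent in the string. The words and categories are illustrative only.

```python
# Toy bottom-up chart parser over bit vectors: an item (category, mask)
# covers exactly the word positions whose bits are set in mask.
def parse(words, lexicon, rules):
    """lexicon: word -> category; rules: (child_a, child_b) -> parent,
    applied order-insensitively. Returns the set of (cat, mask) items."""
    chart = {(lexicon[w], 1 << i) for i, w in enumerate(words)}
    agenda = list(chart)
    while agenda:
        cat_a, mask_a = agenda.pop()
        for cat_b, mask_b in list(chart):
            if mask_a & mask_b:
                continue  # overlapping coverage: cannot combine
            for pair in ((cat_a, cat_b), (cat_b, cat_a)):
                parent = rules.get(pair)
                if parent:
                    item = (parent, mask_a | mask_b)
                    if item not in chart:
                        chart.add(item)
                        agenda.append(item)
    return chart

# Made-up mini-lexicon: the verb picks up its arguments in any order.
lexicon = {"ram": "Nom", "sita": "Acc", "dekhta": "V"}
rules = {("V", "Nom"): "V'", ("V'", "Acc"): "S"}
items = parse(["sita", "ram", "dekhta"], lexicon, rules)
full = (1 << 3) - 1  # mask with all three positions covered
assert ("S", full) in items  # sentence found despite non-canonical order
```

Note that nothing here enforces adjacency, which is exactly why the number of possible items is bounded by categories times 2^n rather than times n^2.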
Don't worry about the spelling of my first name, nearly all Germans do that. I am used to it and amused by it. And I sometimes retaliate in kind, just for fun.
Regarding the parsing problem, I do thank you for the annotated bibliography. I only glanced through it, because knowledge of linguistic theory is not exactly my strongest point. But it does give some clarification, especially your paper on the Babel system, together with your comments here.
From what I gathered from these sources (correct me if I am wrong), you use a form of dynamic programming that is similar to bottom-up parsing algorithms (is it similar to plain CKY, or does it include some improvements on it?), but with bit vectors rather than pairs of string indices. Dynamic programming comes in all forms, and this can certainly be one of them. However, bit vectors will give you exponential complexity with respect to the length of the string (sentence) being analyzed. This may not be a problem in practice for your use of the system, and you would know that better than I.
Indeed, you mention this exponential complexity on page 7 of your report, with a reference to Mike Reape in footnote 22. That was the main point that worried me in my previous comments.
Now, I wonder whether this (somewhat significant) change is compatible with other variations on chart parsing and so-called Earley-style parsers. This may be the case, though the exponential complexity will remain, of course. Still, I would guess that the predictiveness of some algorithms does not make much sense when the word order is free.
One important property of the classical chart/Earley/tabular parsers for phrase structure grammars is that they are naturally compatible with the word lattices introduced in speech recognition. Would your parser for free word order languages parse word lattices, or be adaptable to do so?
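For concreteness, here is what I have in mind (a sketch of my own, with made-up words and categories): a word lattice can be given as the edges (from_state, to_state, word) of a DAG, and a classical chart parser adapts by indexing items with lattice states instead of string positions, combining items that meet at a common state.

```python
# Toy chart parser over a word lattice: an item (cat, p, q) spans the
# lattice states p..q; two items combine when they meet at a state.
def lattice_chart(edges, lexicon, rules):
    """edges: (p, q, word) triples; rules: (A, B) -> C.
    Returns the set of (cat, p, q) items."""
    chart = {(lexicon[w], p, q) for p, q, w in edges}
    changed = True
    while changed:
        changed = False
        for a, p, q in list(chart):
            for b, q2, r in list(chart):
                if q == q2 and (a, b) in rules:
                    item = (rules[(a, b)], p, r)
                    if item not in chart:
                        chart.add(item)
                        changed = True
    return chart

# Two competing recognition hypotheses for the middle word share the
# same states, so the parser explores both readings in one chart:
edges = [(0, 1, "the"), (1, 2, "bank"), (1, 2, "tank"), (2, 3, "closed")]
lexicon = {"the": "Det", "bank": "N", "tank": "N", "closed": "V"}
rules = {("Det", "N"): "NP", ("NP", "V"): "S"}
assert ("S", 0, 3) in lattice_chart(edges, lexicon, rules)
```

My question is whether the bit-vector scheme, which abandons spans between positions, still has a natural analogue of these lattice states.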
What I mean is that my comments do not aim at denying that you can use tabular techniques for free word order. However, I am wondering how close they remain to the more traditional tabular techniques developed for phrase structure grammars.