For further information, I also recommend the text "Mind Mapping and working memory: The mental semantic representation as a mediator between knowledge and knowing", published by Pensa, Lecce, Italy, 2012: http://www.pensaeditore.it/fcu.html
That's a big question, and an important one, as any effective cognitive architecture needs a model of self and other(s). It is perhaps easier to start by thinking about the person you are talking to, and modeling them: their linguistic, cultural, physical and life experience. A principle of cognitive science, "the charitable assumption", is that we start from the standpoint that the other is like us, except where we know, expect or recognize that they would not be like us. This minimizes the amount of information we need to transfer (we only tell them the surprising stuff that we wouldn't expect them to know).
So this model of the other needs a model of self as its starting point. Moreover, we use this model of self/other to evaluate what we say as we say it, and to repair it on the fly when it doesn't satisfy the expectations of the model (or perhaps to restart and have another chance at getting things clear, or to say "pun not intended", making a joke of the unintended ambiguity and letting the listener know that both interpretations happen to be reasonable).
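To make the idea concrete, here is a minimal sketch in Python of a model of the other that starts from the model of self and stores only the expected differences, so that only the surprising items need to be said. The attribute names, the example values and the helper `surprising_facts` are all hypothetical and chosen for illustration; this is one possible reading of the charitable assumption, not an established implementation.

```python
from collections import ChainMap

# Hypothetical model of self: what the speaker knows or takes for granted.
self_model = {
    "language": "English",
    "knows_peekaboo": True,
    "read_the_paper": True,
}

# Only the ways the listener is expected to differ from the speaker are stored.
known_differences = {"read_the_paper": False}

# Lookups fall through to self_model when no difference is recorded,
# i.e. the other is assumed to be like us unless we know otherwise.
other_model = ChainMap(known_differences, self_model)


def surprising_facts(message, listener_model):
    """Return only the items the listener is not already predicted to hold."""
    missing = object()  # sentinel for "listener has no expectation at all"
    return {
        key: value
        for key, value in message.items()
        if listener_model.get(key, missing) != value
    }


# Everything the speaker could say, versus what is actually worth saying.
message = {"language": "English", "read_the_paper": True, "meeting_day": "Tuesday"}
to_say = surprising_facts(message, other_model)
print(to_say)
```

Under these assumptions only `read_the_paper` and `meeting_day` are transmitted; the shared language is left unsaid because the model of the other already predicts it.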
This recognition model might be entirely separate from the synthesis model. In my 1989 book "Machine Learning of Natural Language" I discuss this in terms of Chris Turk's model of Anticipated Correction, and also connect it with neurolinguistic evidence, such as our understanding of Wernicke's and Broca's areas as language areas physically proximate to the hearing and speech areas respectively. Grammatical information is particularly important for establishing that the relationships between concepts are correctly marked, and hence is associated with this model of self/other. The model of self/other is also likely to be highly collocational, so when we say something that isn't quite right (in our first or second language) we get feedback from the model (which we may choose to use for repair, or ignore). More generally, it is likely to be strongly associated with event memory.
What we do as infants tends to be mirrored back to us by others (as in peekaboo games, or "she said 'mama', 'mama', 'mama'" repetitions) more than the other way round (infants copy their parents less than vice versa). This leads to the prediction that neurons associated with us doing something will be reinforced by the mirroring, thus predicting and explaining mirror neurons. As we get older the acknowledgement becomes more subtle, but mirroring is still an important aspect of it: repeating something back to say you have understood, nodding, reflecting gestures and expressions (this also relates to neurolinguistic programming).
But we are also forming associations with the doer of the actions, and these associations can have an indexing function. That is, things that 'I' have done but 'other' hasn't seen/acknowledged/mirrored are things to talk about. Things that are ubiquitous in the world/society/culture, that many people have been seen or heard to do, will involve generalised indexing by any person who meets some linguistic/cultural/age membership criterion.
So the answer to "what do we need to model?" is: everything that relates to our sensory/motor input/output, including our internal sensing/associating/generalizing/reasoning. But rather than duplicating the real language/world/society understanding model for the self model and the other model, a structure-sharing model based on generalised indexing would be parsimonious.
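As a rough illustration of structure sharing with generalised indexing, the Python sketch below stores each piece of knowledge once and indexes it either by specific holders or by a membership criterion. The class names `Agent` and `Entry`, the trait labels and the example entries are hypothetical; the point is only that the self model and the other model are views over one shared store rather than two copies.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    traits: set = field(default_factory=set)  # e.g. language, culture, age group


@dataclass
class Entry:
    """One piece of knowledge, stored once and indexed by who holds it."""
    content: str
    holders: set = field(default_factory=set)          # specific agent names
    required_traits: set = field(default_factory=set)  # generalised index

    def held_by(self, agent):
        if agent.name in self.holders:
            return True
        # An empty criterion matches nobody; otherwise every required
        # trait must be present in the agent's traits.
        return bool(self.required_traits) and self.required_traits <= agent.traits


def view(store, agent):
    """The model of an agent is just a filtered view of the shared store."""
    return [e.content for e in store if e.held_by(agent)]


def worth_telling(store, me, other):
    """Things 'I' hold that 'other' is not indexed as holding."""
    return [e.content for e in store if e.held_by(me) and not e.held_by(other)]


me = Agent("me", {"adult", "english_speaker"})
other = Agent("other", {"adult", "english_speaker"})

store = [
    Entry("people shake hands when they meet", required_traits={"adult"}),
    Entry("I visited Lecce last year", holders={"me"}),
    Entry("we both watched the talk", holders={"me", "other"}),
]

print(view(store, other))
print(worth_telling(store, me, other))
```

Here the culturally ubiquitous entry is indexed once by a criterion instead of being copied into every person's model, and the unmirrored personal event is the only thing flagged as worth talking about.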
When we model self, we have to think of the behaviors and activities of the self in various dimensions. Self is not an absolute cognitive state that acquires knowledge and associative skills on the spur of the moment; instead, it has an accumulated historicity in its learning processes. The mirroring nature of self is part of the broader activity pattern of reflexive elements in cognition. Here cognition and recognition should be understood in relation to each other. Thus, in addition to its reflexive nature, recognition is influenced by the mirroring self's absorption of social activity patterns. Hence self develops from a mirror into a filter of its own absorptive and reflexive capabilities. I have used the analogy of optics just to say that mirroring is not in itself the end of the development and role of self in the cognitive state and in further phases of recognition.
Samer, I think the answer depends on the kind of application you have in mind. The architecture may range from a complex model based on the human mind to a more affordable BDI (belief-desire-intention) architecture, where "self" is represented by internal knowledge of one's own goals and capabilities.
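A very small sketch of the BDI end of that range might look like the following Python. It assumes, purely for illustration, that beliefs are a set of facts, that each goal lists the capabilities it needs, and that acting simply records the goal as achieved; the names `Goal`, `BDIAgent`, `deliberate` and `act` are hypothetical and not taken from any particular BDI framework.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    name: str   # the fact the agent wants to make true
    needs: set  # capabilities required to pursue it


@dataclass
class BDIAgent:
    beliefs: set = field(default_factory=set)       # facts held to be true
    goals: list = field(default_factory=list)       # desires
    capabilities: set = field(default_factory=set)  # what the agent can do
    intentions: list = field(default_factory=list)  # goals committed to

    def deliberate(self):
        """Commit to goals not yet believed achieved and within own capabilities."""
        self.intentions = [
            g for g in self.goals
            if g.name not in self.beliefs and g.needs <= self.capabilities
        ]
        return self.intentions

    def act(self):
        """Placeholder for execution: record each intention as achieved."""
        for goal in self.intentions:
            self.beliefs.add(goal.name)
        self.intentions = []


agent = BDIAgent(
    goals=[
        Goal("greeted_user", {"speak"}),
        Goal("fetched_coffee", {"walk", "grasp"}),
    ],
    capabilities={"speak", "listen"},
)

chosen = [g.name for g in agent.deliberate()]
print(chosen)
agent.act()
print(agent.beliefs)
```

The "self" in this sketch is nothing more than the agent's own goals and capabilities: it declines the coffee goal because it knows it cannot walk or grasp, without any richer model of mind.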
Ancient Greek caryatids (which derive from similar Near Eastern structures) represent idealized feminine figures serving as columns to hold up temples. Girls from the Greek town of Karyae were considered beautiful, tall, strong, and capable of bearing healthy children. Later on, their male counterparts were Telamon and Atlas, archetypes of masculine strength. Hence the sculptor projected the ideal of his own sex onto his architecture. The ancients associated bodiliness with selfhood.
There appear to be two ways of dealing with the self: as a person and as an object. More about the cognitive structures involved may be found in the attached article.
Your answer is very interesting. Your physical metaphor describes a crucial phenomenon of the human mind and, I think, of any other mind too. We need a representation of the objects and acts in our surroundings in order to act in the world and to understand what is going on in it.
One of these represented objects is the actor itself.