As we know, AI has been a hot topic ever since its inception, yet it is hard to find concrete, useful information about it. How does ChatGPT work? Where can we get access to use it? And where will it lead us?
ChatGPT itself is very eager to explain its own architecture, at least the parts that have also appeared somewhere in papers. As an entry point, ask how text is split into tokens, how tokens are represented as word embeddings (large real-valued vectors), and how word positions are encoded. Another discussion theme is the Transformer architecture in general. I'm currently trying to grasp how this 'attention mechanism' leads to something useful.
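To make the attention part concrete for myself, here is a minimal numpy sketch of the scaled dot-product self-attention described in the "Attention is all you need" paper; the shapes and random inputs are only illustrative, and real models add learned query/key/value projections, multiple heads and masking.

# Minimal sketch of scaled dot-product self-attention; values are made up for illustration.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d) matrices of query/key/value vectors
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # how strongly each position attends to every other
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # each output is a weighted mix of all value vectors

# toy example: 4 tokens, 8-dimensional vectors standing in for embeddings + positional encodings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = attention(x, x, x)                # self-attention: every token looks at every token
print(out.shape)                        # (4, 8)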
As Joachim Pimiskern stated, the underlying technology is transformer neural networks. The original paper on transformers is [1]. The references for the training are [2][3][4][5][6]. Using these references as a starting point, you can follow their references and forward citations to get a comprehensive grasp of how ChatGPT works. I would also encourage you to look at BART, Facebook AI's encoder-decoder transformer, which plays a comparable role to models like ChatGPT (see [7] for details).
Note that I have not given you the background material needed to work up to transformers, which starts with recurrent neural networks and LSTMs as their precursors. This historical background is needed to fully appreciate and understand why the current architectures are designed the way they are.
References
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[2] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
[3] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., ... & Zaremba, W. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
[4] Neelakantan, A., Xu, T., Puri, R., Radford, A., Han, J. M., Tworek, J., ... & Weng, L. (2022). Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005.
[5] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., ... & Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33, 3008-3021.
[6] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744.
[7] De Bruyn, M., Lotfi, E., Buhmann, J., & Daelemans, W. (2020). BART for knowledge grounded conversations. Converse@KDD, 2666.
>>This historical background is needed to fully appreciate and understand why the current architectures are designed the way they are.
Funny. ChatGPT put it this way:
"As you become more familiar with the Transformer architecture, you'll appreciate the ingenuity of these design choices and the impact they've had on the field of NLP and other areas of machine learning. If you have any more questions or need further clarification, feel free to ask!"
You found the best possible way to get me to write more by comparing my answer to that of ChatGPT :-)
So here it goes:
The problem with neural networks without recurrent connections is that the input vectors carry no temporal relationship. The first attempt to solve this took the form of shift registers, where temporal information is processed by shifting the vector from left to right. This was done in [1], where each phoneme was predicted from a context window of seven letters. Subsequently, Waibel et al. [2] introduced the time-delay neural network, the precursor to convolutional neural networks (without dilation or pooling), to process 30 ms (phoneme-level) speech features.
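To make the shift-register idea concrete, here is a small sketch of my own (not code from [1] or [2]) of a NETtalk-style sliding window: the classifier only ever sees a fixed window of seven letters and predicts the phoneme of the centre letter, so temporal context comes purely from shifting the window, with no recurrence.

# NETtalk-style sliding-window input: 3 letters of left context, the centre letter, 3 of right context.
def seven_letter_windows(text, pad="_"):
    padded = pad * 3 + text + pad * 3
    for i in range(len(text)):
        window = padded[i:i + 7]       # fixed 7-letter window
        yield window, window[3]        # (input window, letter whose phoneme is predicted)

for window, centre in seven_letter_windows("hello"):
    print(window, "->", centre)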
These two advances did not have any recurrent connections. The recurrent-connection modifications came from three major works. The first was due to Jordan [3], where the output is fed back into the inputs. The second is due to Elman [4], where the hidden layer is fed back as context into the inputs. The third and final modification came from Watrous [5] with his temporal flow model.
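As a rough illustration of the difference between these feedback schemes (my own toy simplification of [3] and [4], with made-up dimensions): an Elman step feeds the previous hidden state back in, while a Jordan network would feed back the previous output instead.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 3
W_in  = rng.normal(size=(n_hid, n_in))
W_ctx = rng.normal(size=(n_hid, n_hid))   # Elman: context = previous hidden state
W_out = rng.normal(size=(n_out, n_hid))

def elman_step(x, h_prev):
    # the hidden layer sees the current input plus its own previous activation [4]
    h = np.tanh(W_in @ x + W_ctx @ h_prev)
    y = W_out @ h
    return y, h

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):      # a toy sequence of 5 input vectors
    y, h = elman_step(x, h)
# A Jordan network [3] would instead feed the previous output back,
# i.e. h = tanh(W_in @ x + W_fb @ y_prev) with W_fb of shape (n_hid, n_out).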
The major advancement in training such networks with backpropagation was Werbos' backpropagation through time [6]. The problem with this method is that the gradients vanish (or explode) after several unrolling steps, so the capacity to store previous sequences is limited. The major breakthrough to overcome this limitation was the introduction of the LSTM [7], which deals with the vanishing-gradient problem.
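To see why backpropagation through time struggles with long sequences, here is a toy numerical illustration of my own (not taken from [6] or [7]): the gradient reaching a step k positions back has been multiplied by the recurrent Jacobian k times, so it shrinks roughly geometrically; the LSTM's gating is designed to keep that product close to one.

import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))
W = 0.5 * W / np.max(np.abs(np.linalg.eigvals(W)))   # rescale so the spectral radius is 0.5

grad = np.ones(8)                          # gradient arriving at the final time step
for k in [1, 5, 10, 20, 50]:
    g = grad.copy()
    for _ in range(k):
        g = W.T @ g                        # one step of backpropagation through time
    print(k, np.linalg.norm(g))            # the norm decays roughly geometrically with k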
LSTMs dominated the scene until the transformer architecture was introduced. Transformers substantially eased the long-range memory limitations that remained with LSTMs, since attention lets every position access every other position directly instead of passing information through a recurrent state.
It is also worth mentioning that the best reference I have so far on the architecture configuration of neural translation models is [8], especially its Figure 1.
References
[1] Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1(1), 145-168.
[2] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328-339.
[3] Jordan, M. I. (1997). Serial order: A parallel distributed processing approach. In Advances in Psychology (Vol. 121, pp. 471-495). North-Holland.
[4] Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
[5] Watrous, R. L., & Shastri, L. (1987). Learning phonetic features using connectionist networks. The Journal of the Acoustical Society of America, 81(S1), S93-S94.
[6] Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550-1560.
[7] Graves, A. (2012). Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks (pp. 37-45). Springer.
[8] Schwenk, H. (2012, December). Continuous space translation models for phrase-based statistical machine translation. In Proceedings of COLING 2012: Posters (pp. 1071-1080).
Experts in the field say there is no artificial general intelligence yet; despite the hoopla, it is still not AGI. Generating a response is basically just matching frequencies in a very large collection of writing and responding accordingly.
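Taken literally, that frequency-matching picture corresponds to a classical n-gram language model; a toy bigram version is sketched below (my own illustration of the idea, not how ChatGPT is actually implemented, since ChatGPT uses learned transformer weights rather than raw counts).

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()   # stand-in for "a very large collection of writing"

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1                               # count how often nxt follows prev

def next_word_distribution(prev):
    c = counts[prev]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}          # relative frequencies as probabilities

print(next_word_distribution("the"))                     # {'cat': 0.666..., 'mat': 0.333...}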