I am undertaking research on Intelligence, Bias, and Origins: A Dialogical Analysis of AI–Human Epistemology and Evolutionary Discourse.
Profile of Respondent: A researcher in Computational Linguistics / Natural Language Processing.
Please introduce yourself when answering the questions.
The questions I need help answering are the following:
1. Corpus Selection a. How do you select and curate the raw text corpus used for LLM pre-training? b. What criteria determine “in” vs. “out” in filtered datasets (e.g., adult content, non-English sources)?
2. Emergent Reasoning a. What mechanisms give rise to chain-of-thought phenomena in transformer models? b. In your view, to what extent can these models perform genuine logical inference versus pattern matching?
3. Bias Propagation a. At which stages of training (pre-training, fine-tuning, RLHF) do you see the greatest amplification of gender or racial bias? b. Which technical interventions (e.g., debiasing layers, balanced sampling) show the most promise?
4. Interpretability & Limits a. How do you gauge a model’s “understanding” of a concept versus rote reproduction? b. What are the current best practices for auditing LLM outputs for hallucinations or factual errors?