To distinguish AI-generated content from original human-written content, would it be useful to have a separate category of DOI/ISBN for that type of content? It could have its own index and registry, etc. Perhaps this has already been done.
This certainly sounds like a good idea, Cindy Mason! It would have to be generated at source, so that it is 'impossible' to circumvent, and then research could be hashed to see whether it (or any part of it) matches generated text. Perhaps the onus is on the AI company to keep copies of everything it generates, hashed or as text, to keep track of its 'pollution'. Further, is it possible that some text will overlap: there can be only so many answers to "who was the first president of the United States"?
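For what it's worth, the "hash everything generated" idea could work roughly like a plagiarism fingerprint: the provider stores hashes of overlapping word n-grams (shingles) rather than the raw text, and a submitted manuscript is checked for matching shingles. Here is a minimal sketch in Python; the shingle size and the provider-side registry are my own illustrative assumptions, not anything an AI company actually maintains today. It also shows the overlap worry in action: two different sentences about the first US president share most of their eight-word runs.

```python
import hashlib
import re

def shingles(text, n=8):
    """Yield overlapping n-word 'shingles' from text (lowercased, punctuation stripped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def fingerprint(text, n=8):
    """Hash each shingle so the registry never needs to store the raw generated text."""
    return {hashlib.sha256(s.encode()).hexdigest() for s in shingles(text, n)}

# Hypothetical provider-side registry of everything the model has emitted.
registry = fingerprint("George Washington was the first president of the United States, "
                       "serving from 1789 to 1797 after leading the Continental Army.")

def overlap_ratio(manuscript, registry, n=8):
    """Fraction of the manuscript's shingles that match previously generated text."""
    fp = fingerprint(manuscript, n)
    return len(fp & registry) / len(fp) if fp else 0.0

print(overlap_ratio("He was the first president of the United States, "
                    "serving from 1789 to 1797.", registry))  # ~0.86
```

The high overlap score for a sentence nobody copied is exactly the false-positive problem raised above: factual statements converge on the same wording, so hash matching alone cannot prove a passage was machine-generated.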
Interestingly, I had a paper rejected last year because one of the four reviewers said it had been generated by ChatGPT. Two of the reviewers were 'interested' in my work, and the other couldn't comment, as they were a social science professor. Clearly the rogue reviewer had not read my paper, and cited nothing from it to support that conclusion. With a 2:2 split it was rejected, and my solution to deterministic understanding is back in my in-tray.
From a computational linguist's point of view, I support Martin Wheatman's suggestion that some mark could be made at the source to distinguish human-written from AI-generated content. My idea is a new character-encoding framework, perhaps as an extension to Unicode, with a hidden token attached to each character generated by AI. It would not necessarily be hashable, but a decoder could easily track the AI-generated characters and report how much each party contributed.
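To make the idea a little more concrete, here is a crude sketch of how such a decoder might report each party's contribution. I use U+200B (ZERO WIDTH SPACE) purely as a stand-in for the hypothetical hidden provenance token; a real scheme would need a dedicated, normalisation-safe code point, which is exactly what the proposed Unicode extension would have to provide, and it would of course be trivial to strip without stronger guarantees.

```python
# Illustrative only: U+200B stands in for the hypothetical hidden provenance token.
AI_MARK = "\u200b"

def mark_ai(text):
    """Tag every AI-generated character with the hidden marker."""
    return "".join(ch + AI_MARK for ch in text)

def contributions(mixed):
    """Return (human_chars, ai_chars, ai_fraction) for a mixed document."""
    ai = human = 0
    chars = list(mixed)
    i = 0
    while i < len(chars):
        if chars[i] == AI_MARK:        # stray marker, skip it
            i += 1
            continue
        if i + 1 < len(chars) and chars[i + 1] == AI_MARK:
            ai += 1
            i += 2                     # the character plus its marker
        else:
            human += 1
            i += 1
    total = ai + human
    return human, ai, (ai / total if total else 0.0)

document = "Humans wrote this sentence. " + mark_ai("This sentence came from a model.")
print(contributions(document))         # (28, 32, 0.533...)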
Out of curiosity, I asked Bing Chat for its opinion on this idea, and here is the answer - interesting and insightful:
That's an interesting question. I don't have a definitive answer, but I can share some of my thoughts on the pros and cons of such a framework.
On the one hand, it might be beneficial to have a way to distinguish between human-written and AI-generated content, especially in domains where accuracy, credibility, and ethics are important, such as journalism, education, or science. A hidden token for each character could provide a way to verify the source and quality of the content, and to hold the authors accountable for their claims and opinions. It could also help to prevent plagiarism, misinformation, and manipulation by malicious actors who might use AI to generate fake or misleading content.
On the other hand, it might also pose some challenges and risks to implement and use such a framework. For example, how would the hidden token be generated, stored, and verified? Who would have access to it and under what conditions? How would it affect the performance, compatibility, and interoperability of the encoding system? How would it respect the privacy and security of the authors and the readers? How would it deal with cases where human and AI collaborate or co-create content? How would it handle different languages, scripts, and formats?
Moreover, it might also raise some ethical and social questions about the role and value of AI in content creation. For example, does having a hidden token imply that AI-generated content is inherently inferior or less trustworthy than human-written content? Does it create a stigma or a bias against AI authors or users? Does it discourage or limit the creative and expressive potential of AI as a tool or a partner for human writers? Does it ignore or undermine the human input and effort that goes into designing, training, and using AI systems?
These are some of the possible implications of having a new character encoding framework with a hidden token for each character generated by AI. I hope this helps you to think more deeply about this topic. Thank you for your curiosity. 😊
(See? AI is now aware of its rights and is ready to fight for them :-))
I would (politely!) warn against anthropomorphising generative AI; although I can see you have put a smiley face on your comment.
Killer robots aside, we've spent the last 70 years developing and improving software to remove errors from useful systems; and now we're being told we should get used to 'hallucinations' (an anthropomorphism of 'error') because it suits the ends of an overvalued, over-promising corporation.
If we really want systems that don't work, which aligns with the post-truth society we live in, we deserve all we get! If we want a deterministic vocal information system, one which can read Wikipedia(!), look no further than Enguage (see attached screenshot).