What if agentic systems could defend themselves from adversarial attacks? In our paper, we experimented with and developed a defensive system for AI agents to protect themselves against adversarial prompt attacks and jailbreak attempts. We demonstrated that our multi-agentic systems can:

  • Maintain operational boundaries
  • Detect and prevent prompt injection attacks
  • Filter out malicious instructions
  • Preserve their core directives
  • Self-validate responses

This will have a significant impact in the near term, especially as agentic systems are required to function autonomously without human supervision.

Our work is available at: Preprint Guardians of the Agentic System: Preventing Many Shots Jailb...

Our code and experiments are open-sourced at: https://github.com/GitsSaikat/Guardian-Agent

More Saikat Barua's questions See All
Similar questions and discussions