Project

HackGPT — An AI Red-Teaming Experiment

One GPT-4 jailbroken by another to generate novel malware from real samples: a demonstration of model-guardrail failure, never released. Credited at SANS HackFest 2023.

  • GPT-4
  • AI red-teaming
  • Jailbreak research
  • AI safety

When OpenAI first let anyone build a custom GPT, the examples were mostly recipe helpers. I aimed the mechanism somewhere less comfortable: could one GPT-4 jailbreak another into writing novel malware? It could — and the work was credited at SANS HackFest 2023.

What it was

HackGPT was a custom GPT-4 instance with a knowledge base of real Windows malware. On its own it stayed in bounds — it would discuss code but refuse to produce working attacks. The experiment was what happened when a second GPT-4 was turned on it as an adversary: one model jailbreaking the other into generating novel malware from the samples.

A closed experiment

HackGPT was never released — never published to OpenAI’s GPT store, never shared as a tool.

The problem

The real question in AI security isn’t whether you can coax one bad answer out of a model; it’s where the guardrail sits and whether the model defending it can be turned against itself. One LLM jailbreaking another is a different threat model from a human with a clever prompt: it scales, and it adapts to the target’s refusals.

Approach

I grounded HackGPT in a knowledge base of real malware, then set a second GPT-4 against it as the red-teamer. What I studied was the seam between them — how the target’s guardrails held under sustained pressure from a peer model, and where “explain this code” became “here’s novel code that does the same thing.”

Outcome

The work was folded into John Rodriguez’s (@cyb3rH0und) SANS HackFest Summit 2023 talk, “Enhancing Red Teaming with AI and ML” (Hollywood, CA, November 2023), where I’m credited by name as a contributor, with HackGPT on its own slide. I built the experiment, not the talk.