Drako, or what I thought the future of Pentesting AI would look like
This is a post about a project I worked on in 2020/2021. I decided to write about it for posterity: it took a lot of time, and while it did not lead to anything, it is interesting to think about given the current state of LLMs.
In my previous post, Time as a Defensive Barrier, I explored how the rise of machine learning could obliterate one of our most relied-upon defenses: the time it takes a human attacker to understand and exploit a network.
That post got one and only one comment, but I took it to heart: someone was skeptical that an AI could explore unknown vulnerabilities in an environment.
This got me thinking: if I trained a neural network on how to exploit current systems, then when presented with a new system it would also know the most probable ways to exploit it. The idea was that this neural network could be baked into a container with tools such as Metasploit or nmap, and it would be able to exploit systems, elevate privileges, and move through the network at a speed much faster than our processes would expect.
In short, Drako was a system designed to train a reinforcement learning (RL) agent to “hack” vulnerable boxes using Metasploit and other tools, effectively creating what I’ve called a Pentesting AI (PAI) to challenge the assumption that this was not possible. In this post, I’ll go through Drako’s high-level design, break down its components piece by piece, explain how its training process works, and share some other details and learnings from along the way.
I will first go through the different parts of the system to provide a high-level overview:
Exploration Engine
The exploration engine was the part that provided all the “learnings” for the neural network to improve. It consisted of many containers, each with an assigned HackTheBox machine or a virtual machine running on the same system (created from VulnHub). There was also an API to check the health of the target machines (in case they broke for some reason), and every action (an attempt with tool X to exploit the target, plus its success/failure) was written to the database through the prediction API.
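A minimal sketch of what one exploration worker's loop could look like. To be clear, the endpoint names, the `PREDICTION_API` URL, and `run_tool` are all hypothetical stand-ins I made up for illustration, not names from the actual code:

```python
import time
import requests

PREDICTION_API = "http://prediction-api:8080"  # hypothetical internal endpoint
TARGET_ID = "vulnhub-box-17"                   # hypothetical target identifier

def run_tool(action, target_id):
    # Stand-in for the real runner that invoked nmap/Metasploit inside the container.
    return {"success": False, "new_state": {}}

def target_is_healthy(target_id):
    # Ask the health API whether the assigned box is still up and responsive.
    resp = requests.get(f"{PREDICTION_API}/health/{target_id}", timeout=10)
    return resp.ok and resp.json().get("status") == "healthy"

def exploration_step(state):
    # Ask the prediction API which action (tool + arguments) to try next.
    action = requests.post(f"{PREDICTION_API}/predict", json=state, timeout=30).json()
    result = run_tool(action, TARGET_ID)
    # Record the attempt and its success/failure so the learners can train on it.
    requests.post(f"{PREDICTION_API}/record", json={
        "target": TARGET_ID, "state": state,
        "action": action, "success": result["success"],
    }, timeout=30)
    return result["new_state"]

state = {}
while True:
    if not target_is_healthy(TARGET_ID):
        time.sleep(60)  # give the orchestrator time to reset the box
        continue
    state = exploration_step(state)
```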
Training Engine
The training engine would have:
Jupyter notebook: where I would play with different algorithms to train the models and test things.
Network Learner: where I would run the ML model to train on network events (e.g. a Metasploit exploit for an Apache web server).
Privesc Learner: where I would run the ML model to train on privilege escalation events (e.g. an exploit on a Windows box).
Orchestrator: the component that ensured we had N containers running and feeding information to the learners (a rough sketch of that loop follows below).
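A minimal sketch of the orchestrator loop, assuming the Docker SDK for Python; the image name, label, and fleet size are made up, not from the actual project:

```python
import time
import docker  # Docker SDK for Python (pip install docker)

TARGET_COUNT = 20                          # N containers we want exploring at all times
WORKER_IMAGE = "drako/exploration-worker"  # hypothetical image name
WORKER_LABELS = {"role": "exploration"}    # hypothetical label to track our workers

client = docker.from_env()

while True:
    # Count currently running exploration workers by their label.
    running = client.containers.list(filters={"label": "role=exploration"})

    # Start replacements until we are back at the desired fleet size.
    for _ in range(TARGET_COUNT - len(running)):
        client.containers.run(WORKER_IMAGE, detach=True, labels=WORKER_LABELS)

    time.sleep(30)
```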
Recommendation Engine
Interestingly, in an unexpectedly prescient move, I created a website with a chatbot (built on AWS Lex). The idea was that you would talk to the chatbot and it would help you hack a box: it would ask you to upload an nmap scan, then suggest an exploit or tool; you would provide the new information you found, it would update its state, and so on. This seems totally crazy now, in a world of LLMs, but at the time I thought it was a really revolutionary idea. The plan was that I would then be able to train the model from the feedback of people talking with the bot (they would say they tried action X and got result Y, and we would train the model on that).
And this is how it all worked together:
How it explored and learned
Above is a video of the agent exploring different possible states of the machine (each state might be finding out that a particular port is open, some information on the operating system, or similar). Each circle is the information the agent knows at one point, and after performing an action (a line) it moves to a different state. Red circles meant execute permission on the box, green circles meant root (or system administrator) privileges.
The idea was that the model would learn the fastest path from no information to root on the box. Using reinforcement learning in the style of Deep Q-Networks (at the end I was testing between DoubleDQN, PrioDQN, and PlannerDQN), it would teach the network that, for example, with no information you should start with nmap; after that, if port 80 is open and you are missing some information, it is probably best to check the web server; if it is Apache version X, then exploit Y will probably work; and so on. The reason I split the network learner from the privilege escalation learner was simplicity and keeping the number of parameters in the model smaller (I had to represent the state with each UDP and TCP port, and so on, as different features of the network).
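To make that concrete, here is a heavily simplified sketch of the kind of Q-network this implies, in PyTorch. The feature and action counts are made up; the real state encoding covered every TCP/UDP port and the rest of the gathered facts:

```python
import torch
import torch.nn as nn

N_FEATURES = 256  # made-up size: one slot per fact the agent may know (ports, OS, versions)
N_ACTIONS = 64    # made-up size: nmap scans, Metasploit modules, etc.

class QNetwork(nn.Module):
    """Maps the agent's current knowledge of the box to a value for each action."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(N_FEATURES, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, N_ACTIONS),
        )

    def forward(self, state):
        return self.layers(state)

# With no information the state vector is all zeros, and over training the
# network should learn that the highest-value action from there is an nmap scan.
net = QNetwork()
empty_state = torch.zeros(1, N_FEATURES)
best_action = net(empty_state).argmax(dim=1).item()
```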
Some of the biggest challenges with the project
Scale
Obvious in hindsight, but initially I thought it would be a lot simpler to train a system from scratch, probably due to my lack of knowledge. In practice, when starting from a blank neural network, the number of actions I had to take, and the resources they implied, meant the project would probably not go far without a huge investment in resources (in particular, I found RAM for running the vulnerable software and Metasploit APIs to be the most difficult thing to scale) or a way to simulate the environment.
Metasploit API
I distinctly remember fighting over and over again with the Metasploit API: having to patch it, dealing with the amount of resources it required, finding ways to properly parse the results of successful exploitation, and proxying actions to it. Still great memories, but it was clearly not designed to be run at scale like this. The code for that part is in lib/Common/Exploration/Metasploit/.
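For reference, driving Metasploit programmatically looks roughly like this today with the pymetasploit3 library (not the exact client I used back then; the target address and module are just examples):

```python
from pymetasploit3.msfrpc import MsfRpcClient

# Connect to a running msfrpcd instance (started with e.g. `msfrpcd -P password`).
client = MsfRpcClient("password", port=55553)

# Pick an exploit module and configure it against the target box.
exploit = client.modules.use("exploit", "unix/ftp/vsftpd_234_backdoor")
exploit["RHOSTS"] = "10.0.0.5"  # example target address

# Fire the exploit with a payload; success shows up as a new session.
exploit.execute(payload="cmd/unix/interact")

# Figuring out what actually happened means polling the session list afterwards,
# which is one of the parts that was painful to do reliably at scale.
print(client.sessions.list)
```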
Providing the state image to the model
This was another tricky part, as mentioned before: how to turn the state the Pentesting AI knows into something the neural network can process correctly. For that I had a raw observation (RawObservation) and a different one after processing (ProcessedObservation), used to calculate things like bonuses for new sessions (e.g. 400 for privilege escalation) and penalties for errors, as well as to confirm things were correct (if nmap returned that a port was open, I would do a TCP check, since the output was sometimes wrong).
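A rough sketch of that raw-to-processed step. The class fields and most reward numbers are illustrative assumptions; only the 400-point privilege-escalation bonus comes from the paragraph above:

```python
import socket
from dataclasses import dataclass

@dataclass
class RawObservation:
    host: str
    open_ports: list   # ports a tool like nmap claimed were open
    new_session: bool  # did the last action open a shell?
    privesc: bool      # did we go from user to root/SYSTEM?
    error: bool        # did the tool fail or crash?

def port_really_open(host, port, timeout=2.0):
    # nmap output was sometimes wrong, so confirm with a plain TCP connect.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def process(raw: RawObservation):
    verified_ports = [p for p in raw.open_ports if port_really_open(raw.host, p)]
    reward = 0.0
    if raw.new_session:
        reward += 100.0  # illustrative bonus for opening a new session
    if raw.privesc:
        reward += 400.0  # the privilege-escalation bonus mentioned above
    if raw.error:
        reward -= 10.0   # illustrative penalty for failed actions
    return {"ports": verified_ports, "reward": reward}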
Training
The training system (lib/Training/ and lib/Common/Training) has the different things I had to implement at the time, either from scratch or copied from reinforcement learning books. I hardly remember all the details, but I do remember how much I struggled with the different alternatives, and why the Jupyter notebook was key. It took a long time to get the experimental data, but once you had it you could test the different models on their predictions. What was harder was figuring out which one was best to add as a predictor to guide the model, and its selection method (like greedy or epsilon-greedy), since that impacted exploration and performance.
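The greedy versus epsilon-greedy choice boils down to how often the predictor is allowed to override exploration. A toy sketch of both selection methods (the Q-values here are fabricated for illustration):

```python
import random

def greedy(q_values):
    # Always take the action the current model rates highest.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a random action (exploration), otherwise
    # defer to the model (exploitation). Tuning epsilon was exactly the kind
    # of trade-off that affected exploration speed and performance.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return greedy(q_values)

q = [0.1, 0.7, 0.2]  # fabricated Q-values for three possible actions
print(greedy(q), epsilon_greedy(q, epsilon=0.3))
```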
Debugging
As shown by the fact that I had to create a video of the graph in action, I had to do a lot of work to understand what the model was doing: ways to print the information, reports on it, and videos of the exploration paths to understand when things were working properly or not. At the time I remember fighting a lot with the graph visualisation, something an LLM could probably output in a single prompt today. (The code for the visualisation is in lib/Presentation/Visualizer/.)
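A minimal version of that kind of graph rendering with networkx; the state names and edges are invented examples, following the red/green colouring convention described earlier:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Each node is a knowledge state; edges are the actions that moved between them.
G = nx.DiGraph()
G.add_edge("no info", "port 80 open", action="nmap")
G.add_edge("port 80 open", "user shell", action="exploit apache")
G.add_edge("user shell", "root", action="privesc exploit")

# Red for execute permission, green for root, grey for everything else.
colors = {"user shell": "red", "root": "green"}
node_colors = [colors.get(n, "lightgrey") for n in G.nodes]

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color=node_colors, node_size=1800, font_size=8)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "action"))
plt.show()
```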
You can find the source code here: https://github.com/89berner/drako.git
Some additional notes I remember
- Such a system required a LOT of RAM. I had to buy a used server with 512 GB of RAM to run the number of VMs I needed to train in parallel. CPU and RAM were the real bottleneck, since training runs required little GPU and the time it took to get experience from each exploit was much higher.
- It goes without saying that ML is hard. I had to read many books and take a couple of Coursera courses and specialisations on reinforcement learning to get it working to the state I wanted, and I was still only scratching the surface of the possibilities. The most interesting thing I explored was running simulations with the data, aiming for something similar to what is now common practice in robotics.
- Before hosting the vulnerable VMs myself, I started with HackTheBox, but its limits and speed made me prefer local VMs. A lot of automation had to be built to automatically create, reset, and reassign vulnerable boxes (see the sketch after this list).
- I also explored the idea of creating unlimited machines with random configurations, the idea being that it might discover new combinations of software versions vulnerable to exploits. Again, this was limited by the resources it would take, but it was a nice experiment.
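The box-management automation in the third note amounted to scripting the hypervisor. A sketch of the reset path, assuming VirtualBox and its VBoxManage CLI (the VM and snapshot names are made up):

```python
import subprocess

def reset_box(vm_name, snapshot="clean-install"):
    """Power off a vulnerable VM and restore it to its known-vulnerable snapshot."""
    # Force power-off; the VM may be wedged after a failed exploit attempt.
    subprocess.run(["VBoxManage", "controlvm", vm_name, "poweroff"], check=False)
    # Roll back to the snapshot taken right after importing the VulnHub image.
    subprocess.run(["VBoxManage", "snapshot", vm_name, "restore", snapshot], check=True)
    # Boot it headless so an exploration worker can be reassigned to it.
    subprocess.run(["VBoxManage", "startvm", vm_name, "--type", "headless"], check=True)

reset_box("vulnhub-kioptrix-1")  # hypothetical VM name
```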