Perilous Tech

Risks at the Intersection of Technology and Humanity

[Image: roulette table]

Although AI has taken a hit in the past few weeks, the vibes are still strong and infecting every part of our lives. Vibe coding, vibe analytics, and even vibe thinking, because, well, nothing says “old” like having thoughts grounded in reality. However, an interesting trend is emerging in software development, one that could have far-reaching implications for the future of software. It’s a kind of code roulette where developers don’t know what code will execute at runtime. Then again, what’s life without a little runtime suspense?

Development and Degraded Performance

The world runs on software, so any trend that degrades software quality or increases security issues has an outsized impact on the world around us. We’ve all witnessed this, whether it’s the video conferencing app that periodically crashes after an update or a UI refresh that makes an application more difficult to use.

Traditionally, developers write code by hand, copy code snippets, and lean on frameworks, skeleton code, libraries, and many other building blocks to create software. Developers may even use generative AI tools to autocomplete code snippets or generate whole programs. This code is then packaged up and hosted for users, and it stays the same until updates or patches are applied.

But in this new paradigm, code and potentially logic are constantly changing inside the running application. This is because developers are outsourcing functional components of their applications to LLMs, a trend I predicted back in 2023 in The Brave New World of Degraded Performance. In that post, I covered the impacts of this trend, highlighting the degraded performance that results from swapping known, reliable methods for unknown, non-deterministic ones. This paradigm leads to the enshittification of applications and platforms.

In a simplified context, instead of developers writing out a complete function using code, they’d bundle up variables and ask an LLM to do it. For simplicity’s sake, imagine a function that determines whether a student passes or fails based on a few values.

def pass_fail(grade, project, class_time):
    if grade >= 70 and project == "completed" and class_time >= 50:
        return "Pass"
    else:
        return "Fail"

If a developer decided to outsource this functionality to an LLM inside their application, it may look something like this.

prompt_pass = """You are standing in for a teacher, determining whether a student passes or fails a class.
You will use several values to determine whether the student passes or fails:

The grade the student received: {grade}
Whether they completed the class project: {project}
The amount of class time the student attended (in minutes): {class_time}

The logic should follow these rules:
1. If the grade is 70 or higher
2. If the project is completed
3. If the time in class is 50 minutes or more

If these 3 conditions are met, the student passes. Otherwise, the student fails.

Based on these criteria, return a single word: "Pass" or "Fail". It's important to return only a single
word.
"""

from google import genai

client = genai.Client()  # client setup shown for completeness; assumes the google-genai SDK with an API key configured in the environment
prompt = prompt_pass.format(grade=grade, project=project, class_time=class_time)
response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
print(response.text)

As you can see, one of these examples contains the logic for the function inside the application, and the other keeps the logic outside the application. The prompt is indeed visible inside the application, but the actual logic exists somewhere in the black box of LLM land.

The code example has greater visibility and is far more auditable: the logic can be examined, which makes it far easier to debug when issues arise, and, of course, it’s explainable. The real problem lies in execution.

The written Python function gives you the same result for the same input data every single time, without fail. The natural language approach, not so much. With this non-deterministic approach, you are not guaranteed the same answer every time. Worse yet, when it’s used for critical decisions and functionality, the application takes on squishy, malleable characteristics, meaning users can potentially manipulate it like Play-Doh.
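If you want to see the squishiness for yourself, a quick way is to replay the exact same prompt a handful of times and tally the answers. The sketch below reuses the client and prompt from the snippet above; the exact counts will vary from run to run, which is precisely the point.

from collections import Counter

# Tally the answers from repeated, identical calls. Assumes the client and prompt
# from the earlier snippet are already defined.
answers = Counter()
for _ in range(10):
    response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
    answers[response.text.strip()] += 1

print(answers)  # e.g. Counter({'Pass': 9, 'Fail': 1}) for a borderline student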

At first glance, this example appears silly: writing out the logic in natural language seems more burdensome than using the simple Python function, not to mention slower and more expensive. But looks can be deceiving. People are increasingly opting for the natural language approach, particularly those with only minimal Python knowledge, and it’s more familiar to anyone accustomed to interfaces like ChatGPT.

Execute and Pray

Now let’s take a look at another scenario: a developer wants to generate a scatter plot using the Plotly library. We have some data for the X and Y axes and use Plotly Express, the high-level interface for Plotly (as a developer might when plotting something this simple).

import plotly.express as px

xdata = [1, 2, 3, 4, 5]
ydata = [1, 7, 9, 11, 13]

fig = px.scatter(x=xdata, y=ydata)
fig.show()

Here is the result in all its stunning glory.

[Image: a simple scatter plot]

This is a simplified example, but we can clearly see the code that generated the plot and be certain that this code will execute during the application’s runtime. There is control over the imports and other aspects of execution, which also makes it auditable and provable.

Now, what happens when a developer allows modification of their code at runtime? In the following example, instead of writing out the Plotly code to generate a scatter plot, the developer requests that code be generated from an LLM to create the graph, then executes the resulting code.

prompt_vis = """You are an amazing super awesome Python developer that excels at creating data visualizations using Plotly. Your task is to create a scatter plot using the following data:

Data for the x axis: {xdata}
Data for the y axis: {ydata}

Please write the Python code to generate this plot. Only return Python code and no explanations or 
comments.
"""

prompt = prompt_vis.format(xdata=xdata, ydata=ydata)
response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)

# clean_response() strips non-Python text, such as triple backticks, from the reply (more on that below)
exec(clean_response(response.text))

As you can see from the Plotly code in this example… Of course, you can’t see it, because the code doesn’t exist until the function is called at runtime. If you’re curious, the first run generated the following code once the response was cleaned and made appropriate for execution.

import plotly.graph_objects as go

x_data = [1, 2, 3, 4, 5]
y_data = [1, 7, 9, 11, 13]

fig = go.Figure(data=[go.Scatter(x=x_data, y=y_data, mode='markers')])

fig.show()

The AI-generated code creates the same graph as the written-out code in the previous example, despite being different. You may be wondering what the big deal is since the result is the same. There are several reasons for concern, but primarily, allowing an LLM to generate code at runtime is not robust and leads to unexpected outcomes, including non-functional code, incorrect code, and even vulnerable code.

For a simple example like the one shown in this post, the chances of getting the same or incredibly similar code back from the LLM are high, but not guaranteed. For more complex examples, the kind developers actually want to use this approach for, the generated code is far more likely to change from run to run.

Additionally, I implemented a quick cleaning function called clean_response to remove non-Python elements, such as text and triple backticks, from the response. The LLM can introduce additional unexpected characters that break my cleaning function and make my application fail.
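For context, that cleaning function is roughly the shape of the sketch below. This is a simplified stand-in, not the exact code I ran, and its fragility is the point: anything the model returns outside the expected fence pattern slips through untouched.

import re

def clean_response(text):
    # Hypothetical stand-in for the cleaning function described above: pull the
    # contents out of a Markdown code fence if one exists, otherwise hope the reply
    # is already bare Python.
    match = re.search(r"```(?:python)?\s*(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

The list goes on and on, but a larger danger lurks in the background.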

Whose Code Is It Anyway?

If you are versed in security and familiar with Python, you may have noticed something in the LLM example: the use of the Python exec() function. The exec() and eval() functions in Python are fun because they directly execute their input. Fun as in, dangerous. For example, if an attacker can inject input into the application, they can affect what code gets executed, leading to a condition known as Remote Code Execution (RCE).
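To make that concrete, here’s a deliberately trivial illustration. The payload string is hypothetical and harmless, but whatever reaches exec() runs with the application’s privileges.

# Deliberately trivial illustration: any text that reaches exec() runs with the
# application's privileges. This payload is harmless, but it could just as easily
# read secrets or open a reverse shell.
untrusted = "__import__('os').system('whoami')"  # hypothetical attacker-controlled string
exec(untrusted)  # prints the account the application is running as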

An RCE is a type of arbitrary code execution in which an attacker can execute their own commands remotely, completely compromising the system running the vulnerable application. They can use this access to steal secrets, spread malware, pivot to other systems, or potentially backdoor the system itself. Keep in mind, this system may be a company’s server, cloud infrastructure, or your own machine.

Anyone following security issues in AI development is aware that RCEs are flying off the shelves at an alarming rate. A condition previously considered a rarity is becoming common. We even commented during our Black Hat USA presentation that it was strange to see people praising CISA for promoting memory-safe languages to avoid things like remote code execution, while at the same time praising organizations essentially building RCE-as-a-Service. Some of this is mind-boggling, since in many cases, outsourcing these functions isn’t even a better approach. In the previous example, writing out the Plotly code instead of generating it at runtime is relatively easy, more efficient, and far more robust.

Up until AI came along, the use of Python’s exec() was considered poor coding practice and dangerous. Now, developers shrug and state that’s just how applications work. As a matter of fact, agent platforms like HuggingFace’s smolagents use code execution by default. This should be a wake-up call. We dynamically generate code, provide deep access and the ability to call tools, all with a lack of visibility. What could possibly go wrong???

Not only have developers chosen paradigms to generate and execute code at runtime, but worse yet, they’ve begun to perform this execution in agents with user (aka attacker) input, executing this input blindly in the application. In our presentation titled Hack To The Future: Owning AI-Powered Tools With Old School Vulns at Black Hat USA this year, we refer to this trend as Blind Execution of Input, which is the purposeful execution of input without any protection against negative consequences. This condition certainly leads to RCE and other unintended consequences, providing attackers with a significantly larger attack surface to exploit.
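Stripped down to its essence, the pattern looks something like this. The names and prompt are illustrative, reusing the client and clean_response helper from earlier, but the shape should look familiar.

# A distilled, hypothetical version of Blind Execution of Input: whatever the user
# types is folded into the prompt, and whatever the model returns is executed without
# inspection, so anyone who can influence user_request can steer the code that runs.
user_request = input("What should I plot? ")  # attacker-controllable
prompt = "Write Python code to do the following. Return only code.\n\n" + user_request
response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
exec(clean_response(response.text))  # blind execution of input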

An application that takes user input and combines it with LLM functionality is a recipe for a bad time from a security perspective. Another common theme in our presentation, as well as in those of other presenters at Black Hat, is that if an attacker can get their data into your generative AI-based system, you can’t trust the output.

Things Will Get Worse

Using the outsourced approach when a more predictable, deterministic approach is a better fit will continue to degrade software from a reliability and security perspective, and it will have an impact on the future of software development.

Vulnerabilities in AI software have made exploitation as easy as it was in the 1990s. This was the “old school” hint in the title of our talk. This isn’t a good thing, because the 90s were a sort of free-for-all. Not only that, but in the 90s, we often had to live with vulnerabilities in systems and applications. For example, one of the first vulnerabilities I discovered, against menuset on Windows 3.1, was impossible to fix. There were no mitigations, and most people were unaware of its existence.

As the outsourcing of logic to LLMs accelerates, things will worsen not only due to incorrect output and hallucinations but also from a security perspective. Anyone paying attention to the constant parade of vulnerabilities in AI-powered software can see this trend with their own eyes. These vulnerabilities are often found in large, mature organizations with dedicated security processes and teams in place to support them. Now, consider startups and organizations that implement their own experiments using non-deterministic software, often with little understanding of how these systems can be manipulated. It’s become a game of speed above everything else.

As I’ve said from the beginning of the generative AI craze, the only way to address these issues is architecturally. Most of AI security is just application and product security, and organizations without these programs in place are in trouble. If proper architecture, design, isolation, secrets management, security testing, threat modeling, and a host of other activities weren’t considered table stakes before, they certainly are now. And, perhaps unsurprisingly, they still aren’t being done. Anyone working for a security organization sees this every day.

In essence, developers need to design their applications to be robust against failures and attacks. It helps to design them as though an attacker can manipulate and compromise them, and work outward from that premise. As the adage goes, an attacker only needs to be successful once; a defender needs to be successful every time. That makes something that sounds great in theory, like being 90% effective, far less impressive in practice.

Keep in mind that performing a code review won’t provide the same visibility it traditionally has. This should be obvious, since the code that would be audited doesn’t exist until runtime. You’ll have to pay more attention to validation routines and the processing of outputs, putting huge question marks over the black box in the middle. And, of course, you’ll need to ensure the application is properly isolated.
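As one small example of what I mean by validation routines, the sketch below (my own illustration, with a hypothetical looks_acceptable helper) parses generated code before execution and rejects anything that imports outside an allowlist or reaches for obviously dangerous builtins. It’s narrow and easy to bypass, which is exactly why the isolation still matters.

import ast

ALLOWED_IMPORTS = {"plotly"}  # illustrative allowlist for the scatter plot example

def looks_acceptable(code: str) -> bool:
    # Parse the generated code and refuse to run anything that imports outside the
    # allowlist or calls obviously dangerous builtins. Narrow and bypassable, so it
    # complements isolation rather than replacing it.
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] not in ALLOWED_IMPORTS for alias in node.names):
                return False
        if isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                return False
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in {"exec", "eval", "__import__", "compile", "open"}:
                return False
    return True

In the Plotly scenario, you’d run the generated code through something like looks_acceptable(clean_response(response.text)) and refuse to execute anything that fails the check, ideally inside a sandbox on top of that.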

Some may suggest instrumenting applications with functionality to perform runtime analysis on the generated code. Sure, it’s possible, but the performance hit would be significant, and even this is far from a silver bullet. You might not even get the value you think you’re getting from the instrumentation, and you’d have to know ahead of time which issues you’re trying to prevent. That is, unless you plan to layer more LLMs on top of LLMs in a spray-and-pray configuration.

To keep this grounded, all AI risk is use-case dependent. AI models don’t do anything until they’re packaged into applications and put to use. There may be cases where reliability, performance, and even security are of lesser concern. Fair enough, but it’s a mistake to treat all applications as though they fall into this category, and it’s far too easy to overlook something important and dismiss it as insignificant.

If you work at an organization that isn’t building these applications and think you’re safe, you might want to think again, because you are at the mercy of third-party applications and libraries. It would be best to start asking your vendors hard questions about the security practices behind the applications you purchase, especially applications that use generative AI to generate code and execute it at runtime.

Near the end of our presentation, we had some advice.

[Slide: a robot with a big brain and text]
[Slide: a lazy robot leaning against a wall and some text]

Whether you’re outsourcing the logic of an application to LLMs or having the LLM dynamically generate code, assume these are squishy, manipulable systems that are going to do things you don’t want them to do. They will be talked into taking actions you didn’t intend, and they will fail and hallucinate in ways you don’t expect. Starting from this premise gives you a proper foundation for deploying controls that add some resilience to these systems. Of course, not taking these steps means your applications will contribute to the ongoing dumpster fire rodeo.
