Security Vulnerabilities
The current research and the general state of things suggest that LLM-based systems are not yet ready for prime time.
This arXiv paper (https://arxiv.org/abs/2312.12575) suggests that the currently available LLM models bring little to no value when it comes to finding security vulnerabilities:
Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models like `PaLM2' and `GPT-4': by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general purpose security assistants.
While I understand that arXiv papers are not peer-reviewed, this particular paper appears to have been funded by the NSF and the authors' institutions seem to be well known.
Moreover, DARPA is running a multi-year project (https://aicyberchallenge.com/); the semi-finals have just concluded and the finalists have a year to mature their technology. Team Atlanta was successful in finding one vulnerability in the SQLite3 engine: https://team-atlanta.github.io/blog/post-atl/
Their blogpost makes a lot of interesting points about why searching for security vulnerabilities with LLMs is still a challenging task and requires fine-tuned LLM models.
A similar result has been observed with Project Zero's Big Sleep LLM agent (Google), which was able to discover a new vulnerability in SQLite3: https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html
We believe this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software. Earlier this year at the DARPA AIxCC event, Team Atlanta discovered a null-pointer dereference in SQLite, which inspired us to use it for our testing to see if we could find a more serious vulnerability.
One of the more promising areas here is LLM-assisted writing of fuzz targets, as described here:
https://security.googleblog.com/2023/08/ai-powered-fuzzing-breaking-bug-hunting.html
I've looked into it, and since the tool works without any domain knowledge, its suggestions for more fuzz targets are useless: https://introspector.oss-fuzz.com/project-profile?project=bind9
As BIND 9 is part of the OSS-Fuzz project and we provide our own fuzz targets, this is definitely an area where it would be useful to provide more fuzz targets in the future. But that is mostly tangential to LLM-assisted bug hunting, as we expect that Google will integrate their improvements into the OSS-Fuzz project as a whole.
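For readers less familiar with fuzzing, a fuzz target is just a small C entry point that feeds attacker-controlled bytes into one parsing routine; the sketch below uses the standard libFuzzer/OSS-Fuzz interface, with a hypothetical parse_message() standing in for whatever BIND 9 routine a real target would exercise.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical routine under test; a real target would call into BIND 9. */
int parse_message(const uint8_t *buf, size_t len);

/* libFuzzer/OSS-Fuzz entry point, called repeatedly with mutated inputs. */
int
LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
	(void)parse_message(data, size);
	return 0;
}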
Bugs in general
I've massaged GitHub Copilot for a while and it gave me no useful insights into the code when using the /fix command. Again, it works with no domain knowledge or understanding of the code.
As an example, I asked it to analyze the query_redirect() function and this was the result:
• Added Error Handling:
  • Added a log message to indicate that both redirect and redirect2 functions failed.
  • Changed the return value to ISC_R_FAILURE to properly indicate the failure case.
Both changes are completely wrong.
In other cases, it pretends that it found something, only to tell you that the code is already correct.
I used GitHub Copilot to look at the function where we had just fixed a concurrency bug (#5024), and it said:
• Fixed Locking Order:
• Ensured that locks are acquired and released in a consistent order to avoid potential deadlocks.
So first it says that it has "Fixed Locking Order", only to follow up with this:
The provided code snippet already follows the correct locking order, so no changes were necessary.
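For reference, "consistent locking order" simply means that every code path that needs the same set of locks acquires them in the same order; below is a minimal, generic sketch with plain pthread mutexes (not the code from #5024) illustrating the rule Copilot claimed to have enforced.

#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/*
 * Every path that needs both locks takes lock_a before lock_b.
 * If some other path took lock_b first and then waited for lock_a,
 * the two threads could deadlock.
 */
static void
update_shared_state(void) {
	pthread_mutex_lock(&lock_a);
	pthread_mutex_lock(&lock_b);
	/* ... modify the state protected by both locks ... */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
}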
In one case, it even produced an outright lie:
• The function `free_gluetable` iterates over the `glue_table` and deletes entries while holding the `rcu_read_lock`. This can lead to issues because `cds_lfht_del` should not be called while holding the `rcu_read_lock`.
Which is outright nonsense:
/*
* cds_lfht_del - remove node pointed to by iterator from hash table.
* @ht: the hash table.
* @node: the node to delete.
*
* Return 0 if the node is successfully removed, negative value
* otherwise.
* Deleting a NULL node or an already removed node will fail with a
* negative value.
* Node can be looked up with cds_lfht_lookup and cds_lfht_next,
* followed by use of cds_lfht_iter_get_node.
* RCU read-side lock must be held between lookup and removal.
* Call with `rcu_read_lock` held.
* Threads calling this API need to be registered RCU read-side threads.
* After successful removal, a grace period must be waited for before
* freeing or re-using the memory reserved for old node (which can be
* accessed with `cds_lfht_iter_get_node`).
* Upon success, this function issues a full memory barrier before and
* after its atomic commit. Upon failure, this function does not issue
* any memory barrier.
*/
Note in particular the "RCU read-side lock must be held between lookup and removal. Call with `rcu_read_lock` held." part.
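To spell out what the documentation above requires, here is a minimal, self-contained sketch (not BIND 9 code; the entry type and hash handling are made up for illustration) of the documented liburcu pattern: the lookup and cds_lfht_del() happen under the RCU read-side lock, and the node is freed only after a grace period.

#include <stdlib.h>
#include <string.h>
#include <urcu.h>		/* rcu_read_lock(), synchronize_rcu() */
#include <urcu/rculfhash.h>	/* cds_lfht_*() */

struct entry {
	char key[32];
	struct cds_lfht_node lfht_node;
};

static int
match_entry(struct cds_lfht_node *node, const void *key) {
	struct entry *e = caa_container_of(node, struct entry, lfht_node);
	return strcmp(e->key, key) == 0;
}

/* Assumes the calling thread is a registered RCU read-side thread. */
static void
delete_entry(struct cds_lfht *ht, const char *key, unsigned long hash) {
	struct cds_lfht_iter iter;
	struct cds_lfht_node *node;

	rcu_read_lock();
	cds_lfht_lookup(ht, hash, match_entry, key, &iter);
	node = cds_lfht_iter_get_node(&iter);
	if (node != NULL) {
		/* Correct: the RCU read-side lock *must* be held here. */
		(void)cds_lfht_del(ht, node);
	}
	rcu_read_unlock();

	if (node != NULL) {
		/* Wait for a grace period before freeing the old node. */
		synchronize_rcu();
		free(caa_container_of(node, struct entry, lfht_node));
	}
}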
Even articles that suggest complementing static analysis tools with AI testing, such as https://medium.com/@medaminefrg/analyzing-code-with-ai-how-llm-can-identify-a-vulnerability-missed-by-formal-verification-and-daa8d0f6e1d0, say this:
• Vulnerability Rate: Over 50% of the code samples generated by GPT-3.5 exhibited vulnerabilities that could lead to security risks or functional errors. This high incidence rate underscores the challenges of using AI-generated code without subsequent human oversight or additional verification tools.
• Types of Vulnerabilities: The vulnerabilities ranged from memory leaks, as highlighted in the previously discussed example, to more severe issues, such as buffer overflows and unchecked pointer operations. These issues are particularly concerning as they can lead to crashes, unpredictable behavior, or exploitable security weaknesses.
• ESBMC stands out not only for its ability to detect vulnerabilities but also for its potential to assist in patching these issues. ESBMC can provide developers with crucial insights into what needs to be corrected to enhance security and functionality by identifying where the code fails to meet certain formal specifications.
Tangentially, ESBMC looks like something that we should look into: https://github.com/esbmc/esbmc and http://esbmc.org/. Not AI, but definitely an interesting tool.
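To make the classes of bugs mentioned above concrete, here is a deliberately broken, made-up snippet of the kind a bounded model checker such as ESBMC is designed to flag (an unchecked copy into a fixed-size buffer); it is not taken from any real codebase.

#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv) {
	char buf[16];

	if (argc > 1) {
		/*
		 * Buffer overflow if argv[1] is 16 bytes or longer; a
		 * bounded model checker can derive an input that violates
		 * the array bound and report it as a counterexample.
		 */
		strcpy(buf, argv[1]);
		printf("%s\n", buf);
	}
	return 0;
}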
Unit testing
I concur with Tomek that providing a boilerplate for unit testing is where we can get the most value. That said, the generated unit tests are fairly basic, don't follow our style and usage patterns, and the output code doesn't even compile as it contains errors. Just to show you a single example:
static void test_qpzonedb_create(void **state) {
isc_mem_t *mctx = NULL;
dns_name_t origin;
dns_db_t *db = NULL;
isc_result_t result;
isc_mem_create(0, 0, &mctx);
dns_name_init(&origin, NULL);
dns_name_fromstring(&origin, "example.org.", 0, NULL);
result = dns__qpzone_create(mctx, &origin, dns_dbtype_zone, dns_rdataclass_in, 0, NULL, NULL, &db);
assert_int_equal(result, ISC_R_SUCCESS);
assert_non_null(db);
dns_db_detach(&db);
isc_mem_detach(&mctx);
}
- `isc_mem_create()` takes a single argument, not three
- `mctx` is already initialized in our existing tests
- `dns_name_fromstring()` takes five arguments, not four
Also, as I said, the unit test is pretty basic: it only tests the correct path and completely ignores the expected assertions when some of the arguments are NULL.
And the following suggested unit test:
static void test_qpznode_create(void **state) {
qpzonedb_t *qpdb = (qpzonedb_t *)*state;
dns_name_t name;
qpznode_t *node;
dns_name_init(&name, NULL);
dns_name_fromstring(&name, "test.example.org.", 0, NULL);
node = new_qpznode(qpdb, &name);
assert_non_null(node);
assert_true(dns_name_equal(&node->name, &name));
qpznode_unref(node);
}
This expects `*state` to contain an initialized `qpdb`, but the provided snippet doesn't do that.
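For comparison, the way cmocka-based tests normally get an initialized `*state` is via a setup callback registered together with the test. This is a rough sketch under the assumption that fixture helpers like create_test_qpzonedb()/destroy_test_qpzonedb() exist (they are hypothetical; the real fixture would live in the qpzone test code), reusing the types and test function from the snippet above.

#include <stdarg.h>
#include <stddef.h>
#include <setjmp.h>
#include <stdint.h>
#include <cmocka.h>

/* Hypothetical fixture helpers; the real ones would come from the qpzone tests. */
extern qpzonedb_t *create_test_qpzonedb(void);
extern void destroy_test_qpzonedb(qpzonedb_t *qpdb);

/* cmocka runs this before the test and stores the fixture in *state. */
static int
setup_qpzonedb(void **state) {
	*state = create_test_qpzonedb();
	return (*state == NULL) ? -1 : 0;
}

static int
teardown_qpzonedb(void **state) {
	destroy_test_qpzonedb(*state);
	return 0;
}

int
main(void) {
	const struct CMUnitTest tests[] = {
		cmocka_unit_test_setup_teardown(test_qpznode_create,
						setup_qpzonedb,
						teardown_qpzonedb),
	};
	return cmocka_run_group_tests(tests, NULL, NULL);
}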
I kind of fail to see how this would lead to increased productivity when all the provided code is incomplete, outright buggy, and needs to be finished and hand-checked by someone with domain knowledge.
Possible Copyright Infringement
I actually think there's a real risk in using LLMs to generate code, and that's license compliance.
As LLMs were trained on source code with unknown licensing, and they discard the licensing information coupled with the input code, their stochastic output could be covered by copyright held by unknown copyright holders. This could become a huge liability in the future.
Also, we've thrown all the code written by individuals of certain nationalities out of the window, only to accept code of unknown origin? How can we be sure that the code generated by the LLM models hasn't been tainted by a state-level actor?
Call me unconvinced this is a good idea.
I do understand there is the same risk with humans copying stuff from all over the internet, but I feel like there's at least some accountability with humans.
Integration of LLM models into external tooling
I actually think this is the way forward - the organisations providing the tools to find bugs in code are much better suited to integrate any improvements into projects like OSS-Fuzz (by Google) or Coverity Scan, and this is already happening to some extent: https://community.blackduck.com/s/question/0D5Uh000008OKqhKAG/announcement-coverity-202430-release-is-now-available
I expect that those are the useful parts that will survive when the current wave of AI Hype is over.
Integration of LLM models into internal tooling
This is a remark regarding continuous integration. For any tooling to be actually useful in the long term, it needs to be integrated into our continuous integration, so that bugs are caught early and often. Even with external projects like OSS-Fuzz, Coverity Scan or SonarCloud, their output often comes too late. We need as much useful tooling as we can get into our own systems, and we constantly add new tools and re-evaluate old ones. As an example, we have dropped the usage of cppcheck as its signal-to-noise ratio was very low and it took away resources that could have been used to find and fix real bugs in the code.
BIND 9 Team comments
Alessio
Two things I would add to what @ondrej said.
First: I experimented with automatic refactoring using an LLM on the BIND 9 codebase. I tried giving it a list of unused functions to remove from the codebase, using a Python tool called aider.
aider has the advantage that it can retry, giving the LLM the result of the failed edit. This improves subsequent results. It can also automatically git commit intermediate results so that a human can more easily audit the edits.
Overall:
- The LLM was able to successfully remove the functions without my intervention. In particular, it handles comments as well as code (while clang-based tools ignore comments).
- To proceed, it requires you to point it not only to the files that need to be modified, but also to their imports. aider handles this automatically.
- The main issue is "context size", i.e. how much "memory" the LLM has. When you add files + imports, you tend to exceed the context size, so I needed to manually split the refactoring into chunks, which negated the advantage of using an automated tool.
So overall I would consider the experiment a failure, but I'd also want to repeat it in 3-6 months, given that capabilities are improving quickly. Also, I believe Google's LLM has a significantly bigger context size than its competitors; I might want to try that.
Second point: there is variation in the ability of the models of different providers. I believe that Anthropic's models are the best for coding.
Štěpán
I've been using GitHub Copilot Suggestions for the past two weeks.
It helped me pick Lua back up after years of not using it, it's pretty good at writing comments or predicting simple programming steps („you just used a yet to be implemented function and you navigated to a place where it makes sense to implement it, here is the function signature based on the usage“), it prepares for loops, it writes decent log messages.
It sucks when any non-trivial logic is involved, but it has been easy for me to ignore its suggestions in those cases. I don't have Copilot Chat set up, but I don't think this model in particular is anywhere near being able to understand complex code well enough to detect non-trivial bugs in it.
DHCP QA team
In DHCP QA, we have been checking AI capabilities for some time now. First, I will describe code-related experiments and, later, other tools.
- We used the GitHub Copilot feature for a while in the Forge testing tool (system tests for Kea DHCPv4/v6/DDNS), and at the beginning, it just helped with very easy small tasks. We treated this as a 'tab' command completion tool in the shell, which resulted in faster typing.
- I'm using the GPT-4o and o1-preview models to explain code and to translate small parts of code from one programming language to another, and so far it has been extremely helpful.
- Recently, I have been using a new tool called Cursor for non-work-related code; it's a fork of Visual Studio Code; AI is not a plugin, but it's integrated into the editor itself. Simple projects are done within minutes, but I have never used it for any complicated Kea/Stork-related work.
- I've experimented with the feature from OpenAI to create custom chats. Basically, you train selected models with custom data and define how they are supposed to work, how to answer, in what format, etc. I wrote a web crawler that gathered the current RFCs for DNS, DHCPv6 and DHCPv4 and used them to train protocol explainers. This worked surprisingly well; it's not perfect, but with a little bit of work, fine-tuning and maybe more careful data selection it could get there:
  - https://chatgpt.com/g/g-3hvTjBr3I-dns-dnssec-protocol-explainer
  - https://chatgpt.com/g/g-hH6XAsqsa-dhcp6-protocol-explainer
  - https://chatgpt.com/g/g-pC6TFfD2T-dhcp4-protocol-explainer
  - https://chatgpt.com/g/g-6740a40dbf9c81919180cefeadd83c28-kea-arm-helper (disclaimer - it's just an experiment that needs more work if we decide to use it)
I am still digging into other tools for automating simple tasks.
The DHCP QA team spent two days experimenting with AI tools; here are the notes:
@mgodzina notes: My AI (Or LLM autocompletion) Experiment
I could not successfully write tests in the allotted time using AI tools, but AI tools are here to stay with me, and I will incorporate them into my workflow.
The Experiment: I conducted a two-day experiment using the "Cursor" IDE to try writing two new tests in Forge. The tests were supposed to verify lease reclamation and expiration in Kea.
First, I went all in and tried to ask "chat" to write the whole reclamation test for me inside the Forge codebase. My first impressions were good. It included our pre-function marks and added functions for configuring and starting the server. DORA sending functions that we wrote were recognized and used in the test. It looked like it did some magic.
Unfortunately, after trying to run the code and looking at it closely, I found it was just a magic trick with a stuffed bunny inside a hat. Imports were wrong, and that is a thing that most IDEs can do without a fancy AI system. AI could use the correct functions to configure Kea, but it had just guessed the parameters rather than looking around the rest of the Forge codebase for proper ones - of course, it guessed wrong. The steps used for testing were also a bit off. They looked more like a DHCP client than a test that tried to poke around to break some things. I was able to fix some things by nudging AI to refactor it or using "chat" to push it in another direction.
Then, the "old marriage" phase kicked in. For some context, we use a function with parameters to start Kea: "srv_control.start_srv('DHCP', 'started')". Also, the same function with different parameters to stop Kea: "srv_control.start_srv('DHCP', 'stopped')". "Cursor" wanted to use "srv_control.stop_srv('DHCP', 'stopped')", which is a function that does not exist (every IDE would know that based on the imports). I politely asked "chat" to correct it because it does not exist. AI told me that I was wrong and that I could not use the codebase correctly, and the proper function is "srv_control.stop_srv" (that, as I stated before, does not exist). Any attempt to fix this "automatically" failed because AI was "hallucinating" even more into functions that do not exist but would be "logical in the English language." "Hallucinating" was the biggest problem - when AI was wrong, it was very confident in its rightness. We started arguing on the "chat". And it is not wise to argue with an inanimate object...
This experience reinforced my feeling that we should immediately stop using the "AI" name for this technology and use a more proper name: "Predictive Large Language Model".
At this point, I was confident that "AI" would not replace my job anytime soon, so I researched whether it was helpful for me. There are two things where this LLM shined like a diamond:
- Making some isolated functions. For example, a function that would send DORA X amount of times, randomly generating MAC addresses, and return what IP it got; or calculating hex values of an option's content.
- Predicting my next steps when writing code (like the contents of a loop that I start to write, some repeating parts of code, writing comments).
Conclusion: Today, I'm switching from VSCode to Cursor to try incorporating LLM into my daily code writing. I'm sure it will speed up my work, but the question is how much. Another thing that we must consider is the privacy of our non-open source code when using "cloud" tools.
@andrei provided notes as an Etherpad: https://pad.isc.org/p/andrei-ai-experiments-journal