Little things can get you into big trouble.
This has been true for all of human history. One of the most famous descriptions of it is a centuries-old proverb that begins "For want of a nail the [horse]shoe was lost…" and concludes with the entire kingdom being lost "…all for the want of a nail."
Here in the 21st-century world of high tech, it's less about horses and riders and more about tiny defects in the software that runs just about everything. Those defects can lead to anything from inconvenience to catastrophe.
And now that artificial intelligence (AI) is being used to write software, it's the snippet that can get you into big trouble. Which is why, if you're going to jump on the AI bandwagon, you need a way to protect yourself from using snippets illegally–something like an automated snippet scanner. More on that shortly.
But first, the problem. A snippet of software code is pretty much what it sounds like–a tiny piece of a much larger whole. The Oxford Dictionary defines a snippet as “a small piece or brief extract.”
But that doesn't mean a software snippet's impact will necessarily be small. As has been said numerous times, modern software is more assembled than built. And the use of generative AI tools like OpenAI's ChatGPT and GitHub's Copilot to do much of that assembly, drawing on snippets of existing code, is growing rapidly.
According to Stack Overflow's 2023 Developer Survey, 70% of its roughly 89,000 respondents are either already using AI tools in their development process or planning to do so this year.
Much of that code is open source. Which is fine on the face of it. Human developers use open source components all the time because it amounts to free raw material for building software products. It can be modified to suit the needs of those who use it, eliminating the need to reinvent basic software building blocks. The most recent annual Synopsys Open Source Security and Risk Analysis (OSSRA) report found that open source code is in virtually every modern codebase and makes up an average of 76% of the code in them. (Disclosure: I write for Synopsys.)
But free to use doesn’t mean free of obligation–users are legally required to comply with any licensing provisions and attribution requirements in an open source component. If they don’t, it could be costly–very costly. That’s where using AI chatbots to write code can get very risky. And even if you’ve heard it before, you need to hear it again: Software risk is business risk.
Generative AI tools like ChatGPT are built on machine learning models trained on billions of lines of public code, which they draw on to recommend code for users to include in their proprietary projects. But much of that code is either copyrighted or subject to more restrictive licensing conditions, and the chatbots don't always notify users of those requirements or conflicts.
Indeed, a team of Synopsys researchers flagged exactly that problem several months ago, demonstrating that Copilot failed to catch an open source licensing conflict in a snippet of code it added to a project.
The 2023 OSSRA report also found that 54% of the codebases scanned for the report contained licensing conflicts and 31% contained open source with no license or custom licenses.
They weren't the only ones to notice the problem. A federal lawsuit filed last November by four anonymous plaintiffs over Copilot and its underlying OpenAI Codex machine learning model alleged that Copilot is an example of "a brave new world of software piracy."
According to the complaint, “Copilot’s model was trained on billions of lines of publicly available code that is subject to open source licenses–including the plaintiffs’ code,” yet the code offered to Copilot customers “did not include, and in fact removed, copyright and notice information required by the various open source licenses.”
Frank Tomasello, senior sales engineer with the Synopsys Software Integrity Group, noted that while that suit is still pending, “it’s safe to speculate that this could potentially be the inaugural case in a wave of similar legal challenges as AI continues to transform the software development landscape.”
All of this should be a warning to organizations that if they want to reap the benefits of AI-generated code–software written at blazing speed by the equivalent of junior developers who don’t demand salaries, benefits, or vacations–the chatbots they use need intense human oversight.
So how can organizations stay out of that kind of AI-generated licensing trouble? In a recent webinar, Tomasello listed three options.
"The first is what I often call the 'do-nothing' strategy. It sounds kind of funny, but it's a common initial position among organizations when they begin to think about establishing an application security program. They're simply doing nothing to manage their security risk," he said.
“But that equates to neglecting any checks for licensing compliance or copyright issues. It could lead to considerable license risk and significant legal consequences as highlighted by those cases.”
The second option is to try to do it manually. The problem with that? It would take forever, given the number of snippets that would have to be analyzed and the complexity of licensing requirements, and it would still be subject to plain old human error.
Plus, given the pressure on development teams to produce software faster, the manual approach is neither affordable nor practical.
The third and most effective, not to mention most affordable, approach is to “automate the entire process,” Tomasello said.
And that will soon be possible with a Synopsys AI code analysis application programming interface (API) that will analyze code generated by AI and identify open source snippets along with any related license and copyright terms.
The tool isn’t quite ready for prime time–this is a “technology preview” version offered at no cost to selected developers.
Even so, the capability will make it easier and much faster for users to find out whether a code snippet an AI tool imports into a project comes with licensing or attribution requirements.
Tomasello said developers can simply provide code blocks generated by AI chatbots, and the code analysis tool will let them know if any snippets within them match an open source project, and if so, which license that project carries. It will also list the line numbers in both the submitted code and the open source code that match.
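To make that workflow concrete, here's a minimal sketch of what calling such a service from a script might look like. The endpoint URL and the JSON field names below are hypothetical placeholders for illustration only, not the actual Synopsys API.

```python
# Hypothetical illustration only: the endpoint and JSON fields below are
# invented placeholders, not the real Synopsys AI code analysis API.
import requests

ANALYSIS_URL = "https://example.com/api/v1/snippet-analysis"  # placeholder

# An AI-generated code block the developer wants checked before merging.
ai_generated_code = """
def levenshtein(a, b):
    ...
"""

# Submit the block for snippet analysis.
resp = requests.post(ANALYSIS_URL, json={"source": ai_generated_code}, timeout=30)
resp.raise_for_status()

# A plausible response shape: one entry per matched open source snippet,
# with the license and the matching line ranges on both sides.
for match in resp.json().get("matches", []):
    print(f"Matched project: {match['project']}")
    print(f"License:         {match['license']}")
    print(f"Your lines:      {match['submitted_lines']}")
    print(f"Upstream lines:  {match['upstream_lines']}")
```

A check like this can sit in a pre-commit hook or CI job, so license and attribution questions surface before AI-suggested code ever ships.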
The code analysis relies on the Synopsys Black Duck® KnowledgeBase, which contains more than 6 million open source projects and more than 2,750 open source licenses. And it means teams can be confident that they aren't building and shipping applications that contain someone else's protected intellectual property.
“The most important aspect of the KnowledgeBase is its dynamic nature,” Tomasello said, noting that it is continuously being updated. “Typically, with snippet matching, five to seven lines of average source code can generate a match.”
Finally, and just as important, the tool also protects the user’s intellectual property, even though it’s scanning the source code line by line.
“When the scan is performed, the source files end up being run through a one-way cryptographic hash function, which generates a 160-bit hexadecimal hash that is unrecognizable from the source code that was initially scanned,” Tomasello said. “Once your source files are hashed and encrypted, there is no way to decrypt those source files back into their original form.”
Which will ensure that proprietary code is protected, not stolen.
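To illustrate the principle Tomasello describes (this is a generic sketch of one-way hashing, not Synopsys' actual implementation), hashing a small window of normalized source lines with SHA-1 produces a 160-bit hexadecimal digest that cannot be reversed into the code it came from:

```python
# Generic illustration of one-way hashing, not the actual Black Duck mechanism.
# SHA-1 produces a 160-bit digest (40 hex characters) that cannot be reversed
# into the source text it was computed from.
import hashlib

def fingerprint(window_of_lines):
    """Hash a window of normalized source lines into a 160-bit hex digest."""
    normalized = "\n".join(line.strip() for line in window_of_lines)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

source_lines = [
    "def levenshtein(a, b):",
    "    if not a:",
    "        return len(b)",
    "    if not b:",
    "        return len(a)",
]

# Only this 40-character digest needs to be compared against a knowledge base;
# the proprietary source itself never has to be exposed.
print(fingerprint(source_lines))
```

In principle, comparing digests like these against digests of known open source code is enough to detect a snippet match without revealing the submitted source, which is the point Tomasello is making.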
To learn more, visit us here.