Microsoft acquired GitHub in 2018 for $7.5 billion, and since then has been integrating the code repository into its developer tools while maintaining a largely hands-off approach. However, writer, lawyer, and programmer Matthew Butterick has some issues with Microsoft’s machine-learning-based code assistant, GitHub Copilot, and the way it is apparently mishandling open-source licenses.
GitHub Copilot, a plugin available for Visual Studio and other IDEs, works by offering “suggestions” for code completion as you type. The AI-based system is powered by Codex. But it’s the way the AI is trained, or more precisely what it’s trained on, that is becoming a problem for developers like Butterick.
According to OpenAI, the developers of Codex (which is licensed by Microsoft):
Codex was trained on “tens of millions of public repositories” including code on GitHub. Microsoft itself has vaguely described the training material as “billions of lines of public code”. But Copilot researcher Eddie Aftandilian confirmed in a recent podcast (@ 36:40) that Copilot is “train[ed] on public repos on GitHub”.
The problem here is that the public repos Copilot is trained on are licensed, and those licenses typically require attribution when code from the repositories is used. Microsoft has been vague about its use of the code, calling it fair use, but Copilot does not just offer suggestions; it can also emit verbatim chunks of code, as shown by Texas A&M professor and GitHub user Tim Davis:
@github copilot, with "public code" blocked, emits large chunks of my copyrighted code, with no attribution, no LGPL license. For example, the simple prompt "sparse matrix transpose, cs_" produces my cs_transpose in CSparse. My code on left, github on right. Not OK. pic.twitter.com/sqpOThi8nf
— Tim Davis (@DocSparse) October 16, 2022
For programmers like Butterick, who contribute open source code out of a sense of community, stripping any attribution away from their work is a problem:
Arguably, Microsoft is creating a new walled garden that will inhibit programmers from discovering traditional open-source communities. Or at the very least, remove any incentive to do so. Over time, this process will starve these communities. User attention and engagement will be shifted into the walled garden of Copilot and away from the open-source projects themselves—away from their source repos, their issue trackers, their mailing lists, their discussion boards. This shift in energy will be a painful, permanent loss to open source.
You can check out Butterick’s “GitHub Copilot investigation” for more information.