Deep learning and free software

By Jake Edge

July 18, 2018

Deep-learning applications typically rely on a trained neural net to accomplish their goal (e.g. photo recognition, automatic translation, or playing go). That neural net uses what is essentially a large collection of weighting numbers that have been empirically determined as part of its training (which generally uses a huge set of training data). A free-software application could use those weights, but there are a number of barriers for users who might want to tweak them for various reasons. A discussion on the debian-devel mailing list recently looked at whether these deep-learning applications can ever truly be considered "free" (as in freedom) because of these pre-computed weights—and the difficulties inherent in changing them.
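As a rough illustration of why such weights are hard to tweak by hand, the sketch below trains a single artificial neuron, which is far smaller than any real deep-learning model and is not taken from the debian-devel discussion. The names (`train`, `predict`, `AND_DATA`, `OR_DATA`) are hypothetical and chosen only for this example. It shows that the "weights" are just numbers produced from training data, and that the practical way to change the behavior is to train further with new data rather than to edit the numbers directly.

```python
# Toy perceptron: an assumption-laden sketch, not any project's real code.

def predict(weights, inputs):
    """Return 1 if the weighted sum of the inputs plus the bias is positive."""
    x1, x2 = inputs
    w1, w2, bias = weights
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0


def train(data, weights=None, epochs=100):
    """Fit [w1, w2, bias] to data with the classic perceptron update rule.

    Passing existing weights continues training from them instead of
    starting over from zeros.
    """
    weights = [0, 0, 0] if weights is None else list(weights)
    for _ in range(epochs):
        for (x1, x2), target in data:
            error = target - predict(weights, (x1, x2))
            weights[0] += error * x1
            weights[1] += error * x2
            weights[2] += error
    return weights


# Hypothetical training sets: the truth tables for logical AND and OR.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

# "Pretrained" weights: all a user would receive without the training data.
and_weights = train(AND_DATA)

# Changing the behavior means training again, which needs data and compute.
or_weights = train(OR_DATA, weights=and_weights)

print("AND weights:", and_weights)
print("OR weights: ", or_weights)
```

With three weights, a person could still work out what the numbers mean. A deep network has millions of them, so the training data and the training run are, in practice, the only handle on its behavior.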

The conversation was started by Zhou Mo ("Lumin"); he is concerned that, even if deep-learning application projects release the weights under a free license, there are questions about how much freedom that really provides. In particular, he noted that training these networks is done using NVIDIA's proprietary cuDNN library, which only runs on NVIDIA hardware.

Even if upstream releases their pretrained model under GPL license, the freedom to modify, research, reproduce the neural networks, especially "very deep" neural networks is de facto [controlled] by PROPRIETARIES.

While it might be possible to train (or retrain) these networks using only free software, it is prohibitively expensive in terms of CPU time to do so, he said. So, he asked: "Is GPL-[licensed] pretrained neural network REALLY FREE? Is it really DFSG-compatible?" Jonas Smedegaard did not think the "100x slower" argument held much water in terms of free-software licensing. Once Mo had clarified some of his thinking, Smedegaard said:

I therefore believe there is no license violation, as long as the code is _possible_ to compile without non-free code (e.g. blobs to activate GPUs) - even if ridiculously expensive in either time or hardware.

He did note that if rebuilding the neural network data was required for releases, there was a practical problem: blocking the build for, say, 100 years would not really be possible. That stretches way beyond even Debian's relatively slow release pace. Theodore Y. Ts'o likened the situation to that of e2fsprogs, which distributes the output from autoconf as well as the input for it; many distributions will simply use the output, as newer versions of autoconf may not generate it correctly.

Ian Jackson strongly stated that GPL-licensed neural networks were not truly free, nor are they DFSG-compatible in his opinion:

In fact, they are probably not redistributable unless all the training data is supplied, since the GPL's definition of "source code" is the "preferred form for modification". For a pretrained neural network that is the training data.

But there may be other data sets that have similar properties, Russ Allbery said in something of a thought experiment. He hypothesized about a database of astronomical objects where the end product is derived from a huge data set of observations using lots of computation, but the analysis code and perhaps some of the observations are not released. He pointed to genome data as another possible area where this might come up. He wondered whether that kind of data would be compatible with the DFSG. "For a lot of scientific data, reproducing a result data set is not trivial and the concept of 'source' is pretty murky."

Jackson sees things differently, however. The hypothetical NASA database can be changed as needed or wanted, but the weightings of a neural network are not even remotely transparent:

Compare neural networks: a user who uses a pre-trained neural network is subordinated to the people who prepared its training data and set up the training runs.

If the user does not like the results given by the neural network, it is not sensibly possible to diagnose and remedy the problem by modifying the weighting tables directly. The user is rendered helpless.

If training data and training software is not provided, they cannot retrain the network even if they choose to buy or rent the hardware.

That argument convinced Allbery, but Russell Stuart dug a little deeper. He noted that the package that Mo mentioned in his initial message, leela-zero, is a reimplementation of the AlphaGo Zero program that has learned to play go at a level beyond that of the best humans. Stuart said that Debian already accepts chess, backgammon, and go programs that he probably could not sensibly modify even if he completely understood the code.

[...] Debian rejecting the example networks as they "aren't DFSG" free would be a mistake. I view one of our roles as advancing free software, all free software. Rejecting some software because we humans don't understand it doesn't match that goal.

Allbery noted that GNU Backgammon (which he packages for Debian) was built in a similar way to AlphaGo Zero: training a neural network by playing against itself. He thinks the file of weighting information is a reasonable thing to distribute:

I think it's the preferred form of modification in this case because upstream does not have, so far as I know, any special data set or additional information or resources beyond what's included in the source package. They would make any changes exactly the same way any user of the package would: instantiating the net and further training it, or starting over and training a new network.

However, Luo Ximin (who filed the "intent to package" (ITP) bug report for adding leela-zero to Debian) pointed out that there is no weight file that comes with leela-zero. There are efforts to generate such a file in a distributed manner among interested users.

So the source code for everything is in fact FOSS, it's just the fact that the compilation/"training" process can't be run by individuals or small non-profit orgs easily. For the purposes of DFSG packaging everything's fine, we don't distribute any weights as part of Debian, and upstream does not distribute that as part of the FOSS software either. This is not ideal but is the best we can do for now.

He is clearly a bit irritated by the DFSG-suitability question, at least with regard to leela-zero, but it is an important question to (eventually) settle. Deep learning will clearly become more prevalent over time, for good or ill (and Jackson made several points about the ethical problems that can stem from it). How these applications and data sets will be handled by Debian (and other distributions) will have to be worked out, sooner or later.

A separate kind of license for these data sets (training or pre-trained weights), as the Linux Foundation has been working on with the Community Data License Agreement, may help a bit, but won't be any kind of panacea. The license doesn't really change the fundamental computing resources needed to use a covered data set, for example. It is going to come down to a question of what a truly free deep-learning application looks like and what, if anything, users can do to modify it. The application of huge computing resources to problems that have long bedeviled computer scientists is certainly a boon in some areas, but it would seem to be leading away from the democratization of software to a certain extent.
