I have made this submission to the community discussion on the Digital Public Goods Alliance (DPGA)'s Proposal: Defining Mandatory Components for AI Systems Seeking Digital Public Good (DPG) Recognition:
Good morning @ricardomiron,
Thank you for the opportunity to comment on this important standard.
We would like to add our voices to this discussion, particularly in strong support of the requirement for data, which we consider to be the ‘source’ of AI systems.
We note that the original author of the Open Source Definition (OSD), and of the Debian Free Software Guidelines (DFSG) on which it was based, also takes this position, arguing that the OSD can be used in its current form to assess the openness of AI systems. The data is required to assess and address security and ethical issues, including fairness and bias, and to enable today's models to serve as the foundation for future generations of models. This is what our industry, and those dependent on it, have come to expect of Open Source over the past quarter century.
We further note that this is necessary but not sufficient to avoid the perpetuation of the status quo where “open AI is highly dependent on the resources of a few large corporate actors [and their paid agents, including the Open Source Initiative (OSI)], who effectively control the AI industry and the research ecology beyond” (Nature: Why ‘open’ AI systems are actually closed, and why this matters).
The OSI released the Open Source AI Definition (OSAID) 1.0 without achieving community consensus, indeed in the face of sustained opposition from ourselves and others (several of whom have already been noted above). Conflicting standards also exist, including the OSD itself and ongoing work by the Free Software Foundation, Debian, and of course the DPGA, all of which require the data. We therefore recommend against referencing the OSI or the OSAID in DPG standards.
For safety and stability, we propose that industry actors instead refer directly to the Open Source Definition (ideally v1.9 specifically), or at least prefer terminology like “OSD-compliant terms” over “OSI-approved license”, unless and until the community achieves clear consensus on a future version; we have recently launched the Open Source Declaration to that effect. The OSD covers all software, while the OSAID conflicts with it for any software that “infers, from the input it receives, how to generate outputs” (i.e., almost all software). It is therefore likely that the OSI’s leadership will attempt to “harmonise” the two definitions when they return from post-launch vacation, no doubt in a fashion that also “differs in significant ways from the views of most software freedom advocates”.
We further note that respected industry experts argue that it is not feasible to apply Open Source to AI, likening the OSI’s OSAID to the failed Tacoma Narrows Bridge, and that it is in any case too soon to do so, the OSD itself having been the culmination of decades of work predating the OSI’s incorporation. For example, at the time of writing, large models have “hit a wall”, the performance of smaller models is rapidly improving, open datasets are being released regularly, and copyright questions are working their way through the courts, all of which trend towards the requirement for openness in data.
While the OSD has proven itself strong on openness over the past quarter century, it is not explicit on the completeness dimension, which has caused problems dating back to id Software’s release of the Quake source code without its data the year after the OSD was launched. The Linux Foundation’s Model Openness Framework (MOF) does achieve a higher standard on completeness, for AI only (and it is yet to be tested), but it fails on openness: its highest class, Class I, accepts data under “any license or unlicensed”, suggesting the need for a Class 0. The OSAID accepts any data or no data, failing on both dimensions. I have discussed this in more detail in Openness vs Completeness: Data Dependencies and Open Source AI.
A better long-term solution may be to bugfix the OSD itself by making completeness explicit for all data, as has been proposed for a potential future version, but our community has no stronger claim to consensus than the OSI (except for the absence of objections).
In any case, this demonstrates that the MOF is no more suitable for referencing than the OSAID. Furthermore, the role of a checklist or framework may come to replace that of an OSD-compliant license, allowing for a spectrum of openness ranging from a minimum acceptable standard (ideally defined by that single existing and proven document) to radically open options. Referencing the specific proposals of select organisations in this context may deprive us of this critical flexibility, a point we would have hoped would emerge from the OSI’s multi-year process (and which suggests they may plan to offer a single checklist, and possibly a centralised certification program, instead of the self-service status quo).
We trust this input will help bolster the case for data as a critical component of any such definition, being the “source”, or symbolic instructions, for AI models. It may be too soon to commit to any checklist for completeness, in which case it may be better to opt for generic tests of reproducibility (an implicit requirement for Open Source, despite claims to the contrary), along the lines of the sketch below.
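To make “generic tests for reproducibility” concrete, one such test might simply rerun the published recipe against the released data and code and compare digests of the result. The following is a minimal sketch only; the recipe, paths, and the assumption of a deterministic build are all hypothetical, not part of any proposed standard:

```python
# Minimal sketch of a generic reproducibility test (hypothetical names/paths):
# rerun the published recipe, then check that the rebuilt artifact matches the
# released one bit-for-bit. Assumes the build/training run is deterministic.
import hashlib
import subprocess
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_reproducible(recipe: list[str], rebuilt: Path, released: Path) -> bool:
    """Run the published recipe and compare rebuilt vs. released digests."""
    subprocess.run(recipe, check=True)
    return sha256(rebuilt) == sha256(released)

if __name__ == "__main__":
    # Hypothetical example: "make train" regenerates the model checkpoint
    # from the released data and code.
    ok = is_reproducible(["make", "train"],
                         Path("build/model.bin"), Path("release/model.bin"))
    print("reproducible" if ok else "NOT reproducible")
```

Bit-for-bit equality is the strictest criterion; a standard could instead accept statistical equivalence on published benchmarks where training is non-deterministic, but either way the test requires the data to be open.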
Sincerely,
Sam Johnston (LinkedIn)
Developer, Debian
Lead Developer, Personal Artificial Intelligence Operating System (pAI-OS)
Board Member, Kwaai Open Source AI Lab