Modify/extend OSD2 (Source Code) for better clarity

Tobias · November 9, 2024, 2:54pm

I think this is a very nice amendment and covers the intended cases (including open source AI) well.

However, even though as programmers we permanently function in DRY mode, I think the message would be clearer if we repeated ‘2. Source Code’ in a data variant (‘2b. Data’), rather than forcing ourselves to read the entire ‘2. Source Code’ with the insertion in the introduction at the back of our minds, imagining what it would mean for data.

I made a draft below - feel free to shoot at it (or disagree with the less pretty, less clever approach of ‘duplicating’ content specifically focusing on the data).

2b. Data

In cases where software relies on data — including databases, models, or media — for its creation, modification, or operation, the program needs to include the data, and must allow distribution in the original form of the data in which the program relies on it (‘source data’) as well as distribution in derived forms. Where some form of a product is not distributed with source data, there must be a well-publicized means of obtaining the source data for no more than a reasonable reproduction cost, preferably downloading from the Internet without charge. The source data must be the preferred form in which a programmer would modify the data the program relies on. Deliberately obfuscated source data is not allowed.

giacomo · November 9, 2024, 4:13pm

Hi Tobias, one of the first options @samj and I discussed was really similar to your proposal.

The problem with that was how to get the wording right as subtle variations might lead to open washing AIs.

Here is the problem over which Sam and I brainstormed for several nights: weights are data.

They are data algorithmically derived from other data (the source data), just as a binary is algorithmically derived from the source code.

When you realize that an inference engine is nothing more than a programmable virtual machine with a specific architecture (the topology of the "artificial neural network") that executes the weights, most of the technical and legal issues around “artificial intelligence” simply vanish (for better or worse).

Unfortunately, many powerful corporations fear such a simple ground truth for its legal implications, so we are in a situation similar to Galileo's, arguing for a heliocentric model of our solar system in a world dominated by Ptolemaics.

So we cannot rely on people being able to distinguish weights from source data when a Meta argues that it actually complies with the definition by simply providing the weights.

So here is my challenge for you: try to reword your proposal in a way that is still general (it applies to all kinds of data, from the maze maps of a dungeon game and their editable source to the weights of an AI and their full source datasets) and explicit (it clearly distinguishes the compiled form from the source form).

It’s a tough game, I know, as I played it with Sam and almost earned his eternal hate.

However, it’s important to play it by the rules: I’ll be the evil corporate lawyer trying to argue that something less than, or something different from, the full set of the data that influenced the final weights qualifies as open source under your proposal.

Your mission is to write something that no one could possibly use to open wash something that is not open.

As for your current proposal, weights in a CSV are enough.

Also “original form” is a new concept that could cause ambiguities without really solving the problem.

For example, I as a Meta lawyer would argue that the memory dump of the GPUs after training, together with a tool able to turn the dump into the CSV weights, matches such a definition.

Tobias · November 10, 2024, 9:56am

Challenge accepted. First humble proposal below, and no hatred so far:

In cases where software relies on data — including databases, models, or media — for its creation, modification, or operation, the program needs to include the data. Whenever the data included is the result of applying code to data, the data on which the code was applied needs to be included as well and this principle applies recursively. The set of all data included following this principle is collectively called the source data of a program. Where some form of a product is not distributed with the source data, there must be a well-publicized means of obtaining the source data for no more than a reasonable reproduction cost, preferably downloading from the Internet without charge. The source data must be the preferred form in which a programmer would modify the data the program relies on. Deliberately obfuscated source data is not allowed.
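To make the recursive rule easier to attack, here is a minimal Python sketch of how I read it. The artifact names and the `derived_from` mapping are hypothetical, invented only for illustration; the function just computes the transitive closure that the paragraph calls the source data.

```python
# Hypothetical provenance graph: each artifact maps to the data it was
# algorithmically derived from. An empty list means "not derived".
derived_from = {
    "weights.bin": ["tokenized_corpus", "hyperparameters.yaml"],
    "tokenized_corpus": ["raw_corpus"],
    "raw_corpus": [],
    "hyperparameters.yaml": [],
}


def source_data(included, derived_from):
    """Return every artifact reachable from `included` by following
    'was derived from' links, i.e. the recursive closure."""
    closure = set()
    stack = list(included)
    while stack:
        item = stack.pop()
        if item in closure:
            continue  # already handled; also guards against cycles
        closure.add(item)
        stack.extend(derived_from.get(item, []))
    return closure


print(sorted(source_data(["weights.bin"], derived_from)))
```

Under this reading, shipping only `weights.bin` is not enough: everything upstream of it is pulled in as well.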

giacomo · November 11, 2024, 6:35am

I like this formulation a lot.

Some possible improvements might be

s/programmer/practitioner/ as suggested by @samj in another thread
s/applying code to data/applying algorithmic transformation to data/ because it’s not yet well established that every piece of code (just like every piece of data) is executable by an infinite set of programmable machines
s/the program needs to include the data/that data is considered integral to the program/ because “needs” sounds a bit more vague

OSI board lawyer MODE ON

According to this recursive definition an AI builder would need to share all the data, including the temporary GPU states that led to the weights! It’s unreasonable! It would inhibit open source innovation for the poor developers at Meta or OpenAI?

OSI board lawyer MODE OFF

The point here is that while such details would make it trivial to verify and exactly reproduce the process, most of that data is in fact algorithmically derived from the data used in the previous step. So we should find a way to exclude from the source data any intermediate steps that are exactly and completely derivable from the previous steps in the processing.
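One possible way to express that exclusion, as a rough Python sketch: the artifact names and the `exactly_derivable` flag are hypothetical, and whether a given training step really is exactly reproducible is an assumption baked into the example, not a claim. It also presumes the transformation code itself is available.

```python
# Hypothetical artifacts: name -> (inputs, exactly_derivable).
# The flag is an assumption meaning "can be regenerated completely and
# exactly from its inputs by an available algorithmic transformation".
artifacts = {
    "weights.bin": (["gpu_state_step_2"], True),
    "gpu_state_step_2": (["gpu_state_step_1"], True),
    "gpu_state_step_1": (["training_set", "random_seed"], True),
    "training_set": (["raw_scrape"], False),  # e.g. manual curation
    "raw_scrape": ([], False),
    "random_seed": ([], False),
}


def minimal_source_data(included, artifacts):
    """Walk the provenance graph and keep only the artifacts that cannot
    be regenerated exactly from their own inputs."""
    required, seen, stack = set(), set(), list(included)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        inputs, exactly_derivable = artifacts[name]
        if not (inputs and exactly_derivable):
            required.add(name)  # cannot be rebuilt, so it must be shared
        stack.extend(inputs)  # upstream data is always examined
    return required


print(sorted(minimal_source_data(["weights.bin"], artifacts)))
```

In this toy graph the temporary GPU states drop out, while the raw scrape, the curated training set, and the seed all remain part of the source data.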

samj · November 11, 2024, 7:28am

@Tobias: Also worth mentioning that a design goal for @giacomo and I given the politics of the day was to minimise at all costs the additions/modifications to the OSD in order to maximise the chance of acceptance — every word should be “load bearing”. I’m not 100% happy with what we came up with, and it will no doubt need to be lawyered to death in due course, but it’s more meant as a proof of concept.

We could consider a longer/different text in future — I’d prefer OSD 2 be renamed from “Source Code” to “Source” for example and be written more generically to cover both data and code (as a type of data), and the repetitiveness of your proposal doesn’t sit well with me (per DRY).

Changes to the tried-and-tested OSD will be like a constitutional amendment though and subject to a very high bar of community consensus and support. Minimalism is key and we don’t want to break anything or scare the chickens.

Also, we should be using objective terminology like “actual form” rather than the subjective term “preferred form” per my proposal. This should not be necessary but it created a vulnerability they abused for the OSAID to totally redefine what it refers to (e.g. data information aka metadata rather than data).

giacomo · November 11, 2024, 9:07am

Moved to a new thread so that we can discuss the two proposals separately, without confusion (@Tobias feel free to update the title I made up)

giacomo · November 11, 2024, 9:20am

@samj to be honest I like Tobias's proposal a lot. It still needs polish for sure, but it is a valid alternative to our carefully crafted change to the introduction, and worth exploring.

@Tobias the political question that Sam raised is deep: the Open Source Definition is not a legally binding text. OSI has no legal or moral standing to impose any change to the OSD on developers, just like us.

The Open Source Definition can only stand on its own merits, that is how much creators (developers, artists, data scientists and so on) decide to share on terms that comply with it.

The minimalist approach is designed under the hypothesis that open source developers are going to prefer a minimal change over a larger one.
However, that hypothesis might prove wrong; who knows?

So I think your proposal deserves further exploration.

Tobias · November 11, 2024, 4:35pm

I agree with a minimalist approach, but to misquote Einstein: the changes should be as simple as possible but not simpler. The Minimalist amendment to cover data for completeness captures well what we want to achieve, but it also intends to change the meaning of Article 2 implicitly and IMHO leaves too much ambiguity and room for interpretation on the meaning of Article 2 for data specifically.

It gives the feeling of being an ‘unfinished’ change and if I put myself in the shoes of a reader assessing the new definition I have a feeling like ‘you can ask me to read this improved definition, but if you make a change, then please make a real change and not a change that only goes halfway’. In other words the change can also be too minimal to be satisfying.

The approach I would advocate would be minimalist in the sense that it does not interfere with the existing articles (in order not to ‘scare’), but only inserts additional language to tackle the ‘data issue’: an introductory comment to make the general principle clear and/or a dedicated Article to deal with the specificities of data. In my opinion it is acceptable to have both the additional introduction to set the stage and the new Article, but if that is too bothersome for DRY reasons, the Article can also stand on its own as indicated by @giacomo (by calling it an alternative).

samj · November 17, 2024, 5:54am

My original suggestion actually modified OSD 2 directly, which to your point is likely the better approach, but I don’t like lists as there will invariably be things you leave off them (which is one of my concerns about the current proposal — a point @giacomo and I discussed at length).

Really the Open Source Definition should be generic enough to cover all use cases, but specific enough to be meaningful. For example, by using source instead of source code we can pick up data as well as code, but what about hardware (which was raised this week in another conversation)?

The devil’s in the details, but those details belong in the licenses, which could cover code, data, hardware, etc. as required. This would not expand the scope of Open Source in terms of what is expected (i.e., the ~~preferred~~actual form a ~~programmer~~practitioner would require to use, study, modify, and share a system), but it would expand its applicability into adjacent fields.

We do need to keep in mind what is nice to have (i.e., a refactored document) and what is actually achievable (i.e., a bugfixed document), and with the OSI off on its own crusade to water down the meaning of Open Source for AI with the OSAID, we can’t rely on them to drive consensus until that definition is repealed and the board replaced. I’m convinced that will happen now that Debian/FSF/SFC/etc. have rejected it, but there is still the danger that they will succeed in making Open Source meaningless, and that they will run interference on any attempts to modernise the definition without their blessing. That is to say, for now at least I think we need to confine ourselves to changes that are a no-brainer for anyone (including their own membership) to accept and support. I’m not sure if that means we need 2 WIPs, one for fixes and one refactored, but I think we at least need to focus on minimalist fixes until this self-induced storm passes.