Minimalist amendment to cover data for completeness

giacomo · October 23, 2024, 8:46pm

What is the smallest change that could be made to OSD to grant the four freedoms to users on data as well?

Reasoning about that, @samj and I realized that a single sentence, added to the OSD introduction, may produce this effect without many issues.

In cases where software relies on data—including databases, models, or media—for its creation, modification, or operation, that data is considered integral to the program and is subject to the same requirements.

Everything else is left untouched, so when the trigger condition does not apply (the software does not rely on data for its creation, modification or operation), nothing changes.

When the software relies on data, such data is considered integral to the program and is subject to the same requirements.
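The trigger condition can be sketched as a small predicate. This is only an illustration of the rule as worded above: the `DataDependency` class, its field names and the `integral_data` helper are hypothetical and are not part of the OSD or of the proposed sentence.

```python
from dataclasses import dataclass


@dataclass
class DataDependency:
    """A piece of data the software may rely on (hypothetical model)."""
    name: str
    used_for_creation: bool = False
    used_for_modification: bool = False
    used_for_operation: bool = False

    def triggers_amendment(self) -> bool:
        # The proposed sentence applies when the software relies on the data
        # for its creation, modification, or operation.
        return (
            self.used_for_creation
            or self.used_for_modification
            or self.used_for_operation
        )


def integral_data(dependencies):
    """Return the names of the data considered integral to the program."""
    return [d.name for d in dependencies if d.triggers_amendment()]


deps = [
    DataDependency("training_dataset", used_for_creation=True),
    DataDependency("game_sprites", used_for_operation=True),
    DataDependency("unrelated_notes"),
]
print(integral_data(deps))
```

Data that meets none of the three conditions is simply left out of the result, which mirrors the "nothing changes" case.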

With data being “integral to the program”, all the people that create, select, annotate or otherwise process the data to create the program count among the programmers, whether they are pixel artists, musicians, data annotators, data scientists or others.

With data being “subject to the same requirements”, the rest of the OSD applies to the data just as it applies to the rest of the program.

For example, OSD § 2. Source Code states that

The source code must be the preferred form in which a programmer would modify the program.

So depending on which part of the program we are talking about (code, media, AI weights…), the form preferred by the respective programmer (coder, media artist, data scientist…) for modifying the program will change, but all of them must grant users the four freedoms.

This way we can literally apply the well-known and battle-tested Open Source Definition to new kinds of software, such as AI systems.

Will this cover databases?

Yes.

An open source database is a database released under terms and in a way that grant every user the four freedoms. In practice, this means that the data must be shared in the preferred form for studying and modifying them, and that their terms of distribution must match the Open Source Definition.

A well-known example is OpenStreetMap.

Will this cover media (audio, video, and images)?

Yes.

Open source content is content released under terms and in a way that grant every user the four freedoms. In practice, this means that the content must be shared in the preferred form for studying and modifying it, and that its terms of distribution must match the Open Source Definition.

For example, if a video game includes a 3D short movie (the program), the corresponding Blender models may be the preferred form in which a 3D artist (the programmer) would modify the video, and they will have to be shared under OSD-compliant terms.

Several well-known licenses already grant users the four freedoms, such as CC0, CC BY, CC BY-SA, the Free Art License 1.3 and so on.

Will this cover AI (models, weights, parameters…)?

Yes.

An open source AI is an artificial intelligence whose model, weights, parameters, and configurations are released under terms and in a way that grant to every user the four freedoms. In practice, this means that the data that influenced its runtime behavior (including the training and the cross-validation datasets) must be shared in the preferred form to study and to modify them, and that their terms of distribution must match the Open Source Definition.

For example, the preferred form in which a data scientist (the programmer) would modify a large language model (the program), to reduce the size of the vocabulary, includes the whole training dataset and, if used, the cross-validation dataset: all the texts, the annotations, the random values used in the process, and so on…

It is important to note that OSD § 2 states that

Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

In the context of AI, this rules out synthetic data or partial datasets.
The source code of an AI is the set of data effectively used to create it, not a different one. No limitation to the freedom to study or to the freedom to modify is allowed.

Anonymized data are allowed, but only if they were used in such form to train the AI in the first place.
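One way to make "the set of data effectively used" checkable is to compare the published dataset against digests recorded at training time. The following is a minimal sketch under that assumption: the manifest of SHA-256 digests and the function names are hypothetical, and nothing in the proposal requires this particular mechanism.

```python
import hashlib


def digest(content: bytes) -> str:
    """SHA-256 digest of a file's content, as a hex string."""
    return hashlib.sha256(content).hexdigest()


def matches_training_manifest(published: dict, manifest: dict) -> bool:
    """Check that the published files are exactly the ones used for training.

    `published` maps file names to their bytes; `manifest` maps file names
    to the digests recorded when the model was trained (an assumption).
    """
    # A partial or extended dataset has a different set of file names.
    if published.keys() != manifest.keys():
        return False
    # A substituted (e.g. synthetic) file has a different digest.
    return all(digest(published[name]) == manifest[name] for name in manifest)


# Hypothetical manifest recorded at training time.
manifest = {
    "corpus.txt": digest(b"the original training text"),
    "labels.csv": digest(b"id,label\n1,spam\n"),
}
```

A partial dataset fails the name comparison and a replaced file fails the digest comparison, which corresponds to the two cases ruled out above.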

Will this cover cryptographic keys?

Developers use cryptography to authenticate who distributes a certain file and to verify its integrity.

Whenever a public key is needed to verify the integrity or provenance of software or the preferred form of studying or modifying the program, it must be available to users, to grant the freedom to study.

However, users never need your private keys to use, study, modify or distribute a program (otherwise you are sitting on a security wormhole to hell), so private keys don’t need to be shared.

In any case, if users need the programmers’ private keys to exercise any of the four freedoms, the software is not open source.

Will this cover… X?

As strange as your use case is, I think it is worth trying to see if this proposal can work to ensure the four freedoms for users.

So please brainstorm together the pros and cons.

rettichschnidi · October 27, 2024, 8:47pm

This might be too broad. How would one exclude metadata, e.g. build timestamps, from being “considered integral”?

giacomo · October 27, 2024, 9:08pm

This is a great objection!

However, note that this is a definition, not a license or a contract: it’s descriptive, not prescriptive.
No judge will ever order anyone to share timestamps because of it.

And I can’t foresee a competent developer arguing that metadata such as build timestamps might belong to the preferred form to study or to modify the software.

Yet if, in some unusual operating environment, studying or modifying the software effectively requires build timestamps, those timestamps should be shared, I think.

Anyway, any suggestion to improve the wording?

giacomo · November 11, 2024, 9:05am

5 posts were split to a new topic: Modify/extend OSD2 (Source Code) for better clarity