Math with agency

I’m struggling a bit with what data is. :slight_smile:

Say that a statistics bureau publishes a dataset consisting of summary statistics (means, maxima, minima, medians, etc.) derived from personal data on citizens, where the summary statistics are computed in such a way that the privacy of individuals cannot be compromised. Then it seems pretty clear that, if everything else is in place (licenses, etc.), the resulting dataset would be open data.

We could take that a slight step further: apply least squares to the data and get a slope and an intercept. Then we have a linear model. Both the summary statistics and the linear model are works, possibly derived works, but works nonetheless. And they would be data.
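To make that concrete, a minimal sketch (with made-up numbers, just for illustration): the entire published artefact of such a model is two floating-point numbers.

```python
import numpy as np

# Made-up observations
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0])
y = np.array([3.1, 5.2, 6.8, 11.1, 17.0, 26.9])

# Ordinary least squares fit of a straight line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f"model: y = {slope:.3f} * x + {intercept:.3f}")
```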

Then, we take a big jump to a deep neural network model. Conceptually, it has much in common with the very simple linear model, but I’m looking for exactly what sets it apart.

As long as it is just math, it isn’t terribly interesting, because, to paraphrase Bruce Schneier way out of context, math doesn’t have agency; code has agency.

Somehow, there’s something about the DNN that gives it agency, to the extent that it can’t be trusted unless we have access to the training data, because we know that it can have adverse consequences.

This has been bothering me for some time, because of a feeling that there’s something important here that I don’t understand, something that wasn’t answered in the OSAID process (which indeed left many questions open).

Is there something around the agency that is somehow given to this math that would need to be explicitly addressed in the OSD?

A simple linear regression also cannot be trusted: you need the dataset to understand the model, and including or removing a single point can have large consequences. See e.g. Anscombe's quartet - Wikipedia
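To illustrate that point (my own sketch, not from the original post): all four Anscombe datasets produce essentially the same slope, intercept and correlation, so the published parameters alone cannot tell you which of four very different datasets you are looking at.

```python
import numpy as np

# Anscombe's quartet: four datasets with (almost) identical summary statistics
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    # All four print roughly: slope 0.50, intercept 3.00, r 0.82
    print(f"{name}: slope={slope:.2f}  intercept={intercept:.2f}  r={r:.2f}")
```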


Oh yes, a linear model can also be wrong in all sorts of ways, but that doesn’t necessarily make it not-open.

Thanks, Kjetil, for pursuing the discussion.

To quote George Box: “All models are wrong, but some are useful.” The point here is not necessarily that the linear model is wrong (although you would not be able to prove it without the data), but that purely on the basis of the parameters I have only a very partial understanding of the model and almost no way to study or modify it. I have no clue, for example, about the uncertainty of the parameters, the goodness of fit, etc. Only if I have the data can I study the model, verify its assumptions, and investigate whether there is any bias in the dataset (and reflected in the parameters).

Nor can I just modify the parameters out of the blue to get a fit that is more robust to outliers: if I want to modify the model, I need the data and a refit using a different technique. Without the data I would not even be able to notice that there are outliers.
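A small sketch of that last point (my own illustration, with made-up data): with only the published slope and intercept you cannot see the outlier, let alone refit with a more robust estimator; with the data you can do both. The Theil–Sen estimator from SciPy is used here purely as one example of “a different technique”.

```python
import numpy as np
from scipy.stats import theilslopes

# Made-up data following y = 2x + 1, with one gross outlier at the end
x = np.arange(10, dtype=float)
noise = np.array([0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1, 25.0])
y = 2.0 * x + 1.0 + noise

# What gets published: just the OLS parameters, dragged towards the outlier
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)

# What you can only do with the data: inspect residuals and refit robustly
residuals = y - (ols_slope * x + ols_intercept)
robust_slope, robust_intercept, _, _ = theilslopes(y, x)

print(f"OLS:       slope={ols_slope:.2f}  intercept={ols_intercept:.2f}")
print(f"Theil-Sen: slope={robust_slope:.2f}  intercept={robust_intercept:.2f}")
print(f"largest |residual|: {np.max(np.abs(residuals)):.1f}")  # the outlier shows up here
```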

One important difference compared to neural networks, though, is that the parameters of a linear model can still be interpreted, whereas the parameters of neural networks generally speaking have no ‘meaning’.

Yes, thank you @Tobias. I fully agree with what you say. I’m trying to explore what I think is an orthogonal concern.

However manipulative the publication of a linear model might be, it couldn’t insert a backdoor in executable code. You could be manipulated into inferring things that are wrong, but surely there is a level of agency here that is conceptually different between the linear model and an ML model?

I have very limited experience with inserting backdoors :slight_smile: but let me try to address this question.

When you fit a model (statistical or ML), you will typically write code in a data science language (R, Python) to fit it. If you intend to share the model with the world for deployment, one option is to serialize your ‘model object’ in a binary format so others can start from it. These serialized binary objects can use language features (e.g. promises in R) to run arbitrary code upon deserializing / loading the serialized object into memory, and this arbitrary code can do more or less whatever you want it to do (including retrieving files, inserting backdoors, etc.).

R
https://www.cve.org/CVERecord?id=CVE-2024-27322
https://blog.r-project.org/2024/05/10/statement-on-cve-2024-27322/index.html

Python
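As an illustration of the Python case (a minimal, deliberately harmless sketch, not from the original post): pickle lets an object define __reduce__, so arbitrary code runs the moment someone loads the shared “model object”.

```python
import os
import pickle

class NotActuallyAModel:
    # __reduce__ tells pickle how to rebuild the object on load;
    # here it is abused to run a shell command instead.
    def __reduce__(self):
        return (os.system, ("echo arbitrary code executed while loading the model",))

blob = pickle.dumps(NotActuallyAModel())

# Anyone who merely deserializes the blob runs the command
pickle.loads(blob)
```

This is why the Python pickle documentation, like the R statement above, warns against loading serialized objects from untrusted sources.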

Thank you both for your contributions and welcome to the open Open Source community :hugs:

I believe the agency comes from the code that “animates” what Simon Wardley refers to collectively as the symbolic instructions:

Either ALL the symbolic instructions are open (i.e. the code, the training data etc) OR it’s not open. OSI should be standing its ground here.

I also believe that applies regardless of the number of dimensions involved, from a simple “alarm if temp > 100 degrees” type of AI (0D?) to an income vs experience model (1D), all the way to an LLM operating in a massive high-dimensional space. If I’m dead because my fire alarm didn’t go off or because an LLM didn’t detect a fatal disease, then I’m not bothered by the number of dimensions, and I don’t think we need to differentiate to answer the question.

True, but it’s a spectrum, and I’m sure there are simple cases that are not readily interpretable either, so again I’m not sure this is a useful point of differentiation. I’d argue that in all cases you need the data regardless.

Backdoors deserve their own thread as they’re a serious issue enabled or at least exacerbated by the lack of data too.

Sam


Fully agreed, the difference does not matter. It is just that the less sense people can make of the parameters (and the more parameters there are), the more they will start ascribing ‘magical properties’ to the models, as if the models were more than the result of applying code to data.
