Pigs, Houses & why Machine Learning is Different

Rob Farrow, Head of Engineering at Profusion

Spoiler — if you are sick of machine learning steer away now, but if you’re a fan of pigs, I can (maybe) make it worth your while.

So you’ve probably read all about machine learning (ML) by now and how it’s going to take over the world, take our jobs, automate everything, play video games better than you, steal your girlfriend, bake better cakes than you, have more friends than you, and eventually cause judgement day. It’s not exactly a short list of promises.

I think they’re half right. If anything is sure, it’s progress. Nothing will impede progress for long, and things will get better, smarter and quicker. This is a universal truth, and anyone who says otherwise is denying a truth self-evident. This is the core reason that developers shouldn’t get comfortable: constant adaptation is key.

But do you really understand what machine learning is? I’d argue probably not. I’d argue I don’t really either, yet here I’m going to try and rationalise it, or at least my understanding of it, with the most ridiculous analogy I can think of: ‘Pigs & Houses’. I propose that ‘normal’ software is like a house, and ‘newfangled ML-based software’ is like a pig.

So what is a house?

Old code is like putting bricks together: brick by brick, or function by function, you construct what you need. The ultimate goal is to build a house, so you build the foundation, some walls and the roof, then add some plumbing, electrics and interior walls.

The framework you use is your underlying structure, the foundations are the language itself, and the shape of your construction is your program logic flow. Perhaps you have a function to add numbers together, and that can be your light switch. Maybe you’ve split your house into smaller units, with doors! Like a microservice-based architecture… ok, I’m getting carried away, but you get the point.

I would love to say there is no ambiguity, but there’s always some. Still, people have been building houses for centuries, and we’ve got good at it (ish). There will always be some ambiguity whenever we do anything: what kind of balcony fits here? What kind of stairs do we need? What cupboards are going where?

There will be things you realise you want to add later on, perhaps a car charger, or some extra cupboards, or perhaps this space is unused and you need to modify it. That’s the beauty of it, shit changes when you hit prod, or when you live in it. The first pass will never be perfect, there will always be things to change, and the further down the build you get, the harder those changes become.

So this is old code, it is known, it is procedural and predictable. We’ve done it before, and we’ll do it again. It isn’t going anywhere anytime fast.

Wait — but then — what is a pig?

One of these.

What seems like a ridiculous question really isn’t. When I say new code is like a pig, what I actually mean is that the ambiguity is a lot higher. The inner workings of a pig are not truly known, and were more discovered than they were created.

A pig is a living being, it’s a thing that runs on its own rules, and those rules aren’t fully known. It needs care, it needs inputs, and it certainly has outputs. It can be trained, and it will learn, and of course, it is a function of its environment. Really this could be any animal, but I think pigs are cute.

So machine learning is more like something we discovered, we have fields of research dedicated to understanding it, same as pigs — we have hordes of dedicated people who work on understanding their inner workings (I’m sure there are people who’ve dedicated their entire careers to understanding pigs).

But old code is more like something we’ve made, houses & code, both are systems designed by humans. Of course you can learn it, but it’s not something entirely unknown by people, it’s a man-made system.

This is why machine learning is more like a pig.

When you build a machine learning based system, you don’t really tell it what to do. You tell it what to aim for, but you don’t tell it how. Let’s take a simple machine learning algorithm, perhaps a decision tree style one, that decides whether or not someone is older than 30. Yeah, let’s do some Logan’s Run style machine learning, that sounds fun. For what purpose I’ll leave to your imagination.

Definitely before my time…


Real simple: you manually flag people older than 30. It’s a field (let’s call it target) in some data that says true or false. Literally a column that you add manually. Now you feed it all the data you have on those people, and the idea is that it will try to work out how to generate the answer in the target column you gave it. If your data includes the age of those people, perhaps in a column called age, your decision tree should realise that nothing else you feed it matters, only the age. But you didn’t tell it that, it worked it out.

Now in this ridiculous example, you have a decision tree like so:
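For the curious, here is a minimal sketch of that idea in plain Python. It is a one-split tree (a ‘stump’) rather than a proper decision tree library, and the data, the extra columns (shoe_size, cups_of_tea) and the function names are all made up for illustration. The only point is that nobody tells it that age is the column that matters.

```python
# Hypothetical toy data: every value here is invented for illustration.
people = [
    {"age": 22, "shoe_size": 9, "cups_of_tea": 4, "target": False},
    {"age": 27, "shoe_size": 11, "cups_of_tea": 1, "target": False},
    {"age": 30, "shoe_size": 8, "cups_of_tea": 6, "target": False},
    {"age": 31, "shoe_size": 10, "cups_of_tea": 2, "target": True},
    {"age": 45, "shoe_size": 7, "cups_of_tea": 5, "target": True},
    {"age": 63, "shoe_size": 9, "cups_of_tea": 0, "target": True},
]


def train_stump(rows, target="target"):
    """Try every 'feature > threshold' rule and keep the most accurate one."""
    features = [name for name in rows[0] if name != target]
    best = None  # (correct_count, feature, threshold)
    for feature in features:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds sit halfway between neighbouring values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            correct = sum((row[feature] > threshold) == row[target] for row in rows)
            if best is None or correct > best[0]:
                best = (correct, feature, threshold)
    _, feature, threshold = best
    return feature, threshold


def predict(stump, row):
    """Apply the learned 'feature > threshold' rule to one row."""
    feature, threshold = stump
    return row[feature] > threshold


stump = train_stump(people)
print(f"learned rule: {stump[0]} > {stump[1]}")  # learned rule: age > 30.5
```

On this toy data the search settles on the age column by itself, because no split on shoe size or tea intake reproduces the target column as well.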

So this is the pig. In order to run this algorithm, even if it was pointless, you needed to write some code to build it, in old world code. You needed to feed it some food (data), and you needed to watch it, look after it and ensure it made the right decision.

Is your pig healthy?

So you have a pig, and you have some old world code supporting your pig, perhaps a barn? Your barn is responsible for providing the support to look after the pig, making sure it continually gets enough food, making sure it is learning the right stuff, and its outputs are correct.

I was going to put a picture of pig ‘outputs’ but I think maybe a barn would be better.

Your pig is trained to do a certain task, and you have a little bit of automation that flags up to you if the pig ever stops, or, starts to get it wrong.
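That ‘little bit of automation’ could be as small as the sketch below. Everything in it is an assumption for illustration: the check_pig name, the 0.9 accuracy threshold and the hand-written pig standing in for a trained model. It scores the pig against rows where the right answer is already known, and treats a crash as the pig having stopped.

```python
def check_pig(pig, known_rows, min_accuracy=0.9):
    """Score the pig against rows whose right answer is already known.

    Returns (healthy, accuracy). A pig that crashes counts as unhealthy.
    """
    try:
        correct = sum(pig(row) == row["target"] for row in known_rows)
    except Exception:
        return False, 0.0  # the pig has stopped altogether
    accuracy = correct / len(known_rows)
    return accuracy >= min_accuracy, accuracy


def pig(row):
    # Stand-in for the trained model: the rule the tree would have learned.
    return row["age"] > 30


# Hypothetical rows with known answers, kept aside for health checks.
known_rows = [
    {"age": 25, "target": False},
    {"age": 29, "target": False},
    {"age": 33, "target": True},
    {"age": 58, "target": True},
]

healthy, accuracy = check_pig(pig, known_rows)
print("pig healthy" if healthy else f"ALERT: pig accuracy down to {accuracy:.0%}")
```

In a real barn this would run on a schedule and page someone rather than print, but the shape is the same: known answers in, a yes/no on the pig’s health out.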

PIGLOPS/MLOPS

Now in the real world, data changes, food changes. The point of having a pig is that it can adapt, you don’t just feed it one thing — actually I had to go google what pigs actually eat and the answer is apparently everything. Basically my point here is that data changes over time, so what goes into your machine learning model also changes over time.

So let’s say you get a new engineer in the team. They’re busy working away on the backend of the system, and they decide that you’re now going to be encoding the age of people in base-12. Fantastic, the PMs agree, the C-levels all sign off on it, it goes into Jira and the people rejoice. You sit there shaking your head in disbelief, but you have no power here.

So the change goes in, and everyone’s age is now stored in base-12. The cut-off that used to be 30 in base-10 is now written as 26. Shit.
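If you want to check the arithmetic, 30 is 2 × 12 + 6, which is written 26 in base-12. The sketch below shows that, plus what it does to a pig that still reads the digits as base-10. The function names, and the choice of A and B for the digits ten and eleven, are my assumptions for illustration.

```python
DIGITS = "0123456789AB"


def to_base12(number):
    """Write a non-negative whole number using base-12 digits (A = 10, B = 11)."""
    if number == 0:
        return "0"
    digits = ""
    while number > 0:
        number, remainder = divmod(number, 12)
        digits = DIGITS[remainder] + digits
    return digits


def old_pig(age_field):
    # Trained when ages were base-10, so it reads the digits as base-10.
    return int(age_field) > 30


def retrained_pig(age_field):
    # Same question, but it reads the new base-12 encoding correctly.
    return int(age_field, 12) > 30


print(to_base12(30))  # 26

stored = to_base12(31)  # a 31 year old is now stored as "27"
print(old_pig(stored), retrained_pig(stored))  # False True
```

It gets worse for some ages: a 35 year old is stored as 2B, which the old pig cannot read as a number at all, so it falls over rather than just being wrong.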

Now your pig is wrong.

So what do you do?

The answer is obvious. You cryogenically freeze your pig, and get a new one and retrain it. You essentially ‘save’ your old pig, but stop using it because it’s wrong, right? You get a new one in, train it on the new number 26, and sub that one into your barn.

Why bother? Why not just send the pig to the butchers? What if you want to know a decision your old pig has previously made? In fact, if your data (food) changes a lot, perhaps you want to know what decision would have been made at any point in time?
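One way to picture the freezer is a small registry that remembers which pig was in the barn at which time. This is only a sketch under my own assumptions: the PigRegistry class, the version names and the dates are all invented, and a real setup would store the trained model files rather than little functions.

```python
from bisect import bisect_right
from datetime import datetime


class PigRegistry:
    """Keeps every pig that was ever deployed, with the time it went live."""

    def __init__(self):
        self._versions = []  # (deployed_at, name, model), in deployment order

    def deploy(self, deployed_at, name, model):
        # Assumes pigs are registered in chronological order.
        self._versions.append((deployed_at, name, model))

    def model_at(self, when):
        """Return (name, model) for the pig that was live at the given time."""
        times = [version[0] for version in self._versions]
        index = bisect_right(times, when) - 1
        if index < 0:
            raise LookupError("no pig was deployed yet at that time")
        return self._versions[index][1:]


registry = PigRegistry()
# Hypothetical deployment dates, before and after the base-12 change.
registry.deploy(datetime(2023, 1, 1), "pig-v1", lambda field: int(field) > 30)
registry.deploy(datetime(2023, 6, 1), "pig-v2", lambda field: int(field, 12) > 30)

# Which pig made the call in March, and what would it have said about "42"?
name, model = registry.model_at(datetime(2023, 3, 15))
print(name, model("42"))  # pig-v1 True
```

Because the old pig is frozen rather than butchered, you can still ask it what it would have decided for any input it saw while it was live.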

This is super important if you’re using the model in a way that materially changes people’s lives, like deciding whether someone gets a loan, or deciding whether the person in question will be reincarnated for another hedonistic lifecycle in the run.

It doesn’t really matter quite so much if you’re trying to work out whether to promote Quavers or Monster Munch in an email marketing campaign though.

This is machine learning. You need to not only run your model in production, but also monitor it against known outputs, flag up and retrain when things deteriorate, and version it when things do change, so you have an auditable record of what happened when and, just maybe, why it made those decisions.

It represents a fundamental paradigm shift in how we talk about designing solutions. Not everything can be done with machine learning, but it’s great for certain classes of problem. Like the pig, its inner workings are quite hard to work out and truly understand. Not impossible, just very difficult.

My inspiration for this post was this article: Software 2.0. I heartily recommend a read!
