Sunday, May 13, 2018

Data Structures

Hi folks, I haven't posted any new projects for a while, and I've got a few waiting!
What I'd like to write about today, is my set of data structures.  The big idea here is that concepts are easier to explain and understand when you boil them down to their essence, and make them as concrete as possible - it forces you to think about what you really mean by them.  And because of that, it also makes it easier to explain to other people - so these are designed to be educational toys.
At the same time, data structures are a key part of computer science.  Writing good algorithms usually comes down to how you structure the data they need to work on, and good algorithms allow us to solve problems cleanly, elegantly, and fast.  They're a key enabler for how computers have changed the world, and they will only become more important over time.

The idea of the pieces is that each structure has a starting point (the purple ring), a number of arrows (which act as references), and a number of brown cups.  The "data" is represented by the plastic balls, which can be put in, taken out and moved around.  These are removable, which shows that the structure doesn't depend on the data objects themself, it just organises them.
So, the structures I've made, and what they tell us: 

Linked list


One thing you definitely need to do with a set of data is look through it.  Linked lists are very good if you just want to look through all the objects you've got in order.  You start at the initial point, and follow the arrows to look through the items.  This is great, and very flexible, and makes it easy to add new items to the list.  What it's very bad at is if you want to find a specific object - then you might need to look through the whole list to find it.
So this is good if you're going to need to add and remove objects a lot, and if you mostly care about looking through them in order.

Array

So next up we have an array!  This time you have a rigid structure, with references at fixed intervals along it.  This means you can still look through the objects in order easily, but if you want to go to a specific entry (say the fourth), you know how far to go down the rod to find it (four steps), so you can go to that location in one step, without having to go through each of the other entries.  The down side is that if you want to add new entries, you might need to do a lot of work moving other entries around, and of course the array is a fixed size so if it's too short you'll just have to make a new one (and if it's too long, it's just wasted space.
So this is good if you've got a set of things which won't change much, and you want to pick out specific entries.

 Binary tree

But we can do better!  Suppose our data items come with some kind of order - here I've ordered the balls by colour, red to purple.  If we want to find a specific one now, we can start at the top and at each step ask, is the current one our item?  If not, is it further towards red than ours?  If so, go left.  Or is it further towards purple?  In that case, go right.  Clearly that's more complicated, but you can see that here we've got 8 objects and can find any of them in at most 3 steps.
I wanted to get across too that balance is difficult for binary trees.  Not all the branches have the same length, and you can see the one on the left only has one branch - this points to the problem you have with adding and removing entries in a tree, that you might have to do some extra work rebalancing the tree, or you might accept some inefficiency if you let it get out of hand.

(One structure I'd like to make in future is a red-black tree, to tell the next step in this story.  They're a wonderful concept, and already beautifully graphic!)

Hash table


The last of my data structures for now is a hash table - the idea now is that if you have a kind of summary of your data object (a hash function, here I'm using the colour of the ball), you can arrange it so that to find it, you just need to follow the arrow of that particular colour, and again this can be done without needing to look through all the entries.  But then we have an awkward question of what we do if two data items have the same colour (hash collisions!) - to be able to store both we need to fit both into a "hash bucket".  I'd hope this could lead to an interesting discussion of why a good hash function is important, and how using this structure efficiently can need us to think a bit about the distribution of our data.

So those are my data structures!  There's a couple more of these I'd like to do, and I am planning to write up a pattern for these.  I'm also interested in other data structures people would like to see - I've had requests for a 2D array, and for a ring buffer, which are both interesting challenges, so I'll have to see if I can come up with a way to represent them!

Happy knitting,
Hugh.