Storing data persistently is something we all need to do, and there are a huge number of ways to do it: text files, binary files, INI, CSV, XML, JSON, Excel, MDB, DB, or even some personal combination of those - there is a near infinite number of formats to choose from. So why did you choose the format you did?
Introduction
We all store data, and we all want that storage to be safe, reliable, and easy to use.
Background: The Scale Problem
Let's ignore computers for a bit and look at what happens in the real world when you start to scale things.
You and a mate start a company to sell New Improved Widgets and sometimes, you need to buy little stuff for the office: coffee, sugar, milk, paper, staples, the odd software package, pay for a taxi to a client - the "bits and bobs" that keep a company running in the background which aren't essential to the core business of shipping Widgets but without which the company doesn't work so well. Not a lot of money involved, but the company needs to pay for it (or you or your mate are out of pocket unfairly) and you really want to include it on the "expenses" side so you don't have to pay the government taxes on it. Well, you wouldn't want to give them more than they deserve, would you?
So you do what all small companies do: you buy a petty cash tin, and stick some cash in it. You need staples? Grab some cash, get them on the way home, put the receipt and change in the tin. Easy.
And that works fine: every month or so, you add more cash and put the receipts aside for the accountant to work out.
And the company does well. You now employ 5 people. That's ok - you trust them, pretty much - so all you do is make a tiny change: you have a key to the tin, and so does your mate. The company needs staples, you give Joe some cash, remind him about the receipt, and get the change back in the morning. It's the same system, it still works.
A year later, and you have 100 employees, and the system is creaking at the seams: the tin is now in the safe because it contains some serious money and you don't want it stolen while the company is closed. You have a book where you write down which employee you gave money to, how much, what it was for, and when you got the receipt and change - because otherwise, you'll forget who owes you the change. But you are pretty sure Mike in dispatch has altered it a couple of times and pocketed the difference and you don't trust him that much. Nothing you could prove, and he's good at his job, but ... let's just remove the temptation to steal from us, shall we? The book lives in the safe as well, so only you and your mate can edit it.
Ten years later and the Widgets business is really booming! 10,000 employees, twelve sites in eight countries, and ... a huge tin that takes up most of your time just to deal with. In fact, around 100 of those employees are devoted solely to working with the different currencies in the tin and double checking every transaction, because you could easily buy a nice house with the money that flows through it every month!
That's the scale problem: what works on a small scale becomes impossible to work with at a larger scale, no matter how much process you wrap around it, no matter how hard you hit the idea with a big stick. It just doesn't fit the new circumstances.
Large companies don't use a petty cash tin!
Let's talk about files instead
It will probably surprise some of you to know that there is only one file type on a computer: binary data. Everything else is a "layer" on top of that provided by the software that processes the file.
Remember that: everything is binary data, even executables, text files, databases, images, JSON, XML, CSV. Even if a human can read it somehow (text, JSON, XML, PDF, HTML, CS, PHP, VB, ...) at its most basic, everything is binary, and the act of presenting it in a readable manner is down to interpretation of that binary data by an application.
Text files, for example, contain a restricted set of binary values (the "printable" characters) plus a few special values called "control codes" which tell the reader where lines end and where tabs sit: Windows files, for example, use two special characters called CR and LF together to indicate the end of a line. Take a simple file containing the two lines "Line 1" and "Line 2" and open it in a hex editor. The left-hand column is just the byte address within the file in hexadecimal, and the data starts with "4C", which is the hexadecimal value we interpret as an upper case "L". Look along the line and you will see a "0D" followed by a "0A" - these are the two CR and LF special characters I was talking about. At the right is the "text reader" version, which replaces "unprintable" characters with ".".
This is important: it means that there is no "line based" formatting in a text file. Lines are an interpretation of the binary data, not an intrinsic feature of the file. And that means lines don't even take up the same amount of space in the file: a longer line means more characters to the left of the CR/LF and a bigger file.
And worse, to add a line between "Line 1" and "Line 2", we need to copy all the data up to the end of "Line 1" (including the CR and LF) to a new file, write our new line to the file with a new CR/LF combination, and then copy all the data after that from the old file to the new, before closing both files, deleting the original, and renaming the new. And that is exactly the process you need to change "Line 1" to "Hello World!" as well: if you just open the file for read/write access, write a new line to it, and save, you will overwrite parts of the second line with your new data.
You cannot insert or delete any data in a text file without rewriting the whole thing, it has no concept of lines!
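Here is that whole dance as a Python sketch (`insert_after` is a made-up helper name; any language has to perform the same steps):

```python
import os

def insert_after(path, target, new_line):
    """Insert new_line after the line matching target, the only safe way:
    copy the whole file to a new one, adding the line along the way."""
    tmp = path + ".tmp"
    with open(path, "r", newline="") as src, open(tmp, "w", newline="") as dst:
        for line in src:
            dst.write(line)                    # copy the original data across
            if line.rstrip("\r\n") == target:
                dst.write(new_line + "\n")     # slip the new line in
    os.replace(tmp, path)                      # delete original, rename new
```

Every byte of the file gets read and rewritten just to add one line - and this version doesn't even try to cope with another process changing the file while it runs.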
That has nothing to do with the "scale problem" you were talking about!
Doesn't it?
It does, actually. When you use a text file for storing data - each item on a new "line" - it works fine: it's easy for both you and your application to read, and it's pretty robust. The petty cash tin is working fine! But ... if you start to change what you do with the file of data, then it starts to complicate things - a lot.
Inserting a new line between two existing ones, or deleting a line: both take some careful work. Read the file up to the insert point (which isn't at a fixed location, so you have to search the data in the file to find it) while copying the data to a new file; write your new line to the new file (for an insert) or skip to the end of the line you want to remove (for a delete); then copy the rest of the original file over. Delete the input, rename the output. The petty cash tin is in the safe.
You can do it - it's not that complicated - but it's got to be done carefully, and if you are working with multiple users accessing the file your software has to cope with that and with the file potentially changing while you are reading it ... There is a whole department looking after the petty cash tin!
What does that have to do with my data?
Good question!
And the answer is simple: everything.
Text isn't particularly useful on its own - it's fine for users to read, but it generally needs more structure if you want a computer to read it. So most applications don't use "raw" text files; they add a layer of abstraction on top - CSV, JSON, and XML being the most common - that applications use to recover the data reliably. Here is the same sample data in all three, starting with CSV:
policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,190724.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
The same data as JSON:

[
{
"policyID": 119736,
"statecode": "FL",
"county": "CLAY COUNTY",
"eq_site_limit": 498960,
"hu_site_limit": 498960,
"fl_site_limit": 498960,
"fr_site_limit": 498960,
"tiv_2011": 498960,
"tiv_2012": 792148.9,
"eq_site_deductible": 0,
"hu_site_deductible": 9979.2,
"fl_site_deductible": 0,
"fr_site_deductible": 0,
"point_latitude": 30.102261,
"point_longitude": -81.711777,
"line": "Residential",
"construction": "Masonry",
"point_granularity": 1
},
{
"policyID": 448094,
"statecode": "FL",
"county": "CLAY COUNTY",
"eq_site_limit": 1322376.3,
"hu_site_limit": 1322376.3,
"fl_site_limit": 1322376.3,
"fr_site_limit": 1322376.3,
"tiv_2011": 1322376.3,
"tiv_2012": 1438163.57,
"eq_site_deductible": 0,
"hu_site_deductible": 0,
"fl_site_deductible": 0,
"fr_site_deductible": 0,
"point_latitude": 30.063936,
"point_longitude": -81.707664,
"line": "Residential",
"construction": "Masonry",
"point_granularity": 3
},
{
"policyID": 206893,
"statecode": "FL",
"county": "CLAY COUNTY",
"eq_site_limit": 190724.4,
"hu_site_limit": 190724.4,
"fl_site_limit": 190724.4,
"fr_site_limit": 190724.4,
"tiv_2011": 190724.4,
"tiv_2012": 192476.78,
"eq_site_deductible": 0,
"hu_site_deductible": 0,
"fl_site_deductible": 0,
"fr_site_deductible": 0,
"point_latitude": 30.089579,
"point_longitude": -81.700455,
"line": "Residential",
"construction": "Wood",
"point_granularity": 1
}
]
And the same data again as XML:

<root>
<row-0>
<policyid>119736</policyid>
<statecode>FL</statecode>
<county>CLAY COUNTY</county>
<eq_site_limit>498960</eq_site_limit>
<hu_site_limit>498960</hu_site_limit>
<fl_site_limit>498960</fl_site_limit>
<fr_site_limit>498960</fr_site_limit>
<tiv_2011>498960</tiv_2011>
<tiv_2012>792148.9</tiv_2012>
<eq_site_deductible>0</eq_site_deductible>
<hu_site_deductible>9979.2</hu_site_deductible>
<fl_site_deductible>0</fl_site_deductible>
<fr_site_deductible>0</fr_site_deductible>
<point_latitude>30.102261</point_latitude>
<point_longitude>-81.711777</point_longitude>
<line>Residential</line>
<construction>Masonry</construction>
<point_granularity>1</point_granularity>
</row-0>
<row-1>
<policyid>448094</policyid>
<statecode>FL</statecode>
<county>CLAY COUNTY</county>
<eq_site_limit>1322376.3</eq_site_limit>
<hu_site_limit>1322376.3</hu_site_limit>
<fl_site_limit>1322376.3</fl_site_limit>
<fr_site_limit>1322376.3</fr_site_limit>
<tiv_2011>1322376.3</tiv_2011>
<tiv_2012>1438163.57</tiv_2012>
<eq_site_deductible>0</eq_site_deductible>
<hu_site_deductible>0</hu_site_deductible>
<fl_site_deductible>0</fl_site_deductible>
<fr_site_deductible>0</fr_site_deductible>
<point_latitude>30.063936</point_latitude>
<point_longitude>-81.707664</point_longitude>
<line>Residential</line>
<construction>Masonry</construction>
<point_granularity>3</point_granularity>
</row-1>
<row-2>
<policyid>206893</policyid>
<statecode>FL</statecode>
<county>CLAY COUNTY</county>
<eq_site_limit>190724.4</eq_site_limit>
<hu_site_limit>190724.4</hu_site_limit>
<fl_site_limit>190724.4</fl_site_limit>
<fr_site_limit>190724.4</fr_site_limit>
<tiv_2011>190724.4</tiv_2011>
<tiv_2012>192476.78</tiv_2012>
<eq_site_deductible>0</eq_site_deductible>
<hu_site_deductible>0</hu_site_deductible>
<fl_site_deductible>0</fl_site_deductible>
<fr_site_deductible>0</fr_site_deductible>
<point_latitude>30.089579</point_latitude>
<point_longitude>-81.700455</point_longitude>
<line>Residential</line>
<construction>Wood</construction>
<point_granularity>1</point_granularity>
</row-2>
</root>
You can read that - pretty easily for the JSON, slightly less for the XML, and with a bit of difficulty for the CSV - and for an application, it's really easy to read or write any of them.
But inserting? Deleting? Updates? You have two options:
- Read the whole file into memory using a package that understands the source format and expand it to objects your app understands, then add your new object, remove or modify the offending object and then write the whole lot back to a new file using the package again.
- Or write code to read the formatted data and work out where your changes should be made (while writing data to a new file), make your modification, write it out as necessary, copy the rest of the file over, close them both, and delete / rename - just as you did with a line in a text file, but with the problems on steroids: your app now needs to understand JSON (or XML, or CSV, or whichever of the many formats you chose) extremely well, in code you have written yourself, on top of knowing how to insert strings into text files!
Neither option is efficient, neither option is simple, and both options are fraught with potential problems.
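For the JSON above, the first option looks something like this Python sketch (`policies.json` and the trimmed-down records are just stand-ins for the full file):

```python
import json

# A tiny stand-in for the JSON file shown above (two fields per record).
with open("policies.json", "w") as f:
    json.dump([{"policyID": 119736, "statecode": "FL"},
               {"policyID": 448094, "statecode": "FL"}], f)

# Option one in practice: read the WHOLE file into objects, edit in memory,
# then write the WHOLE file back out - there is no in-place edit.
with open("policies.json") as f:
    policies = json.load(f)

policies = [p for p in policies if p["policyID"] != 448094]  # "delete a row"
policies.append({"policyID": 999999, "statecode": "FL"})     # "insert a row"

with open("policies.json", "w") as f:
    json.dump(policies, f, indent=1)

print([p["policyID"] for p in policies])  # [119736, 999999]
```

Fine for three records; painful for three million, because every change costs a full parse and a full rewrite of the file.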
JSON and XML were designed to be data transfer formats, not data storage - and that's a very different idea. They are both formatted to be human readable and editable (to a degree) and to provide a programming language independent transfer of data. When you read JSON or XML into your app, it is converted into classes and variables that you can use as if the data was written in your app - regardless of what language your app is written in. So they "know" about integers, and floats, and strings, and dates, and arrays - all the basic elements that modern languages provide. And there are packages available for most modern languages to allow you to read and write JSON and XML in a couple of lines of your code.
But ... that's complicated, hierarchical data, and that's the reason for the "to a degree" caveat in the previous paragraph. Yes, you can edit these files manually - but it's horribly easy to mess up the whole file if you do! And inserting and deleting items by editing JSON or XML with your own app as "just text" is worse still: your app needs to understand and be able to process JSON or XML itself, and that's a lot of work.
And that is the crux of the matter: if you want to modify row-based data, or table based data, then you need to use a format that is designed to work easily with such storage.
Brilliant! I'll use Excel!
Yes, you can store data in a spreadsheet. They are pretty good at it. Yes, they are row and column based. Yes, you can interface your app with the Excel engine. But ... when a spreadsheet starts getting big, it gets slow - very slow - and difficult to work with. Why is that?
The answer is partly down to how they store data: an XLSX file is a ZIP archive containing a number of XML files which hold the data. So every time you insert a row, the appropriate XML file has to be rewritten with the new data, the whole lot zipped up again, and a new ZIP file written with an XLSX extension.
That's the JSON / XML / CSV approach all over again, just slightly hidden from your view!
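You can see the container for yourself. This Python sketch builds a toy "workbook" in memory just to show the shape; a real .xlsx has the same layout (a ZIP full of XML parts), only with many more of them:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:           # an .xlsx is just a ZIP...
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("xl/workbook.xml", "<workbook/>")
    z.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")  # ...full of XML

with zipfile.ZipFile(buf) as z:
    print(z.namelist())        # the XML "parts" that make up the workbook

print(buf.getvalue()[:2])      # b'PK' - the ZIP signature bytes
```

Rename a real .xlsx to .zip and open it, and you'll find the same kind of structure inside.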
I'm as guilty of this as anyone: when I need a "quick data source" I reach for my spreadsheet and fill it in - I'll even read it into my app to process it. But when it gets big, or I need better processing, I convert to a DB and write a quick "input app" to load / change data because I know what problems I will face if I don't.
Which brings us nicely to the title of this article:
That's not a database, dude!
Or more accurately "A database is exactly what you need to use".
Databases don't store data as text: they use a binary format that allocates space within the file for rows, allows that space to grow and shrink, releases space back to the pool when you delete a row, and does a whole load of other things which make working with row-based data as simple as "insert this ...", "delete that ...", and "change the other to ...".
Your code interfaces with a DB engine (which can take a bit of getting your head round, I admit) and issues SQL commands which can be as simple as reading all the data in all the rows:
SELECT * FROM MyTable
Or add a row:
INSERT INTO MyTable (FirstName, Surname) VALUES ('Mike', 'Smith')
Up to some really complex stuff that combines data from a dozen tables into a single summary table.
It seems complicated, but the basics are really simple: the W3Schools SQL tutorial covers them pretty well, and most modern languages / frameworks provide packages to access databases directly from your code in a simple and consistent manner.
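As a taste of how little code that can be, here's a Python sketch using the built-in sqlite3 module (an in-memory database here, so there is nothing to install or clean up):

```python
import sqlite3

# SQLite keeps a whole database in one binary file; ":memory:" for the demo.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MyTable (FirstName TEXT, Surname TEXT)")

# Insert a row: the engine handles all the byte-shuffling for us.
con.execute("INSERT INTO MyTable (FirstName, Surname) VALUES (?, ?)",
            ("Mike", "Smith"))

# Read all the data in all the rows.
rows = con.execute("SELECT * FROM MyTable").fetchall()
print(rows)  # [('Mike', 'Smith')]
con.close()
```

Compare that with the file-copying dance earlier: inserting, deleting, or updating a row is one statement, however big the table gets.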
TL;DR
What works for data on a small scale fails catastrophically when you try to expand it, particularly if you started from a "that'll be easy to do" point of view.
Text files are great for initial values for your app and similar "small stable data" applications - .INI and .CONFIG are common.
Structured data is great for more complicated apps where the data doesn't change much, if at all - and when it does, the change is only ever appending to the end of an existing file. .CSV, .JSON, .XML, and .XLSX are all brilliant in their own way.
But if you know you need to change the data, remove bits, reorder data, or do anything much more complex than read it, then you really, really need to look at a database. It'll save you so much time in the long run!
History
- 1st December, 2022: First version
- 3rd December 2022: typos and an improved description, thanks to PIEBALDconsult.