Deep Dive into the magical world of Data.

Bit to GeoByte

paul.godfrey@primenetuk.com 05 August 2021
A data lake is a centralised repository that allows you to store all your structured and unstructured data at any scale, as well as a method for organising large volumes of highly diverse data from diverse sources. Every day, over 2.5 quintillion bytes of data are generated across the world, and it is expected to grow to 5.2 zettabytes by 2025 (TechAcute, July 2021).

I have been working at Primenet for a few years now, and the term Data Lake comes up in more and more conversations with clients and storage vendors alike. I thought I would dive into the current state of play in the digital transformation and look at how data is now physically stored, accessed and, above all, used in analytical situations to spot trends and power the world of Artificial Intelligence.

Back in the old days, bits and bytes were stored on tape and locked away in vast repositories, only to be hauled out of storage when something went wrong and you needed the tape backup to restore data. Oh, how we have moved on from those days. Although, talking to a data centre partner, tape is still a thing for cold, dead storage, the data stored in those days was never fully utilised. Compliance is tight for a few verticals, such as legal and medical, which have to store client and patient records for up to seven years and in some cases indefinitely.

I remember sitting down with a practice director at a sizeable firm of solicitors, about 10 years ago now, to talk about digital backups to tape, and they mentioned that they kept all their records on paper and used three shipping containers to store them in.
Madness, right? I finally got them scanning those documents onto a Network Attached Storage drive and backing them all up to a tape drive. Much safer, and easier to restore if required.

Tide of change

Data has really moved on since then, and the sheer scale of it has produced this new term, Data Lakes. A Data Lake is a storage repository that holds vast amounts of raw data in its native format until it is needed. It is held in a flat architecture, as opposed to, say, a data warehouse, which stores data in a more hierarchical, folder-based structure, where the term “structured data” is applied. The beauty of a data lake is that each piece of raw data stored in it is assigned a unique identifier and tagged with metadata. This allows queries to be made over the whole data set, from which smaller queries and analyses can be pulled out. Hence the buzzword Big Data was formed: we could query vast lakes of raw data and start to see trends and nuances that benefit science and business alike.
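To make that idea a little more concrete, here is a minimal sketch in Python. It is purely illustrative: the file paths, tag names and the little catalogue helpers are my own invented examples, not any particular vendor's API. The point it shows is that the raw files stay flat and untouched, while a unique identifier and metadata tags give you something to query across the whole lake.

    import uuid
    from datetime import datetime, timezone

    # A toy in-memory "catalogue": in a real lake this would be a metadata
    # service, while the raw files sit untouched in flat object storage.
    catalogue = []

    def register(path, **tags):
        """Record a raw object's location with a unique ID and metadata tags."""
        entry = {
            "id": str(uuid.uuid4()),                         # unique identifier
            "path": path,                                    # where the raw file lives
            "ingested": datetime.now(timezone.utc).isoformat(),
            "tags": tags,                                    # free-form metadata
        }
        catalogue.append(entry)
        return entry["id"]

    def find(**criteria):
        """Query the whole catalogue by metadata, not by folder structure."""
        return [e for e in catalogue
                if all(e["tags"].get(k) == v for k, v in criteria.items())]

    # Register raw data of very different shapes in the same flat lake.
    register("lake/raw/sensor-2021-08-05.csv", source="iot", format="csv")
    register("lake/raw/webchat-dump.json", source="support", format="json")

    # Later, pull out just the slice an analyst cares about.
    print(find(source="iot"))

Nothing about the raw files themselves has to change; the structure lives in the metadata, which is what keeps the lake flexible.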

Data Lakes and Data Warehouses are two completely different data storage strategies for holding and working with Big Data. The beauty of a Data Lake is that it can hold both structured and unstructured data, whereas a Data Warehouse will primarily hold structured data only. This makes a data lake agile to query, and data analytics companies use tools and clever methods of analysing the data to build business cases out of their findings. It is becoming a massive business, and the corporate world has woken up to the analysis of such data lakes as a way of finding a business edge over the competition.

The only downside to the data lake transformation is the management of them: when combining multiple data sets from various sources, you can end up with an unmanageable mess of sticky bits and bytes, and soon you have a digital soup that is not fit for purpose. So great care must be put into the management of the lake. There is a vast array of vendors, with Microsoft Azure, AWS and Google being the most prominent, but there are others such as Oracle and IBM, and even the Chinese-owned Alibaba Cloud has a massive repository.

Then you might ask the question: which database is best in a Data Lake environment? The usual suspects are obviously Oracle, MongoDB, Redis and Elasticsearch, but also PostgreSQL and even Apache Cassandra.

I have not really touched on Hadoop yet. It is the open-source software framework at the heart of much of the Big Data and analytics revolution, providing solutions for enterprise data storage and analytics with almost unlimited scalability. Since the version 1.0 release in 2011 it has rapidly grown in popularity, and a strong ecosystem of distributors, vendors and consultants has emerged to support its use across the industry.
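To give a flavour of how Hadoop's MapReduce model works, here is the classic word-count example written as a pair of small Python scripts for Hadoop Streaming. It is a sketch only; the input paths and job submission details below are assumptions and will vary by cluster and distribution.

    # mapper.py - read raw text from stdin, emit one "word<TAB>1" pair per word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - Hadoop sorts the mapper output by key before it arrives here,
    # so the running total can simply be flushed each time the word changes
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

Submitted with something like hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /lake/raw/text -output /lake/wordcount (the exact jar location depends on the distribution), the same two scripts run unchanged whether the input is a few megabytes on a laptop or petabytes spread across a cluster, which is the scalability the framework is known for.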

Linux: an open-source operating system modelled on UNIX.

Look at it like Linus Torvalds and Linux: Red Hat then came out with its supported version, and the banks jumped on it because they could now get support for the OS. By the way, a great book if you like that sort of thing is “The Cathedral and the Bazaar”. Get it on Amazon (other stockists available); it is an easy read and maps out the history of Linux's development in layman's terms. It is very similar to the vendors who have brought out supported versions of Hadoop and Apache Spark.


Vendors

Although a Data Lake is not a specific technology, there are several technologies that enable them. Some vendors that offer those technologies are:

  • Apache - offers the open-source Hadoop ecosystem, one of the most popular data lake technologies.
  • Amazon - offers Amazon S3 with virtually unlimited scalability.
  • Google - offers Google Cloud Storage and a collection of services to pair with it for management.
  • Oracle - offers the Oracle Big Data Cloud and a variety of PaaS services to help manage it.
  • Microsoft - offers Azure Data Lake as scalable data storage and Azure Data Lake Analytics as a parallel analytics service. This is an example of the term data lake being used to refer to a specific technology rather than a strategy.
  • HVR - offers a scalable solution for organisations that need to move large volumes of data and update it in real time.
  • Podium - offers a solution with an easy-to-implement and easy-to-use suite of management features.
  • Snowflake - offers a solution that specialises in processing diverse datasets, including structured and semi-structured datasets such as JSON, XML and Parquet.
  • Zaloni - offers a solution that comes with Mica, a self-service data prep tool, and data catalogue.


The Scale of Data

  • bit (b) - a single binary digit; 1/8 of a byte.
  • byte (B) - 8 bits; roughly one letter or character typed on a keyboard.
  • kilobyte (KB) - 1,024 bytes; about the length of a very short story.
  • megabyte (MB) - 1,024 kilobytes; around 900 pages of plain text.
  • gigabyte (GB) - 1,024 megabytes; the equivalent of around 350 images.
  • terabyte (TB) - 1,024 gigabytes; equal to around 40 DVDs.
  • 88 TB - the physical size of the entire internet in 1997. Just to put things into perspective, there is a fascinating video on the evolution of the internet and the subsequent need for data lakes: https://bit.ly/88TBevo
  • petabyte (PB) - 1,024 TB; around 9 billion pages of plain text. The human brain is said to hold around 2.5 petabytes of memories, unless you're Albert Einstein.
  • exabyte (EB) - 1,024 PB; an extraordinary amount of data and the foundation or starting point of most data lakes. Some scientists have estimated that all the words ever spoken by mankind would equal five exabytes. According to Eric Schmidt, former CEO of Google, this was roughly the size of the internet, but you know it is ever evolving.
  • zettabyte (ZB) - 1,024 EB; once the daddy of data. It is a vast number, well over 300 trillion MP3s, but it is no longer the big kid on the block.
  • yottabyte (YB) - 1,024 ZB; in excess of 50 trillion DVDs. To put that into perspective, IBM put Watson on the marketplace back in 2014, fanfaring yottabyte storage as a thing.
  • brontobyte (BB) - 1,024 YB; one brontobyte is the equivalent of one quadrillion terabytes of data, but even this has been surpassed.
  • geobyte - 1,024 BB; yep, it's a thing, and at present is 1,267,650,600,228,229,401,496,703,205,376 bytes. A crazy number, I know. Some say that if anyone ever did build such a leviathan machine it would stretch around the globe many times, leaving its carbon footprint in the wasteland of the Arctic Circle and the tundra of the Russian steppe.
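If you want to sanity-check those jumps, each step up the ladder is just another factor of 1,024 (2 to the power of 10). A few lines of Python reproduce the whole scale, including that geobyte figure; the unit labels here simply mirror the list above, nothing vendor-specific.

    # Each unit in the ladder is 1,024 times the one before it (2**10).
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", "BB", "GeoB"]

    for power, name in enumerate(units):
        size_in_bytes = 1024 ** power          # i.e. 2 ** (10 * power)
        print(f"1 {name:>4} = {size_in_bytes:,} bytes")

    # The last line printed is 2**100:
    # 1 GeoB = 1,267,650,600,228,229,401,496,703,205,376 bytes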

Future of Data exploration

I have already mentioned Apache Hadoop, but for me the opportunities are endless. The first release of this Java-based set of software utilities came in April 2006, and its subsequent development over the years has led us to this point in time. The ability to query vast lakes of unstructured data is going to revolutionise our lives.

I know it is a big, bold statement, but there is a massive ecosystem of open-source projects going on right now, exponentially advancing the development of data analysis. These projects can query vast volumes of almost any data type, offering a structured view of raw data, and the output gives a radically different view of trends and moves the goalposts on what we thought possible in the past.
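To give a feel for what "a structured view of raw data" looks like in practice, here is a short PySpark sketch before we get to a real-world example. The bucket path, field names and session settings are invented for illustration; the shape of the code, though, is how Spark SQL is typically pointed straight at raw JSON files sitting in a lake.

    from pyspark.sql import SparkSession

    # Spin up a Spark session; on a real cluster this would be configured
    # to talk to the lake's object store (S3, ADLS, GCS and so on).
    spark = SparkSession.builder.appName("lake-query-sketch").getOrCreate()

    # Read raw, schema-less JSON straight from the lake. Spark infers a
    # structured view (columns and types) without the files being moved
    # or transformed first. The path below is hypothetical.
    events = spark.read.json("s3a://example-lake/raw/events/")

    # Expose the inferred structure to plain SQL and query across the lake.
    events.createOrReplaceTempView("events")
    trends = spark.sql("""
        SELECT event_type, COUNT(*) AS occurrences
        FROM events
        GROUP BY event_type
        ORDER BY occurrences DESC
    """)

    trends.show()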

For example

Data lakes are being used right now for storing facial recognition and biometric data in China. Huge amounts of data are being committed to real-time integration and then stored for trend analysis and policing later. On average a facial imprint runs to around 200 KB per image; multiply that by the size of the population, watched over by an estimated 567 million cameras, and the data lake used to run the whole operation must be of biblical proportions. Not to mention a massive stain on human rights... but I'm not going down that path.

So, in summation, Apache Hadoop seems to me to be the future; its grounding in open-source projects makes it my number one route to data analysis. I have tried to keep this off the technical path, but I hope it has given you a little insight into how data is stored and what we can glean from these vast repositories of data called Data Lakes.


Paul Godfrey