4 Streaming Data and Data Streams
Taiwo Kolajo1,2, Olawande Daramola3, and Ayodele Adebiyi4
1Federal University Lokoja, Lokoja, Nigeria
2Covenant University, Ota, Nigeria
3Cape Peninsula University of Technology, Cape Town, South Africa
4Landmark University, Omu‐Aran, Kwara, Nigeria
1 Introduction
At the dawn of 2020, the amount of data in the world was estimated at 44 zettabytes (i.e., 40 times more bytes than there are stars in the observable universe). The amount of data generated daily is projected to reach 463 exabytes globally by 2025 [1]. Data are growing not only in volume but also in structure and complexity, and at a geometric rate [2]. These high‐volume data, generated at high velocity, give rise to what is called streaming data. Data streams can originate from IoT devices and sensors, spreadsheets, text files, images, audio and video recordings, chat and instant messaging, email, blogs and social networking sites, web traffic, financial transactions, telephone usage records, customer service records, satellite data, smart devices, GPS data, and network traffic and messages.
There are different schools of thought on how to define streaming data and data stream, and it is difficult to draw a firm line between the two concepts. One school of thought defines streaming data as the act of sending data bit by bit instead of as a whole package, while a data stream is the actual source of the data. That is, streaming data is the act, the verb, the action, while a data stream is the product. In the field of engineering, streaming data is the process or art of collecting the streamed data; it is the main activity or operation. A data stream, by contrast, is the pipeline through which streaming is performed; it is the engineering architecture, that is, the line‐up of tools that perform the streaming. In the context of data science, streaming data and data streams are used interchangeably. To better understand the concepts, let us first define what a stream is. A stream S is a possibly infinite bag of elements (x, t), where x is a tuple belonging to the schema S and t ∈ T is the timestamp of the element [3]. A data stream refers to an unbounded and ordered sequence of data instances arriving over time [4]. A data stream can be formally defined as an infinite sequence of tuples S = (x1, t1), (x2, t2),…, (xn, tn),…, where xi is a tuple and ti is a timestamp [5]. Streaming data can be defined as a frequently changing and potentially infinite flow of data generated from disparate sources [6]. Formally, streaming data
Table 1 Streaming data versus static data [9, 10]
Dimension | Streaming data | Static data |
Hardware | Typically a single, limited amount of memory | Multiple CPUs |
Input | Data streams or updates | Data chunks |
Time | A few moments or even milliseconds | Much longer |
Data size | Infinite or unknown in advance | Known and finite |
Processing | A single pass or a few passes over the data | Multiple rounds of processing |
Storage | Not stored, or a significant portion stored in memory | Stored |
Applications | Web mining, traffic monitoring, sensor networks | Widely adopted in many domains |
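The contrast in Table 1 can be made concrete with a minimal Python sketch. It models a data stream as a possibly infinite sequence of (x, t) pairs and processes it in a single pass with constant memory, here by maintaining a running mean. The names sensor_stream and running_mean, and the synthetic values they produce, are hypothetical and chosen only for illustration; they do not come from the cited works.

```python
import itertools
from typing import Iterable, Iterator, Tuple


def sensor_stream() -> Iterator[Tuple[float, int]]:
    """Hypothetical unbounded source: yields (x, t) pairs indefinitely."""
    for t in itertools.count(1):
        # Synthetic, deterministic reading; a real source would be a sensor,
        # a socket, a message queue, and so on.
        yield (float(t % 5), t)


def running_mean(stream: Iterable[Tuple[float, int]]) -> Iterator[Tuple[int, float]]:
    """Single-pass running mean that keeps only two numbers in memory."""
    count = 0
    mean = 0.0
    for x, t in stream:
        count += 1
        # Incremental update: no past elements are stored.
        mean += (x - mean) / count
        yield t, mean


# The stream never ends, so the consumer decides how much of it to read.
first_ten = list(itertools.islice(running_mean(sensor_stream()), 10))
for t, mean in first_ten:
    print(f"t={t:2d}  running mean={mean:.3f}")
```

Because the total size of the stream is unknown in advance, the consumer cannot load it and iterate over it repeatedly as it could with a static data set; each element is seen once, used to update the summary, and then discarded.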