The devil is in the data.
Industrial data is among the more complex forms of data. We experienced this first-hand while building a Near Real-Time (NRT) Industrial IoT engine for contextual data streaming and processing. Here we share our perspective on what constitutes industrial data, what makes it complex, and how we built a platform that overcomes these challenges and provides an abstract, seamless view ready for consumption.
There are 5 main properties of industrial data:
- Dispersed Data
- Data Type
We will visit each of them in detail below.
Dispersed data is data that is stored or streamed in multiple locations, in multiple ways and formats. There are three main kinds of data in the industrial world: manufacturing data, ERP or enterprise data, and ecosystem data. Manufacturing data contains the present and historical states of the machines (also called assets). This can be abstracted to represent a process performed by a set of machines working together in a particular fashion.
Manufacturing data is often the most structured of the three, and can be ingested in one of two ways:
- Streaming the data from a Distributed Control System (DCS) like DeltaV, an asset framework like OSI PI, PLCs, etc. This is the real-time stream of sensor data representing process, condition or other variables. It may arrive over any of the 400+ industrial connectivity options, including well-known names like MQTT, OPC, OSI, ABDF1 and Modbus.
- Ingesting data from historians and data lakes. Most industries keep historical data for many years for auditing and compliance purposes. Machine learning applications are giving this data a new use: training models and finding previously unknown insights. The data lives in various kinds of historian systems, data lakes or data warehouses, and needs to be fetched quickly and efficiently for visualisation, statistical analysis or machine learning.
Enterprise or ERP data often contains meta information about the process, for example the batch number that gets assigned. This data is necessary to create a full context and is fetched using enterprise connectors for systems like SAP HANA, IBM and Salesforce. Enterprise data is typically semi-structured, and the degree varies from one enterprise to another. Every enterprise is at its own stage of data maturity, and that level of maturity typically determines how easy or hard it is to fetch meaningful data from its existing enterprise systems or data lake.
Many other software tools are used for very specific purposes, such as a Quality Management System (QMS) or a System of Record (SoR). These are ecosystem software: they facilitate the necessary compliance, or bridge between various processes. Most often, a bidirectional communication channel with this ecosystem software is needed, using the connectors or APIs it exposes.
Data obtained by any of these means is not necessarily in a format that can be readily ingested or even processed. Almost always, some cleaning and standardisation is required before or during ingestion. This step is also specific to the enterprise, the team or the particular process.
It is easy to assume that most captured sensor data is numeric, for example temperature or pressure. There are many other types of data: a spectrometer, for example, emits spectral data for each sample, and infrared and regular cameras produce video feeds. Fields like batch and lot number can be alphanumeric as well. This diversity of data types has to be accounted for at all times, which makes things interesting because each type has different needs in memory, compute, storage and use cases. For instance, storing a temperature value at an update interval of 500 ms for 24 hours takes 1.28 MB, while storing a spectral probe’s output (with 1000 wavelengths) at the same interval over the same period takes 1536 MB (1200x). (How we store them is a topic for a later blog post.) There is a similar 10x to 100x increase in processing time for spectral and video data compared to numeric data. This difference matters because these data represent the same event, and to make sense of the event, processing of both data types must be complete so that they can be analysed together.
A key requirement, mostly for manufacturing companies, is GxP compliance. GxP is the umbrella term for “good practice” quality guidelines such as Good Manufacturing Practice (GMP). Among many other things, it means that the data stays exactly the same and that results are predictable and repeatable. This has a very interesting intersection with how data is stored: it implies that the precision of floating-point values should not change. In these situations, data needs to be stored with arbitrary precision, also called infinite precision. This requirement might limit the choice of storage systems. Let us look at this in a little detail.
IEEE 754 is the standard for floating-point arithmetic. For many years it defined only binary representations. Java’s float and double are IEEE 754 binary32 and binary64 respectively. Binary floating-point numbers are very efficient for computers to calculate with, but because they work in binary and we work in decimal, there are some expectation mismatches. This is what is called “Limited Precision Arithmetic”. For instance, 0.1 cannot be stored precisely in a double, and you get oddities like 0.1 + 0.2 turning out to be 0.30000000000000004. That makes them a poor choice for financial calculations, for instance.
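The mismatch is easy to reproduce. Here is a minimal sketch in Python, whose float is also an IEEE 754 binary64 value; Python's standard decimal module stands in for a decimal type here and is our choice for illustration, not something the platform necessarily uses.

```python
from decimal import Decimal

# Binary floating point (IEEE 754 binary64): 0.1 and 0.2 have no exact
# binary representation, so their sum picks up a rounding error.
binary_sum = 0.1 + 0.2

# Decimal arithmetic: values built from strings are held exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")

# Building a Decimal from a float exposes the value the double really holds,
# which is slightly above one tenth.
stored_tenth = Decimal(0.1)

print(binary_sum)                     # 0.30000000000000004
print(binary_sum == 0.3)              # False
print(decimal_sum)                    # 0.3
print(stored_tenth > Decimal("0.1"))  # True
```

The last line is the important one for compliance: the error is already present in the stored value, before any arithmetic is done on it.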
Then there is “Arbitrary Precision Arithmetic”, where precision can be effectively infinite (limited only by the memory of the computer). In 2008, IEEE updated 754 to add fixed-size decimal floating-point formats: decimal32, decimal64 and decimal128. Java had an implementation of arbitrary-precision arithmetic in the class BigDecimal well before the 2008 update. Its companion class MathContext (added in Java 5) specifies the context of an operation (precision and rounding mode), and ships presets named DECIMAL32, DECIMAL64 and DECIMAL128 that match the IEEE decimal formats. The whole idea behind this is to support financial-style calculations within finite memory.
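To make the idea of an arithmetic context concrete, here is a small sketch using Python's decimal module, where a Context plays a role similar to Java's MathContext. The three context names are our own labels; they mirror only the digit counts of the IEEE decimal formats (7, 16 and 34 significant digits), not their exponent ranges or encodings.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Contexts sized like the IEEE 754-2008 decimal formats (digits only).
DECIMAL32 = Context(prec=7, rounding=ROUND_HALF_EVEN)
DECIMAL64 = Context(prec=16, rounding=ROUND_HALF_EVEN)
DECIMAL128 = Context(prec=34, rounding=ROUND_HALF_EVEN)

one, three = Decimal(1), Decimal(3)

# The same division, rounded to the precision of each context.
third_32 = DECIMAL32.divide(one, three)
third_64 = DECIMAL64.divide(one, three)
third_128 = DECIMAL128.divide(one, three)

print(third_32)   # 0.3333333
print(third_64)   # 0.3333333333333333
print(third_128)  # 0.3333333333333333333333333333333333
```

The result is still rounded, but it is rounded in decimal, to a precision you choose and can document, which is what makes it repeatable across systems.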
Coming back to our situation: if we use the double data type, which is fixed-precision binary arithmetic, we are doomed to encounter inconsistencies. So, what if we just change this to BigDecimal? After all, Elasticsearch, Spark and Kafka all run on the JVM, right? Well, it turns out that there is no support for BigDecimal in InfluxDB (see the issue “BigDecimal Support”) or Elasticsearch (see the issue “Add BigDecimal data type”). Spark supports it with the DecimalType class (org.apache.spark.sql.types.DecimalType; see also “[SPARK-26308][SQL] Avoid cast of decimals for ScalaUDF”). There are a few gotchas, though (https://issues.apache.org/jira/browse/SPARK-18484, https://carelesscoding.com/2019/04/09/spark-gotcha-2.html). For Python, there is mpmath (http://mpmath.org). TimescaleDB (and its parent Postgres) supports this using the NUMERIC data type. Similarly, MongoDB 3.4+ supports it via the BSON decimal type, which is an IEEE 754 decimal128 implementation. MySQL supports it via the DECIMAL and NUMERIC types.
Detail is the granularity at which data is fetched. Depending on the industry and use case, the data interval can vary widely. A high-speed turbine might have hundreds of sensors streaming data at millisecond intervals, while the sensors along a power transmission line might emit data every few minutes or hours, because they are battery operated and are expected to run in harsh conditions for months or years. While these are two ends of the spectrum, there are numerous cases where the data interval is typically in the 0.5 to 3 s range. Sometimes the speed of data limits which existing solutions can be used. For instance, one might run into a situation where the quotas imposed by cloud IoT hubs (or similar services) are limiting, and a more convoluted route is necessary.
This level of detail also plays a role in visualisation. People often want to look at a large range of data, sometimes more than can fit in the browser or be observed by the naked eye. This can be achieved by sampling the incoming stream and storing a copy of the sampled data alongside the raw data. Sampling thus becomes an integral part of processing. Note that sampling happens on a per-tag (per-sensor) basis, typically during ingestion. There are many sampling algorithms, such as Simple Random Sampling, Stratified Sampling and Reservoir Sampling (all classes of survey sampling). It is important that interesting sections like peaks, dips and slopes are captured correctly, because they carry the visual shape of the signal. Non-uniform sampling techniques like Level Crossing, Levels and Peaks sampling are commonly used. In some industrial scenarios data may be noisy, and appropriate noise correction might be necessary.
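As an illustration of non-uniform sampling, here is a minimal sketch of send-on-delta sampling, a simplified relative of level-crossing sampling: a point is kept only when its value has moved by at least a threshold from the last kept point. The function name, the `delta` parameter and the sample readings are assumptions made for this example; this is not the sampler used in our platform.

```python
from typing import Iterable, List, Optional, Tuple

Point = Tuple[float, float]  # (timestamp in seconds, value)


def send_on_delta(points: Iterable[Point], delta: float) -> List[Point]:
    """Keep a point only when its value differs from the last kept value by >= delta."""
    kept: List[Point] = []
    last: Optional[Point] = None
    for point in points:
        last = point
        if not kept or abs(point[1] - kept[-1][1]) >= delta:
            kept.append(point)
    # Always keep the final point so a plotted line ends where the data ends.
    if last is not None and kept[-1] != last:
        kept.append(last)
    return kept


# One tag: a flat stretch, a short spike, then flat again.
readings = [(0.0, 20.0), (1.0, 20.1), (2.0, 20.05), (3.0, 25.0), (4.0, 20.0), (5.0, 20.02)]
sampled = send_on_delta(readings, delta=1.0)
print(sampled)  # [(0.0, 20.0), (3.0, 25.0), (4.0, 20.0), (5.0, 20.02)]
```

The flat stretches collapse to a single point each while the spike at t=3 survives, which is the property that matters visually. Uniform decimation at the same reduction ratio could easily have dropped it.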
Signal quality is another meta attribute that is captured and associated with each data point for each tag. In a world of data-driven decisions, signal quality is a feature that determines the confidence in an outcome. It is also abstracted at higher levels, where the overall data quality of a system is considered instead of individual signal quality.
Data from the same or different sources can represent the same concepts but be structured differently. Most often, data in different places has contextual relationships that become visible only in a unified contextual representation. What I mean is that, at the lowest level, all data is flat data. Different personas apply different lenses and make different interpretations based on their domain. For example, the operations and reliability teams at an industrial site might use OSI PI to store their process and condition data, while the quality team might use a QMS to store the quality outcomes of various batches. Sometimes these teams even call the same things by different names, which is a case of terminological heterogeneity in data. Every enterprise has its own set of processes and its own ontology. This ontology should be brought in and merged with the flat data to create a context. This is what we call contextualisation, and the resultant data is called contextualised data.
This data should be easily consumable by programmers, subject matter experts, shop floor teams and other personas, in their own terminology. This act of bridging different information from different times of an event, so that the state of events can be correctly determined for a specific domain, is called ContexAlyzation.
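A toy sketch of the idea follows: a flat reading is given a shared signal name and the batch it belongs to. Everything in it (the tag names, the alias table, the `Batch` and `Reading` classes and the `contextualise` function) is hypothetical and only illustrates merging an ontology with flat data. A real ontology is far richer than a lookup table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical alias table: each team's name for a signal, mapped to one shared term.
TAG_ALIASES: Dict[str, str] = {
    "TT-101.PV": "reactor_temperature",     # name used in the process historian
    "Reactor Temp": "reactor_temperature",  # name used by the quality team
}


@dataclass
class Batch:
    batch_id: str
    asset: str
    start: float  # epoch seconds, inclusive
    end: float    # epoch seconds, exclusive


@dataclass
class Reading:
    tag: str
    asset: str
    timestamp: float
    value: float


def contextualise(reading: Reading, batches: List[Batch]) -> Dict[str, object]:
    """Attach a shared signal name and the enclosing batch (if any) to a flat reading."""
    batch_id: Optional[str] = None
    for batch in batches:
        if batch.asset == reading.asset and batch.start <= reading.timestamp < batch.end:
            batch_id = batch.batch_id
            break
    return {
        "signal": TAG_ALIASES.get(reading.tag, reading.tag),
        "asset": reading.asset,
        "timestamp": reading.timestamp,
        "value": reading.value,
        "batch_id": batch_id,
    }


batches = [Batch("B-001", "reactor-1", start=0.0, end=100.0)]
in_batch = contextualise(Reading("TT-101.PV", "reactor-1", 42.0, 37.2), batches)
between_batches = contextualise(Reading("Reactor Temp", "reactor-1", 150.0, 21.0), batches)
print(in_batch["signal"], in_batch["batch_id"])                # reactor_temperature B-001
print(between_batches["signal"], between_batches["batch_id"])  # reactor_temperature None
```

Both team-specific names resolve to the same signal, and only the reading taken while the batch was running is tied to it.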
Data Type and Detail are the factors that determine the size of the data generated. A bioreactor can have 12,000 sensors streaming numeric data at 10 ms granularity. From our size metric above, 24x60x60x2 data points take 1.2 MiB, so 24x60x60x100 data points take 60 MiB per sensor per day. For this bioreactor, that is 60 MiB × 12,000 = 703 GiB per day, or about 8 MiB/s. As if that were not enough, compliance requires you to keep this data for 7 years. That is about 1.8 million GiB, or roughly 1.7 PiB. An enterprise might have hundreds of such bioreactors, together streaming on the order of 1 GiB/s. Compare this with a power transmission line, which could have the same number of sensors but emit a reading only every 15 minutes: 24x60x60x(1/(15x60)) = 96 data points per sensor per day, so 96x12,000 = 1,152,000 data points, which at the same per-point size is about 8 MiB per day, or roughly 100 bytes per second. Compliance storage over 7 years is about 20 GiB. Even a thousand such lines add up to only about 8 GiB per day. Note that these are pure storage data sizes. There is additional storage due to sampling and meta information, so actual storage is 1.8 to 2.2x the above estimate. Similarly, the actual processing and ingest payloads are 1.3 to 1.8x the above throughput values. These are two examples from different domains, but one can encounter a situation where both are present in the same enterprise.
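The bioreactor arithmetic can be reproduced with a few lines. This is a back-of-the-envelope sketch: the per-point size is derived from the 1.2 MiB baseline quoted above rather than from any real storage format, and the function and constant names are ours.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MIB = 1024 ** 2
GIB = 1024 ** 3

# Per-point size implied by the baseline: one sensor sampled twice a second
# for a day (172,800 points) takes about 1.2 MiB.
BYTES_PER_POINT = 1.2 * MIB / (SECONDS_PER_DAY * 2)


def bytes_per_day(sensors: int, interval_seconds: float) -> float:
    """Raw daily volume for `sensors` numeric tags sampled every `interval_seconds`."""
    points_per_sensor = SECONDS_PER_DAY / interval_seconds
    return sensors * points_per_sensor * BYTES_PER_POINT


# 12,000 sensors, one reading every 10 ms.
bioreactor = bytes_per_day(sensors=12_000, interval_seconds=0.010)

print(f"{bioreactor / GIB:.0f} GiB per day")              # 703 GiB per day
print(f"{bioreactor / SECONDS_PER_DAY / MIB:.1f} MiB/s")  # 8.3 MiB/s
```

Changing `sensors` and `interval_seconds` gives the raw estimate for other assets. The multipliers for sampled copies and meta information still have to be applied on top.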
These numbers are considerable, and a key factor in handling industrial data at this scale is relevancy. Most often, the most relevant data is real-time data. Hence, ingestion and processing of this data must happen in real time or near real time. Once the value has been extracted from the data, it is stored in layers so that the most relevant data is retrieved first. Historical data is often used as a reference benchmark, as a dataset for machine learning projects, or for auditing. Thus, storage and retrieval of such data is a prime concern with industrial data.
We will discuss the design principles behind building an Industrial Data Ingestion and Processing Pipeline in an upcoming blog.