Gartner, which coined the term, defines Dark Data as “the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes – for example, analytics, business relationships and direct monetizing. Similar to dark matter in physics, dark data often comprises most organizations’ universe of information assets.”
The volume of data worldwide was estimated at 44 zettabytes in 2020. That volume has been growing exponentially, increasing almost tenfold in the last seven years. The growth is driven by the continuously increasing use of digitally connected devices in all spheres of life, from mobile phones to connected cars to smart homes.
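As a quick sanity check on the growth figure above, the short Python sketch below works out what a tenfold increase over seven years implies per year. It assumes smooth compound growth, which is a simplification, and the function names are purely illustrative.

```python
import math


def annual_growth_rate(multiple: float, years: float) -> float:
    """Compound annual growth rate implied by growing `multiple`-fold over `years`."""
    return multiple ** (1 / years) - 1


def doubling_time(rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


rate = annual_growth_rate(10, 7)  # tenfold growth in seven years
print(f"Implied annual growth: {rate:.1%}")
print(f"Implied doubling time: {doubling_time(rate):.1f} years")
```

Under that assumption, the data volume grows by roughly 39% a year and doubles about every two years.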
According to an IBM study, “80% of the data that is created is dark, unstructured data, data that the computers have developed in the last 40 years. Such data is not analyzed efficiently, and hence we miss 80% of the knowledge inside this data.” This percentage is expected to rise to 93% by the end of 2020. Very soon, only a small fraction of data will be traditional (or relational) in nature.
Why is the Data Dark?
Dark Data refers to data that cannot be analyzed on a day-to-day basis, whether for analytics, for business process improvement or to identify new opportunities. There are several reasons why data ends up classified as “Dark”:
Unstructured Data: Unstructured data is complex and therefore difficult to mine. It brings its own challenges around data management and SLAs for data discovery and classification, and the lack of tools and infrastructure to process it is a major roadblock.
Behind Firewall: Much of an organization’s data sits behind the firewall as emails, documents, messages, logs, notifications and so on, and is not used for analytics. This data is mostly text-based and siloed in disparate databases on secure servers because of data compliance and confidentiality policies.
Lack of tools and Infrastructure: As the points above suggest, the majority of Dark Data is high-speed, high-volume, unstructured data. Adding images, audio and video only increases the complexity. Often, organizations:
- lack the necessary tools and infrastructure to store the dark data,
- lack knowledge of the tools and software needed to process the data,
- face cultural and people challenges in modernizing the data platform or upskilling the staff.
Even now, tools to process these complex data sources are still evolving and the skill set is scarce.
Deep Web: While we rely on the likes of Google and Bing to answer our everyday queries, almost 95% of data is still not indexed by search engines and hence is not discoverable (per research conducted by IBM). Medical and financial records, legal documents, and government and organization-specific data repositories are some examples of the Deep Web.
How is Dark Data impacting Industries?
Untapped data is present across industry verticals, but it is more evident in some of the traditional ones, such as Manufacturing and Supply Chain.
In a survey conducted by Gartner, 85% of respondents felt that the Supply Chain presents significant complexity and a growing challenge. Spread globally across distributors, suppliers and customers, this industry churns out huge volumes of data, of which only 5% is analyzed. There is ample opportunity for Big Data and related technologies to be used in this domain.
In traditional pharma manufacturing, this data can be used to accelerate R&D activities. As a Sales Order is received and the order moves from Purchase Order to Port for Shipment, it passes through multiple departments (Sales and Marketing, Manufacturing, Supply Chain, Distribution), each of which generates data along the way. Similarly, a lot of untapped data lies in areas such as Real-World Evidence and Pharmacovigilance, which can give accurate insights into how a drug behaves outside the controlled environment of a clinical trial.
In Travel and Hospitality and in Retail, for instance, getting a 360-degree view of the customer is crucial for customer loyalty, personalized marketing and a better understanding of the products and services being offered. However, much of the data is generated through verbal communication and paper surveys, which are stored but not used for improvement.
Data generated by IoT and connected devices is opening the door to use cases such as predictive maintenance and proactive alerting and monitoring across industries.
How to harness Dark Data?
As the previous sections show, there is a huge Digital Universe that is still largely untapped and not used to drive analytics and insights. Dark Analytics refers to the ability to use Dark Data to derive intelligence and insights that organizations can then act on.
Embracing this data will require a “Data First” mindset across the hierarchy. This means companies have to change the way they handle their data. The entire data journey has to be thought through, from capture and processing to storage and consumption. Many companies capture data purely for regulatory and compliance reasons, not necessarily to make it available to users. Hence, the data ends up siloed in disparate systems. This “data hoarding” results in huge volumes of data, not all of it useful, and error-prone at the same time. A “Data First” strategy will allow for Data Democratization while ensuring Data Quality, Data Governance, and Data Security in accordance with the FAIR (findable, accessible, interoperable and reusable) data principles.
On the technology front, public cloud platforms (Amazon, Google and Microsoft, to name a few) offer ready-to-use services as both PaaS (Platform as a Service) and SaaS (Software as a Service), making the journey to Dark Analytics much smoother. As the technology evolves, text mining, video analytics and speech-to-text are slowly becoming out-of-the-box capabilities. Google’s Video Intelligence API, for instance, can analyze a video and identify specific elements in its scenes. A search engine can then be built on top of this output to find specific features and when they show up in the video.
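To make the video search idea more concrete, here is a minimal Python sketch of the "search engine" half. It does not call any cloud service: the `Segment` records are made-up stand-ins for whatever labels and timestamps a video analytics service returns, and `build_index` and `find` are hypothetical helper names. The sketch simply builds an inverted index from label to time ranges.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    label: str      # what was detected, e.g. "car"
    start_s: float  # segment start, in seconds from the beginning of the video
    end_s: float    # segment end, in seconds


def build_index(segments):
    """Map each lower-cased label to its time ranges, sorted by start time."""
    index = defaultdict(list)
    for seg in segments:
        index[seg.label.lower()].append((seg.start_s, seg.end_s))
    for ranges in index.values():
        ranges.sort()
    return dict(index)


def find(index, label):
    """Return the (start, end) ranges where `label` appears, or an empty list."""
    return index.get(label.lower(), [])


# Made-up annotations standing in for the output of a video analytics service.
annotations = [
    Segment("Car", 12.0, 18.5),
    Segment("Person", 3.0, 9.0),
    Segment("car", 2.0, 4.5),
]

index = build_index(annotations)
print(find(index, "car"))      # [(2.0, 4.5), (12.0, 18.5)]
print(find(index, "bicycle"))  # []
```

A real implementation would also need to handle confidence scores, synonyms and multiple videos, but the core lookup stays the same.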
The high-performance compute and elasticity offered by the Cloud, together with Cognitive Analytics and Pattern Recognition using Machine Learning, are making it possible to use this data for analytics. That said, organizations will also have to ensure that the insights presented can be trusted, are compliant and do not invite cybersecurity threats.
Clearly, Dark Data is not structured or relational data. Nor is it simply unstructured data: once you can capture, classify and use unstructured data for insights, it stops being dark. It might seem synonymous with Big Data, but Big Data analytics deals with data (structured or unstructured) that we are already collecting and using for analytics.
If you are an organization looking to use the vast potential of Dark Data, you need to start with the following steps:
Source Data Assessment: Analyze every point where data is being generated, not just the ERP and Point of Sale systems.
Data Platform Assessment: Understand the limitations and capabilities of the existing data platform, and whether it can help with data capture and extraction.
Consider Cloud: The Cloud provides elasticity and flexibility on demand, as well as a host of ready-to-use services that can serve as a starting point for extracting this data.
Identify Business Use Case: To show gains early and build stakeholder confidence, identify a use case with a quick turnaround and a clear expected value.
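The first of these steps can start as something very simple. The Python sketch below is one possible shape for a source inventory: it estimates how much of the data is dark and ranks unanalyzed sources by size as candidates for a first use case. The source names and volumes are invented for illustration, and treating a source as either fully analyzed or not analyzed at all is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    volume_gb: float  # data held by this source
    analyzed: bool    # is any of it used for analytics today?


def dark_share(sources):
    """Fraction of total volume held by sources nobody analyzes."""
    total = sum(s.volume_gb for s in sources)
    dark = sum(s.volume_gb for s in sources if not s.analyzed)
    return dark / total if total else 0.0


def dark_sources_by_volume(sources):
    """Unanalyzed sources, largest first: candidates for a first use case."""
    return sorted((s for s in sources if not s.analyzed),
                  key=lambda s: s.volume_gb, reverse=True)


# Illustrative inventory; the names and volumes are invented for this sketch.
inventory = [
    Source("ERP", 200.0, True),
    Source("Point of Sale", 100.0, True),
    Source("Email archive", 400.0, False),
    Source("Server logs", 300.0, False),
]

print(f"Dark share: {dark_share(inventory):.0%}")
print([s.name for s in dark_sources_by_volume(inventory)])
```

With this made-up inventory, 70% of the volume is dark, and the email archive comes out as the largest untapped source.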
In today’s world, almost all businesses are data businesses. Organizations need to capitalize and act on this data to stay competitive, or risk becoming obsolete. Companies like Google, Amazon and Facebook have been doing this for years and hence are the leaders in this space. But it’s never too late to start. Data Modernization holds the key. At the same time, compliance, security and threats are areas that need to be accounted for. If done right, this will open doors to a whole new world of opportunities.