Data Lakes and Data Science – Simplified

For a business segment that Markets and Markets recently predicted will be worth $322.9 billion by 2026, data clearly shows a lot of promise [1]. Sadly, only about 28% of unstructured data, which accounts for 80% of the global data set, is ever analyzed; the remaining 72% is what is tipped to grow the data science market at a 22.7% CAGR over the next four years.

The plain fact is that we generate enormous amounts of data every single day. Terabytes, petabytes, exabytes, and zettabytes are words that get the immediate attention of data scientists and CXOs alike, and they will continue to do so as the whole world transitions to data-centricity. From store shelves to autonomous vehicles, every industry and every business is now at some stage of harnessing the power of data, and those that aren't are losing ground.

Another statistic of significance is the rise of the data lakes market, projected to grow from its current size of $3.74 billion to about $17.6 billion by 2026 [2]. It is not unexpected that data lakes are projected to grow at least as fast as the data science market itself: all data science requires data, and beyond the typical business intelligence and analytics use cases, most statistical, machine learning, and artificial intelligence models rely largely on unstructured data, which calls for a data lake.
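
As a rough back-of-the-envelope check (assuming a 2021 base year and a five-year horizon, neither of which comes from the cited report), the implied compound annual growth rate of the data lakes market works out to:

CAGR = (17.6 / 3.74)^(1/5) - 1 ≈ 0.36, or roughly 36% per year

which would comfortably outpace the 22.7% CAGR projected for the data science market as a whole.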

What do data lakes mean for businesses?

A hardware manufacturer offering servers and storage for data lakes and data science applications ran a global campaign titled “The Data Centered”, celebrating the achievements of people and businesses that use data-driven decision-making [3]. The campaign is just one sign of how businesses are increasingly adopting data solutions to drive their objectives and vision.

Data makes outcomes predictable. It is as simple as that. From deciding which products to bundle together to automating checks on how tightly the screws holding server panels are fastened, data can identify the right answer. Because data is the foundation for analytics and AI, it has become imperative for businesses to invest in its storage, management, and utilization.
As innovative business models evolve around data, the average business has been forced to reimagine its operations through an AI lens. Some of the transformative business cases for data lakes and data science include:

  • Behavioral insights gained from customer activity and purchasing patterns, including the potential for dynamic product bundling based on buyer behavior
  • Social listening, which helps businesses tailor their marketing and product/service strategy based on insights gleaned from social media chatter
  • Geographic and demographic analytics, which allow differentiated strategies for different market segments by statistically modeling buyers into segments based on geography, demographics, and other variables (a minimal segmentation sketch follows this list)
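
As an illustration of that last point, below is a minimal segmentation sketch in Python using scikit-learn's k-means. The column names, sample values, and choice of three clusters are hypothetical stand-ins for whatever curated extract your data lake provides; this is a sketch of the technique, not a production pipeline.

```python
# Minimal customer-segmentation sketch (illustrative only).
# Assumes a small tabular extract with geographic/demographic attributes;
# the column names, values, and cluster count are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Stand-in for a query against the curated zone of the data lake
customers = pd.DataFrame({
    "age":                  [23, 35, 52, 41, 29, 60],
    "annual_spend":         [1200, 5400, 3100, 7800, 900, 4300],
    "distance_to_store_km": [2.5, 14.0, 6.3, 30.2, 1.1, 8.7],
})

# Scale features so no single attribute dominates the distance metric
features = StandardScaler().fit_transform(customers)

# Group buyers into a small number of segments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(features)

print(customers.sort_values("segment"))
```

In practice, the features would come from the lake's curated zone, and the number of segments would be chosen together with the business, for example by inspecting silhouette scores across candidate values.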

While the case for data lakes and data science is strong, there are issues to consider. The vision, the strategy, and the data lake itself are all critical success factors for your data science initiatives.

Getting Started – Setting up the Data Lake

Getting started with your data lake is simple if you know what you want to do, how you want to do it, and what it takes.

  • What you want to do: This is your core business case for aggregating and using data. Start by looking at what kind of data you have and what kind of data you will acquire. With that information at hand, draw up a wishlist of data-driven scenarios. Once you have this list, you can prioritize by business need, technology and development requirements, and whatever other groundwork must be in place before a data science use case can effectively be put to work.
  • How you want to do it: This is about the technologies at hand. List out your data sources and the technologies they use, then whittle the options down to those that suit your data lake. For example, which cloud platform might work best? A hosted solution may even be an option, with some providers offering data lakes as a service (see the ingestion sketch after this list).
  • What it takes: One of the biggest challenges in technology transformation (digital transformation included) is dealing with changes in ways of working. But that is not the only challenge. For data-science-enabled decision-making to work, the relevant systems also need to be in constant conversation with each other. This level of integration requires an enterprise-wide strategy and, more importantly, a vision of what the future enterprise and organization will look like.
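
To make the “how” slightly more concrete, here is a minimal sketch of landing a raw file in an object-store-backed data lake, assuming AWS S3 and the boto3 SDK purely for illustration; the bucket name, prefixes, and source file are hypothetical, and Azure Data Lake Storage or Google Cloud Storage offer equivalent APIs.

```python
# Minimal raw-zone ingestion sketch (illustrative only).
# Assumes an S3-backed data lake; the bucket, prefixes, and source file
# below are hypothetical placeholders.
from datetime import date
import boto3

s3 = boto3.client("s3")

BUCKET = "example-company-data-lake"          # hypothetical bucket name
source_file = "exports/pos_transactions.csv"  # hypothetical local export

# Partition the raw zone by source system and load date so downstream
# catalog and transformation jobs can locate new data without scanning
# the whole bucket.
key = f"raw/pos/ingest_date={date.today():%Y-%m-%d}/pos_transactions.csv"

s3.upload_file(source_file, BUCKET, key)
print(f"Landed s3://{BUCKET}/{key}")
```

A date-partitioned “raw” prefix is one common layout choice; curated and analytics-ready zones would typically sit alongside it under their own prefixes.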

Getting on with your data lake

Getting your data lake started covers only a small part of the road. The journey ahead is long and arduous, fraught with risks and pitfalls. Some of the biggest challenges in keeping momentum going are:

  • Dreaming too big: One of the biggest causes of failure in data science, analytics, and AI adoption is a vision that is very large and vague, with no clear-cut goals or time-bound objectives. When businesses chase too broad a variety of business cases at once, the true value of big data, analytics, and AI is not demonstrated fast enough. This lack of tangible outcomes increases perceived risk and, in turn, reduces investment in these solutions.
  • Dreaming too small: Just as taking on a very large use case, or too many use cases at once, is a challenge, so is thinking too small. A use case that is too small or too narrowly focused can limit the applicability of the outcome, making stakeholders think twice about whether data science and a data lake are actually worth spending on.
  • Not managing change: When setting up your data lake and implementing data science prototypes and proofs of concept, it is important to secure internal buy-in. Fostering a culture of data-centricity, and socializing why data science is needed and how it will help rather than replace human touchpoints, are critical success factors for your data lake and AI initiatives.

Engaging a consultancy services provider can help you assess your maturity level and gain an unbiased view of the data lake and what it means for your organization. Insight into what your people think of the new processes and technologies (data lakes, data science, ML, and AI) is key to overcoming future challenges in adoption.

Solo or Safe?

Depending on your data lake and data science business case and roadmap, you can build an in-house team, outsource, or take on a consulting or talent partner. Each approach has its advantages and risks; here are five aspects to consider when making your choice.

  • Team – Building a team takes time, effort, and money. Consulting partners can get you started fairly quickly, giving you a head start on your data science projects. Staff augmentation works similarly, giving you a team on short notice, but with an outsourcing approach you get a team that usually has experience, and a record of success, working together, since partners maintain dedicated resource pools for projects.
  • Time – Unless you are extremely lucky, an in-house team is the most time-consuming to set up. Getting your use cases to production is typically fastest with outsourcing, followed by staff augmentation and then consulting services.
  • Technologies – Everything from choosing the right technologies to building the right technology capabilities is crucial for success, and each approach carries an inherent bias. Consulting service providers, unless expressly agnostic, usually favor a specific technology stack. Staff augmentation also tends to carry a bias based on the collective technology experience of the team members. Building an in-house team from scratch can reduce this bias, but it can also create a somewhat heterogeneous resource pool in terms of technology experience.
  • Processes – Unless your data science leaders have extensive experience of their own, consulting services are the way to go, as providers can package and tailor industry best practices to suit your business, marketplace, and industry.
  • Risk – Risk mitigation is a function of your talent, technology stack, processes, and the overall capabilities established as part of your Data CoE. The greater the collective experience and the diversity of business cases addressed, the lower the risk of failure.

Ultimately, the balance is not between risk and reward, but between cost and opportunity cost. When choosing how to approach your data lake and data science initiatives, assess the cost of going solo versus the cost of taking a partnership approach, and weigh it against the opportunity-cost variables listed above. That becomes your logical argument for or against either path.
