Applied Data Science - New standard in data-driven business

Putting predictive models into production demands a streamlined workflow and highly skilled data scientists and engineers. For the past few years, organizations have focused on developing strong proofs of concept and initial use cases. Now they must successfully introduce data-driven concepts into their daily operations. This year, data-driven companies will make productionizing data projects their priority. Productionizing will increase exposure in other parts of the company, so data science teams will turn to proven methods, like microservices and APIs, to streamline their process. Data governance will become a standard part of any data project. Big Data technology has matured and is mostly enterprise-ready, increasing the pressure on and demand for data scientists. Major tech players like Google, LinkedIn, and Facebook will continue to open-source their product innovations at an early stage. In 2017, the gap between textbook data scientists and seasoned practitioners will widen further. Companies that retain, foster, and attract people who can apply data science in production will gain a significant competitive advantage.

From PoC to Production

If we look beyond the big data and data science hype, organizations are expected to transition their business by leveraging predictive modeling. Products need to be maintainable, easily accessible, and scalable. Models used in production will require data governance.

Reports and interactive dashboards are not enough to affect the bottom line meaningfully. Successful proofs of concept need to be taken into production. This can only be achieved when the business is willing to accept the implementation of data products.

Becoming data-driven is all about corporate agility and reliance on predictions from an advanced analytics model. Changing the course of an enterprise is a major operation that starts with a clearly defined vision and support from management. When a product moves from the proof-of-concept (PoC) stage to production, daily operations are affected. The involvement of other parts of the organization requires stakeholders to grasp the potential impact. As my colleague Stijn Tonk, data scientist at GoDataDriven, says, “organizations will continue to change hierarchy by moving to more agile processes and multidisciplinary teams that focus on value for the end-customer”.

Streamlining Through Standardization

In the past few years many organizations have moved away from traditional data warehouses and databases to central data lakes. With a central data access point realized, in 2017 organizations will focus on standardization by wrapping data products in microservices and introducing standardized data workflows.

Getting the Organization Involved

Becoming data-driven is all about corporate agility and reliance on predictions from an advanced analytics model. Changing the course of an enterprise is a major operation that starts with a clearly defined vision and support from management (source: Big Data Survey 2016, www.bigdatasurvey.nl). When a product moves from the proof-of-concept (PoC) stage to production, daily operations are suddenly affected. The involvement of other parts of the organization requires stakeholders to grasp the potential impact.

Microservices

The microservices trend is starting to appear in data science, making models more easily accessible through APIs. Data scientists leverage containers to serve their models, so that all software required to run a model is available inside the container.
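To make the idea of a model behind an API concrete, here is a minimal sketch using only the Python standard library. The linear scorer, the feature names, the /predict path, and port 8080 are all assumptions made for illustration; a real service would load a trained model and would typically use a production-grade web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = {"age": 0.02, "visits": 0.1}
BIAS = -1.0


def predict(features: dict) -> float:
    """Score one observation; features that are missing count as 0."""
    return BIAS + sum(w * float(features.get(name, 0.0)) for name, w in WEIGHTS.items())


def handle_predict(body: bytes) -> tuple:
    """Turn a raw JSON request body into an (HTTP status, response dict) pair."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object of feature values")
        return 200, {"score": predict(features)}
    except (ValueError, TypeError) as exc:
        # Malformed JSON and non-numeric feature values both end up here.
        return 400, {"error": str(exc)}


class PredictHandler(BaseHTTPRequestHandler):
    """Expose the model on POST /predict (path chosen for this sketch)."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Start the service; inside a container this would be the entry point."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling serve() starts the service, after which any client that can send a JSON POST request can obtain a score without knowing how the model is implemented. Keeping handle_predict separate from the HTTP handler also lets the scoring logic be tested without starting a server.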

Data workflows

Reproducible science and, thus, reproducible models, demand:

a standardized data science workflow, from data mining to model evaluation and monitoring.
workflow managers and schedulers, such as Oozie and Airflow, to replace scripts and manual processes, eradicating major error causes and improving model optimization.
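What a workflow manager adds over a collection of scripts can be sketched in a few lines: tasks declare what they depend on, and the manager works out the execution order. The sketch below uses the standard-library graphlib module (Python 3.9+) and is not the API of Oozie or Airflow; the task names and the toy "model" (a simple mean) are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run every task once, only after all tasks it depends on have run.

    tasks:        dict mapping task name -> callable taking a shared context dict
    dependencies: dict mapping task name -> set of upstream task names
    """
    context = {}
    order = []
    # static_order() yields upstream tasks first and raises CycleError on cycles.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name](context)
        order.append(name)
    return order, context


# Hypothetical three-step workflow: extract data, fit a model, evaluate it.
def extract(ctx):
    ctx["rows"] = [1.0, 2.0, 3.0, 4.0]


def train(ctx):
    ctx["model"] = sum(ctx["rows"]) / len(ctx["rows"])  # toy "model": the mean


def evaluate(ctx):
    errors = [abs(row - ctx["model"]) for row in ctx["rows"]]
    ctx["mae"] = sum(errors) / len(errors)  # mean absolute error


TASKS = {"extract": extract, "train": train, "evaluate": evaluate}
DEPENDENCIES = {"extract": set(), "train": {"extract"}, "evaluate": {"train"}}

order, context = run_workflow(TASKS, DEPENDENCIES)
print(order, context["mae"])
```

Because the dependencies are declared rather than implied by the order in which someone happens to run scripts, the same workflow can be rerun, scheduled, and audited. Real workflow managers add retries, scheduling, and monitoring on top of this basic idea.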

Governance

When data products are used across the business, more people get access to the data within the models as well. The bigger the impact of a data product on the business, the greater the exposure of its data within your organization. It is no surprise then that data governance is one of the major topics nowadays.

Data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise. It ensures that data can be trusted and that people can be held accountable for any adverse event caused by low data quality. It puts people in charge of fixing and preventing issues with data so that the enterprise can become more efficient. Who has access to data? What modifications have been made to it? Organizations must ask these two questions when taking a data product into production. What will happen, for example, when an extra digit is added to the IBAN? Where do changes need to be made?
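The two questions above, who has access and what has been modified, can be illustrated with a small audit trail. This is a minimal sketch of the idea, not how Atlas, Cloudera Navigator, or any other governance product works; the class name, the reader and writer sets, and the log fields are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedDataset:
    """Hypothetical dataset wrapper that records every read and write attempt."""

    name: str
    readers: set  # users allowed to read
    writers: set  # users allowed to modify
    records: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def _log(self, user, action, key, allowed):
        # Denied attempts are logged too, so the trail answers "who tried?".
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "key": key,
            "allowed": allowed,
        })

    def read(self, user, key):
        allowed = user in self.readers
        self._log(user, "read", key, allowed)
        if not allowed:
            raise PermissionError(f"{user} may not read {self.name}")
        return self.records.get(key)

    def write(self, user, key, value):
        allowed = user in self.writers
        self._log(user, "write", key, allowed)
        if not allowed:
            raise PermissionError(f"{user} may not modify {self.name}")
        self.records[key] = value
```

With every access going through one place, answering "who changed this record?" becomes a query on audit_log rather than an investigation across teams.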

Innovation in governance is coming from many sides and there is no clear winner within the Hadoop ecosystem. Atlas and Cloudera Navigator are the best-known solutions; a notable recent addition to the landscape is LinkedIn’s WhereHows.

Applied Data Science

Many data scientists have learned the theories of data science by attending courses or reading books. These learning methods provide little real-world experience with the practical constraints and challenges that are inevitable in its application.

With growing business requirements, data science has matured and professional data science teams must familiarize themselves with its multiple facets, including:

modelling
experimentation
data pipelines
programming
infrastructures
architectures

The knowledge gap between self-proclaimed data scientists and experienced practitioners will manifest itself more clearly. More in-depth and practical training is needed to narrow this gap.

The demand for senior engineers and data scientists will increase dramatically, as these roles are vital for the development of robust pipelines and continuous improvements of production models.

Organizations need to rethink their benefit plans. Management usually leverages traditional benefits, like compensation and company cars. But the curious nature of data scientists compels them to look for less tangible benefits when seeking a job, such as organizational transparency, room to experiment and team members who possess greater skills and knowledge (source: Big Data Survey 2016, www.bigdatasurvey.nl).

Scaling knowledge

As Vincent Warmerdam, Data Scientist at GoDataDriven, says, "too often, innovation is slowed by reinventing the wheel and repeating experiments that weren’t documented in a central repository. Both data and experience should be made centrally available. This requires a solid knowledge sharing infrastructure within the organization that goes beyond a simple wiki or SharePoint site".

Technology

While organizational challenges increase, in 2017 technological challenges will become easier to overcome. Platforms and models are being commoditized every day. Many innovative technologies are open-sourced and made accessible within cloud platforms, making these innovations available to deploy on the fly.

Cloud

Now that the cloud is perceived as stable and secure by most organizations, this may well be the year the cloud becomes the platform of choice for most of them. In addition to well-established cloud offerings like Amazon and Azure, the elephant in the cloud space is Google Cloud. The Mountain View giant’s offering has matured massively around price, quality, and performance. BigQuery, Dataflow, and Dataproc are just a few of the impressive technologies that embody the zero-ops vision.

Making models scale

The real-time trend will push systems to new boundaries. Do your algorithms run daily? Or possibly even hourly? The future online world will need algorithms that can change every time a user clicks a button. Only algorithms that can learn this fast are truly real-time algorithms. Data scientists will move from Lambda architectures (with one layer for speed and one for batch processing) to Kappa architectures, in which a single processing pipeline serves both batch and real-time processing. Apache Flink is bound to become the framework of choice in this field.
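An algorithm that "changes every time a user clicks a button" is, in practice, one that learns online: it updates its parameters from each event as it arrives instead of being retrained in a nightly batch. The sketch below shows this with a small logistic regression trained by stochastic gradient descent in plain Python. The class name, the two features, and the learning rate are assumptions made for illustration, and the example has nothing to do with Apache Flink's API.

```python
import math


class OnlineLogisticRegression:
    """Click model that is updated one event at a time (illustrative sketch)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Estimated probability of a click for feature vector x."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, clicked):
        """One gradient step on the log-loss for a single (x, clicked) event."""
        error = clicked - self.predict_proba(x)
        self.weights = [
            w + self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias += self.learning_rate * error


# Hypothetical event stream: users shown variant A ([1, 0]) click,
# users shown variant B ([0, 1]) do not.
model = OnlineLogisticRegression(n_features=2)
for _ in range(200):
    model.update([1.0, 0.0], clicked=1)
    model.update([0.0, 1.0], clicked=0)

print(model.predict_proba([1.0, 0.0]), model.predict_proba([0.0, 1.0]))
```

The same update method can be fed from a replayed event log or from the live stream, which is the property a Kappa-style setup relies on: one code path for historical and real-time data.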

Is it too late for the Late Majority to catch up?

For organizations that are new to experimenting with data, keeping up becomes increasingly difficult. Whereas the early adopters had room to play around in a developing field, in 2017 the late majority will find itself competing with those that have a powerful competitive advantage.

The experimenters have already successfully launched data-driven products. They have the right technological foundation, skilled people, and processes in place. They’ve automated their platform, workflow, and modeling and are ready to go into production.

Creating more value and an increasing number of models every day means a substantial head start. But the race isn’t over yet.

New View on Benefits

Data scientists value transparency and the skills of their colleagues over salary (source: Big Data Survey 2016, www.bigdatasurvey.nl).

This article is part of the Urgent Future IT Forecast 2017.
