To give analysts access to data, you can use Presto, a distributed query engine that lets you run SQL queries against Big Data. It is well suited to ad hoc analytics tasks.
Running Presto on Kubernetes allows you to take advantage of all the autoscaling flexibility that is difficult to implement in a classic deployment. At rest, Presto consumes a minimum of resources and does not require a powerful cluster to run. But when analysts start sending a lot of requests, the load grows. An autoscaling Kubernetes cluster will allocate the required amount of capacity, and this will allow all analysts to work simultaneously without having to compete for resources. When the load subsides, unnecessary resources will automatically return to the cloud.
Starburst currently provides both a Presto Operator and a Presto Helm chart, so you can deploy Presto to Kubernetes quickly.
Presto and Hive Metastore: Presto can read data from different sources, such as Hadoop and PostgreSQL, and run queries across the combined data. If you store data in S3, Presto works with it through the Hive Metastore, which represents the data in S3 not just as a set of files but as a set of tables. Analysts do not need to know which S3 bucket the data is stored in: that mapping lives in the Hive Metastore. They can access the data with SQL and work in the usual way.
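As a rough illustration of how files in S3 become a table, the sketch below builds a `CREATE TABLE` statement for Presto's Hive connector. The catalog name `hive`, the schema, the table, the columns, and the bucket path are all hypothetical; `external_location` and `format` are table properties of the Hive connector, but check the exact syntax against the Presto version you run.

```python
def external_table_ddl(catalog, schema, table, columns, s3_location, file_format="PARQUET"):
    """Build a Presto CREATE TABLE statement for data that already sits in S3.

    `columns` is a list of (name, sql_type) pairs. All identifiers passed in
    are illustrative; nothing here talks to a real cluster.
    """
    column_sql = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {catalog}.{schema}.{table} (\n"
        f"    {column_sql}\n"
        f")\n"
        f"WITH (\n"
        f"    external_location = '{s3_location}',\n"
        f"    format = '{file_format}'\n"
        f")"
    )


# Hypothetical table registered in the Hive Metastore on top of an S3 prefix.
ddl = external_table_ddl(
    catalog="hive",
    schema="analytics",
    table="page_views",
    columns=[("user_id", "BIGINT"), ("url", "VARCHAR"), ("viewed_at", "TIMESTAMP")],
    s3_location="s3a://example-bucket/page_views/",
)
print(ddl)
```

Once such a table is registered, an analyst can query `hive.analytics.page_views` with plain SQL and never needs to reference the bucket path.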
To use the collected data for BI analytics, you need to present it as charts, dashboards, and other easily understood forms. Superset is suitable for this: it is a business intelligence tool for exploring and visualizing data, an open-source analog of Tableau. Superset is also flexible and cloud-native, and it can use various services as a backend.
Out of the box, it supports integration with Presto, Greenplum, Hadoop, and many other systems. It ships with many ready-made visualizations and also provides tools for creating your own. If you integrate it with Presto, you can work with data in S3 using Superset as the SQL IDE. Metabase is an alternative to Superset that can also be run in Kubernetes.
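Superset connects to databases through SQLAlchemy URIs, so the Presto integration mostly comes down to supplying one connection string. The helper below is a minimal sketch of how such a URI might be assembled; the function name, host, port, catalog, schema, and user are assumptions made for illustration, not values from any real deployment.

```python
from urllib.parse import quote_plus


def presto_sqlalchemy_uri(host, port, catalog, schema=None, user=None, password=None):
    """Assemble a presto:// SQLAlchemy URI (hypothetical helper, not a Superset API)."""
    credentials = ""
    if user:
        # Escape characters that would otherwise break the URI.
        credentials = quote_plus(user)
        if password:
            credentials += ":" + quote_plus(password)
        credentials += "@"
    uri = f"presto://{credentials}{host}:{port}/{catalog}"
    if schema:
        uri += f"/{schema}"
    return uri


# Illustrative values only.
uri = presto_sqlalchemy_uri(
    "presto.example.internal", 8080, "hive", schema="analytics", user="analyst"
)
print(uri)
```

The resulting string is what you would paste into the database connection form in Superset.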
The benefit of running Superset on Kubernetes: Superset is designed for high availability. It is a Cloud-Native tool that scales well in large distributed environments and can serve several hundred users simultaneously.
To populate the data warehouse, you need an orchestrator or workflow management platform. It allows you to create a schedule for tasks and indicate the sequence of their launch, depending on the result of the previous task. Now the de facto standard in this area is Airflow, a platform for developing, planning, and monitoring data processing flows.
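To make the idea of launching tasks in sequence, depending on the result of the previous task, more concrete, here is a small sketch in plain Python. It is not Airflow code and does not use the Airflow API: the task names and the `run_pipeline` helper are hypothetical, and the ordering relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, upstream):
    """Run tasks in dependency order and skip any task whose upstream did not succeed.

    tasks:    mapping of task name -> callable with no arguments
    upstream: mapping of task name -> set of task names it depends on
    """
    status = {}
    for name in TopologicalSorter(upstream).static_order():
        parents = upstream.get(name, ())
        if any(status[parent] != "success" for parent in parents):
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


def extract():
    pass


def transform():
    raise ValueError("bad input")  # simulate a failing step


def load():
    pass


def audit():
    pass


tasks = {"extract": extract, "transform": transform, "load": load, "audit": audit}
upstream = {"transform": {"extract"}, "load": {"transform"}, "audit": {"extract"}}

status = run_pipeline(tasks, upstream)
print(status)
```

Here `load` is skipped because `transform` failed, while `audit` still runs because it depends only on `extract`. An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this kind of dependency handling.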
The benefits of running Airflow on Kubernetes are the same as for the other tools: flexible scaling and sandboxing.
By default, Airflow running on Kubernetes will store logs in temporary storage. To keep the logs always available, you need to connect persistent storage, for example, S3. This applies to all tools that run on Kubernetes.
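One way to point Airflow's logs at S3 is through environment variables on the Airflow pods. The sketch below only builds that set of variables. It assumes Airflow 2.x, where the remote logging options live in the `[logging]` configuration section, and the bucket name, prefix, and connection ID are hypothetical, so verify the option names against the Airflow version you deploy.

```python
def airflow_remote_logging_env(bucket, prefix="airflow/logs", conn_id="aws_default"):
    """Return environment variables that switch Airflow task logs to S3.

    Assumes the Airflow 2.x [logging] options; bucket, prefix, and conn_id
    are placeholders to be replaced with real values.
    """
    return {
        "AIRFLOW__LOGGING__REMOTE_LOGGING": "True",
        "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER": f"s3://{bucket}/{prefix.strip('/')}",
        "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID": conn_id,
    }


env = airflow_remote_logging_env("example-airflow-logs")
for key, value in env.items():
    print(f"{key}={value}")
```

These key-value pairs can then be added to the container environment in your Helm values or pod specification.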
Next, let’s talk about the Data Discovery problem. Let’s say your storage has grown, and there are already thousands of tables in it. When a new analyst joins a project, they need to get acquainted with all this data and understand what is stored where. Often this is solved through personal communication: the newcomer asks colleagues for help. This takes a long time, and it distracts specialists from their main work.
Amundsen, an open-source platform, addresses this problem. It has a UI that lets users find and explore data easily. You can fill Amundsen with metadata manually, or automatically if you integrate the tool with Airflow. You can also collect statistics on tables, search the catalog, set tags, and specify the owner of the data, the type of table, and so on. This significantly increases the productivity and efficiency of data warehouse use and helps democratize access to data.
To train models and conduct experiments in Big Data, JupyterHub is often used; this is also an industry standard.
It is important to deploy machine learning models quickly in production; otherwise, the data will become outdated, and there will be problems with the reproducibility of experiments. But sometimes, the process is structured in such a way that it takes a long time to transfer models from Data Scientist to Data Engineer.
MLOps helps to cope with this problem. It is an approach that standardizes the development of machine learning models and reduces the time it takes to roll them out to production. With its help, new models reach production quickly and begin to benefit the business. But to apply this approach, you need special tools.
One such tool is Kubeflow, a machine learning and Data Science platform. Kubeflow includes JupyterHub, so you don’t have to deploy it separately. It also helps solve the problems of tracking experiments, models and artefacts. Plus, Kubeflow allows you to bring models into production in a few minutes and make them available as a service.
Note: we’ll talk more about MLOps and Kubernetes in a separate article. In it, we create a Kubernetes cluster, deploy Kubeflow in it, train and publish the model.
We also hosted a webinar on MLflow. There is a video and a repository with instructions.
Advantages of running Kubeflow on Kubernetes: Kubeflow was created specifically for Kubernetes, so it cannot really be run without it. It makes more sense to list the advantages of Kubeflow over non-Kubernetes counterparts: fast publishing of models, orchestration of complex pipelines, and a convenient UI for managing experiments and monitoring models.
How to run Kubeflow in Kubernetes: there is a detailed instruction on the official website. Alternatively, you can make your life easier by deploying Kubeflow to cloud Kubernetes using this tutorial.
But it is worth considering that Kubeflow is still under active development, so it remains somewhat raw. There is an alternative, MLflow, which is a more stable platform, but it works with Kubernetes only in experimental mode. Comparing the two, Kubeflow scales better, offers more functionality, and is the more promising of the two. MLflow is easier to use and more mature as a product, so it is suitable for industrial use. However, it does not have the same breadth of functionality as Kubeflow (for example, MLflow does not have a built-in JupyterHub).