(soft tech music)
(dry marker writing) Hi, I’m Matt Maccaux and today I’m gonna talk about how HPE BlueData can help your organization accelerate the software development lifecycle for data science and advanced analytics and give you a path to hybrid cloud deployments.

So first, let’s talk about the different sets of users. There’s oftentimes a tension between the data science community and the data operations community. The data scientists are the folks requesting access to all of the data. They’re bringing their own tools and using libraries that have never been certified by the organization. And the operations team are the folks responsible for providing access to that data and spinning up environments, while trying to figure out how they can support cloud in the future. So that tension exists, and it’s up to the architecture team to solve it: to come up with a design, or an architecture, that can meet the requirements of the data science team, continue to provide that agility without breaking the operations team, and figure out how we can get to cloud.

So, what does good look like from that perspective? Well, the data scientists out there need access to tools, whether these are tools that have been certified by the operations team or tools that they’re bringing in from outside, whether that’s RStudio or R Shiny. They also need access to IDEs, Jupyter Notebooks, Zeppelin Notebooks, and you’ve got data engineers that need IDEs like IntelliJ if they’re developing in Java, or Ruby IDEs. The point is we can’t predict what tools and IDEs these operations teams will have to support. We also have code or models that these engineers and scientists are gonna be pulling out of repositories, whether that’s a Git repository or a set of data science tools and workbenches that are storing that code and those models. And then lastly, we’ve gotta figure out how we can give these different users access to data: giving them the ability to bring data in, potentially from outside, without corrupting our operational data stores.

Meanwhile, the data operations team is interested in things like: how often are these environments going to be refreshed? How often are you spinning up and down? What are the performance characteristics of these environments? How do we manage this across heterogeneous infrastructure? You’ve got GPUs here, but no GPUs there. How do we support that quickly, in an agile manner, for these users while also making sure we have the chargeback and showback that’s required so they pay as they go, whether we’re on-premises or in the public cloud?

Now, all of this, this request interface, should use existing IT tools, like a ServiceNow service catalog. Many of my customers use ServiceNow as a catalog to provide this interface to those data scientists. But whatever interface you happen to use should be metadata-driven. This metadata’s gonna feed information about software libraries, code libraries, and metadata stores to give that data catalog experience. And this metadata is also gonna be used to generate templates to drive automation. So when I say automation, I really mean CI/CD or DevOps. Everything that we do when we deploy environments should follow predictable, automated processes, so that if a data scientist creates a model and trains a model that gets pushed into production, when they need to refresh that model they can simply rebuild the environment following the existing automated processes.
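To make that template idea a little more concrete, here is a minimal sketch, in Python, of what a metadata-driven environment template might look like once a request comes out of the catalog and before it is handed to an automated pipeline. Every field name, registry path, and default below is hypothetical; it illustrates the general pattern of turning catalog metadata into a deployment payload, not any actual BlueData or ServiceNow API.

```python
# Hypothetical sketch: a metadata-driven environment template of the kind a
# request portal (for example, a ServiceNow catalog item) might capture and
# hand off to an automated CI/CD pipeline. All names are illustrative.

from dataclasses import dataclass, field

@dataclass
class EnvironmentTemplate:
    tenant: str                                     # which tenant the environment belongs to
    tools: list = field(default_factory=list)       # e.g. ["jupyter", "rstudio", "spark"]
    libraries: list = field(default_factory=list)   # packages to bake into the image
    datasets: list = field(default_factory=list)    # catalog entries the user selected (read-only)
    gpus: int = 0                                   # GPU count, if the tenant's hardware has them
    ttl_hours: int = 72                             # how long the environment lives before teardown

def render_deployment_request(template: EnvironmentTemplate) -> dict:
    """Turn the template's metadata into the payload an automation pipeline would consume."""
    return {
        "tenant": template.tenant,
        "images": [f"registry.internal/{tool}:latest" for tool in template.tools],
        "packages": template.libraries,
        "mounts": [{"dataset": d, "mode": "ro"} for d in template.datasets],
        "resources": {"gpus": template.gpus},
        "expires_in_hours": template.ttl_hours,
    }

if __name__ == "__main__":
    req = EnvironmentTemplate(
        tenant="data-science-exploration",
        tools=["jupyter", "spark"],
        libraries=["scikit-learn", "xgboost"],
        datasets=["sales_gold", "customer_master"],
        gpus=2,
    )
    print(render_deployment_request(req))
```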
Now, where things get interesting, and what we really have to solve for, is: “Okay, how do I deploy these tools and these IDEs across heterogeneous infrastructure?” Well, that’s where HPE BlueData comes into play. What we are doing under the covers is using Docker containers to take the tools that your data scientists and engineers require and wrapping them up in containers in an unmodified format. If they are also using tools that are Kubernetes-based, we will deploy those same tools side-by-side, managed through a single pane of glass, using your Kubernetes distribution. And this is all done in a multi-tenant manner. And when I say multi-tenant I don’t mean YARN. I mean software-defined networking to create true tenants that have a specific set of infrastructure requirements. So we can deploy, well, let’s say that this is my data engineering tenant, a full CDH cluster, that’s Cloudera. I’ve got a master node and a bunch of worker nodes. BlueData is deploying this using Docker containers, and Cloudera under the covers is none the wiser. Meanwhile, we probably have a data science tenant, and this is an exploratory tenant with no built-in quota. It can use as much infrastructure as needed because we’re model training. And so we’ve installed RStudio, Python, Jupyter Notebooks, and Spark. We’ve got a dozen Spark nodes that we’ve deployed using Docker containers, and we’ve allocated infrastructure resources under the covers, whether that’s GPUs, where we’ve installed the drivers and the libraries, or CPUs and memory. The last thing to note here, though, is that because this is using Docker containers and software-defined networking, if a data scientist does something poorly or we crash a container, well, we can just simply spin it up again, or add more containers and capacity. We can also restrict the amount of capacity that they use through our interface.

Now, the last point here is that we haven’t talked about data. We’ve talked about these environments. It’s not efficient for us to copy data into every single one of these tenant environments. I don’t wanna spin up a CDH cluster for every data science team and copy all that data in, because we’re talking hundreds of terabytes to petabytes of scale, and that’s not an efficient use of our resources. So what we wanna do here is look at our data lake estate, whether we have one data lake or many data lakes, and separate these logical lakes into two parts. The first part is the curated part. This is where we have our data flowing in, whether that’s ETL or ingest, whatever processes you use to get data into the lake. This is the Read Only data. This is my gold and my master data. It comes in through these curated processes, where we may be doing transformation, and we’re feeding that information back to the metadata catalog so that we can update the information here, in the request interface. Now, we said we don’t wanna copy the data, so what we need to do is give these users the ability to tap into the data based on what they selected when they made the request through this request portal. And of course, this is Read Only. We have to enforce that Read Only nature so that they don’t corrupt any of the information that is flowing into this lake, which means that the other logical partition of this data lake needs to be Read/Write. And we want to be able to give these different analytical users writable sandboxes that tap into that Read Only data, which is where they perform their work.
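The container pattern described here is easy to sketch in Python with the Docker SDK. The image name, tenant network, label values, and environment variables below are all assumptions for illustration; this shows the general idea of spinning worker containers up and down with resource caps on a tenant network, not BlueData’s internal implementation.

```python
# Illustrative only: launch a pool of Spark worker containers on a tenant's
# software-defined network with CPU/memory limits, and tear them down on demand.
# Image, network, and env var names are hypothetical placeholders.

import docker

client = docker.from_env()

TENANT_NETWORK = "ds-exploration-net"                 # assumed pre-created tenant network
WORKER_IMAGE = "registry.internal/spark-worker:3.4"   # hypothetical pre-built image

def scale_workers(count: int):
    """Launch `count` Spark worker containers with per-container resource caps."""
    workers = []
    for i in range(count):
        container = client.containers.run(
            WORKER_IMAGE,
            name=f"spark-worker-{i}",
            detach=True,
            network=TENANT_NETWORK,
            mem_limit="8g",               # cap memory per worker
            nano_cpus=4_000_000_000,      # 4 CPUs
            environment={"SPARK_MASTER_URL": "spark://spark-master:7077"},
            labels={"tenant": "data-science-exploration"},
        )
        workers.append(container)
    return workers

def teardown(tenant: str = "data-science-exploration"):
    """Remove every container belonging to the tenant, e.g. after a crash or when the work is done."""
    for c in client.containers.list(all=True, filters={"label": f"tenant={tenant}"}):
        c.remove(force=True)
```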
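And here is a minimal PySpark sketch of the two-zone pattern just described: read from the curated, Read Only part of the lake, and write results only into the tenant’s sandbox. The paths and dataset names are hypothetical placeholders, and the read-only enforcement itself is assumed to happen at the storage layer.

```python
# A minimal sketch of the curated-zone / sandbox pattern with PySpark.
# Paths and dataset names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sandbox-join").getOrCreate()

CURATED = "hdfs://datalake/curated"          # gold / master data, enforced read-only
SANDBOX = "hdfs://datalake/sandbox/ds-team"  # tenant's writable scratch space

# Tap into the curated data in place instead of copying it into the tenant
sales = spark.read.parquet(f"{CURATED}/sales_gold")
customers = spark.read.parquet(f"{CURATED}/customer_master")

# Do the exploratory join, and land the result only in the sandbox,
# never back into the curated zone
enriched = sales.join(customers, on="customer_id", how="left")
enriched.write.mode("overwrite").parquet(f"{SANDBOX}/enriched_sales")
```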
That sandbox is where they do the joins on this Read Only data from many sources. It’s maybe even where they bring data in from outside. And what’s important here is that while maintaining the integrity of the lake, we’re also giving them a scratch space that has full audit and traceability of the work that they are doing. And from an operations perspective, we have a timer, so that when the time is up, when this cluster is done, we are going to archive this information following the process that we’ve defined here, shut this cluster down, free up these resources, and make them available to someone else.

And it’s the BlueData software that provides this capability, tying into your existing data lakes, your existing metadata repositories, and your existing operational processes, whether that’s a full CI/CD pipeline or some DevOps, as well as integrating into your existing IT request interface. And so this gives the agility that the data scientist requires, spin up, spin down, use their tools, bring data in from outside, while maintaining the heterogeneous nature of the infrastructure so that we can independently scale compute and storage, as well as use tenants that are in the cloud. So maybe this is an EC2 tenant, or maybe it’s an Azure tenant or a GCP tenant. The BlueData software can operate across any infrastructure, including all three public clouds. So now you have a path to cloud in a way that is completely abstracted from those users. And we will leverage your existing investments in software and tools. We’re not replacing your Hadoop, your ingestion, your management, or your curation tools. Your data scientists get to use the tools that they wanna use, and the operations team has a single pane of glass to manage all of this. And so it’s HPE BlueData that provides that agility, leveraging your existing investments to accelerate the time to value that the business requires.

(soft tech music)

Learn more about HPE BlueData here.
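To round out the operations side, here is a hedged sketch of the timer-and-archive step described above: when an environment’s lease expires, archive its sandbox for audit purposes and reclaim the containers. The label names, sandbox paths, and archive step are all assumptions for illustration; real archival and teardown would follow whatever operational processes and CI/CD tooling you already have in place.

```python
# Hypothetical sketch of TTL-based teardown: archive each expired tenant's
# sandbox, then remove its containers to free the resources for someone else.
# Labels are assumed to carry an ISO-8601 UTC timestamp and a sandbox path.

import subprocess
from datetime import datetime, timezone

import docker

client = docker.from_env()

def expire_environments():
    """Archive and tear down any tenant containers whose 'expires_at' label has passed."""
    now = datetime.now(timezone.utc)
    for c in client.containers.list(filters={"label": "expires_at"}):
        expires_at = datetime.fromisoformat(c.labels["expires_at"])  # assumed timezone-aware
        if expires_at > now:
            continue  # still within its lease
        sandbox = c.labels.get("sandbox_path")  # e.g. hdfs://datalake/sandbox/ds-team
        if sandbox:
            # Keep a copy of the scratch space for audit and traceability before reclaiming it
            subprocess.run(
                ["hdfs", "dfs", "-cp", sandbox, f"{sandbox}-archive-{now:%Y%m%d}"],
                check=True,
            )
        c.remove(force=True)  # shut the cluster down and free the capacity
```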

Managing Data Operations for AI / ML, Data Science, and Analytics with BlueData