Ad Hoc Data Access -- the IT Balancing Act
Watch this video to learn why providing ad hoc data access for business users is a great idea -- in theory. But providing it with limited resources is a balancing act that can be hard to pull off.
While adopting horizontally scalable data storage with a BI front end gives IT the infrastructure to handle increasing user loads, it doesn’t necessarily solve other issues.
For example, in high-variety, big data environments, users need a self-service way to explore data without knowing the precise query they want to answer. But when many concurrent users explore data in this way, it can degrade response time.
I often get into conversations with solution architects who are looking to provide self-service access to data. They’ve got a user base that constantly complains: they can’t get the right report, and by the time they do get it, it’s no longer useful to them. So they’re looking for solutions -- options that get individuals the right data at the right time.
Adopting a Horizontally Scalable System
There’s also a lot of movement in the big data world and a lot of cool technology that is horizontally scalable. A common approach is, “Well, I’m going to get your data into this horizontally scalable system, stick a BI tool on top of it, and then you as a user can bang away at it to your heart’s content.” The benefit is that you’re moving to a scalable platform: as your user loads increase, you can scale it out. The myth I’m going to do a little myth busting on is that while your infrastructure and your systems are scalable, your pocketbook usually isn’t.
Financial Constraints Can Limit Self-Service Access
You’re back to trying to maximize what you can do with the limited resources you have. How many ad hoc queries can the user community generate while still getting a satisfactory response time within those resources? There’s a balance to strike between the IT professionals who manage the users and the response times of their queries, and the business, which is looking to respond to business needs, get information at the moment it’s needed, and have it available 24/7.
Empower Users to Access and Explore Data
What do you do? What are your options? Really, you can look at two things. One is on the user side. There, it’s important to understand that to facilitate an ad hoc experience, you need to allow a user to start exploring data without any prior knowledge of the query they want to answer. They need to be able to look at a dataset and understand what’s there. That’s especially relevant when you’ve got a high variety of data, such as in big data environments. With high-variety data, your users aren’t going to understand what is available to them. They’re not going to know what question to ask of the data, because they don’t know what’s there or what answers could and should be formed from the data that exists.
The first thing is you need to provide quick access to the data and the ability to just look at what’s there and get the gist. The second is you need to allow users to explore the data. They need to be able to start with a query, a report, or a dashboard, and then move away from that context and switch up different variables. This has a pretty dramatic impact when you’re dealing with limited resources. If all users were accessing the same report, then on the backend you could cache that data, bring it into memory or some other performant layer, and manage the concurrency and use of it. But in the ad hoc paradigm, you don’t know where the user is going to go next. You constantly need a system that responds to queries efficiently. And if users are really working on large data volumes, that becomes a law-of-physics issue: you can’t have that many people looking through all the data at the same time without expending all your resources.
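The caching point above can be sketched in a few lines. This is a hypothetical illustration (the `run_report` function and the query strings are made up): a shared report hits an in-memory cache on every repeat request, while ad hoc exploration produces a different query each time, so the cache never helps.

```python
import functools

@functools.lru_cache(maxsize=128)
def run_report(query: str) -> str:
    # Stand-in for an expensive backend query.
    return f"results for {query!r}"

# Shared report: every call after the first is a cache hit.
for _ in range(3):
    run_report("SELECT region, SUM(sales) FROM orders GROUP BY region")

# Ad hoc exploration: each query is unique, so every call misses.
for region in ("east", "west", "north"):
    run_report(f"SELECT * FROM orders WHERE region = '{region}'")

info = run_report.cache_info()
print(info.hits, info.misses)  # 2 4
```

The shared report costs one backend query no matter how many users ask for it; the three ad hoc queries each pay full price.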
Techniques for Effectively Managing User Self-Service
There are a number of different techniques you’ll need to apply: managing resources, managing user query load, managing memory footprint, and examining queries to understand their impact on the system -- all while working with a number of technologies that may or may not be fully baked or mature, especially if you’re adopting some of the larger-scale, horizontally scalable systems.
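One of those techniques -- capping concurrent query load -- can be sketched with a counting semaphore. This is a minimal sketch under assumed numbers (the budget of four slots and the `run_query` stand-in are made up): a burst of queries is admitted a few at a time, so it cannot exhaust the backend.

```python
import threading
import time

MAX_CONCURRENT_QUERIES = 4  # assumed budget for illustration
query_slots = threading.BoundedSemaphore(MAX_CONCURRENT_QUERIES)

active = 0
peak = 0
lock = threading.Lock()

def run_query(query_id: int) -> None:
    """Stand-in for an ad hoc query; blocks until a slot is free."""
    global active, peak
    with query_slots:            # at most 4 queries run at once
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)         # simulate backend work
        with lock:
            active -= 1

# 16 users fire queries at once; only 4 ever run concurrently.
threads = [threading.Thread(target=run_query, args=(i,)) for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak)  # never exceeds MAX_CONCURRENT_QUERIES
```

Queries beyond the cap simply wait their turn, trading a little latency for predictable resource use.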
Data Sampling and Profiling User Needs
There are some things you can look at from this perspective. You can look at ways to sample data: give a user a sample, let them look at it, and then they can decide whether they’re good with the sample or whether they need to consume more resources to get to the 100 percent confidence level of what they need to know. You could also profile the users: intelligently understand what they’re trying to do, and predict their next path through the data based on their previous interactions with it.
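The sampling idea can be sketched simply: answer the exploratory question from a small random sample first, and only pay for the full scan if the user decides they need the exact figure. The dataset and the two-tier split here are made up for illustration:

```python
import random

random.seed(0)
# Made-up dataset standing in for a large fact table.
dataset = [random.gauss(100, 15) for _ in range(1_000_000)]

def sample_mean(data, n=1_000):
    """Cheap estimate from a random sample -- enough to get the gist."""
    return sum(random.sample(data, n)) / n

def full_mean(data):
    """Expensive full scan -- run only if the estimate isn't enough."""
    return sum(data) / len(data)

estimate = sample_mean(dataset)   # touches 0.1% of the rows
exact = full_mean(dataset)        # touches every row
print(round(estimate, 1), round(exact, 1))
```

With 1,000 sampled rows out of a million, the estimate typically lands within a fraction of a unit of the exact mean -- often close enough for a user deciding where to explore next.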
Data Modeling and Data Caching Aren’t the Answer
What you’re not going to be able to do is model the data, especially in environments with high variety. And you’re not going to be able to cache all the data: you have limited resources, so you can’t possibly do that. You have to strike this balancing act while providing a consistent user experience in which the user isn’t blocked by spinning wheels or hourglasses. If they’re going to fetch a cup of coffee before they can ask their next question, that’s not ad hoc analysis. To do ad hoc analysis, the user has to be able to go from one item of data to the next within three to five seconds, so they don’t lose their attention span and can keep the single train of thought in which they actually formulate a question. We’re dealing with exploration, not a set of pre-canned questions.
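That three-to-five-second window can be treated as an explicit budget. A minimal sketch, assuming hypothetical `exact_answer` and `approximate_answer` functions: race the exact query against a deadline, and fall back to an approximate (for example, sample-based) answer so the user is never left watching a spinner.

```python
import concurrent.futures
import time

# Shortened stand-in for the 3-5 second attention window.
QUERY_BUDGET_SECONDS = 0.5

def exact_answer() -> str:
    time.sleep(2.0)          # simulate a slow full scan
    return "exact"

def approximate_answer() -> str:
    return "approximate"     # e.g. computed from a pre-built sample

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(exact_answer)
    try:
        # Wait only as long as the interactivity budget allows.
        result = future.result(timeout=QUERY_BUDGET_SECONDS)
    except concurrent.futures.TimeoutError:
        result = approximate_answer()

print(result)  # approximate
```

The user gets *an* answer inside the budget and keeps their train of thought; the exact figure can still be delivered later if they ask for it.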