Introducing Virtualitics Smart Sampling: Get the Right Sample Every Time
Written by Amanda Derrick
Mar 22, 2023 9:00:00 AM
Gathering a sample from your dataset that properly represents the nuances in the data is challenging for anyone, even seasoned data scientists. Since it’s likely that the data you need to explore is massive, sampling can’t be avoided. But common shortcuts (taking the first and last rows, a time-bound slice, or even the classic random sample) can introduce bias into your exploration. With our latest release, we’ve introduced a new Smart Sampling application that streamlines data sampling while still allowing you to tailor the approach to suit your needs.
Sampling Challenges
While a random sample is serviceable, it’s not ideal for every situation or dataset. Data scientists will be the first to tell you that creating a data sample beyond the classic random sample requires special skills. There’s also processing time to consider. The time, effort, and skill required all pose a barrier to proper sampling by data scientists and analysts alike. Even platforms that promise to sample for you take shortcuts, like taking the first and last few hundred rows.
But there is a significant risk in using a random sample to determine your business strategy. Your sample might miss or misrepresent what’s actually happening in the data. Anomalies are easily missed in a big dataset, and relationships and dependencies can become skewed if the sampling technique is wrong. This leads to missed insights or misdirection, which means poor strategic choices and projects that won’t deliver the results you’re looking for. Even for those who do have the skill, sampling big data correctly takes time and attention to detail. When teams don’t have the time and resources they need, they take shortcuts. This means your organization’s carefully curated data is not being used responsibly or efficiently.
We’ve made sampling smarter so that users can get recommended samples without spending crazy amounts of time and resources. Virtualitics’ patent-pending collection of AI-based data refinement applications lets you extract the right data sample for your analysis, without coding and in a fraction of the time. Our highly scalable application lets you select from our sampling algorithms (including the classic Random Sample) to generate a sample that will meet your requirements.
Make smarter decisions with Virtualitics Smart Sampling
The right sampling algorithm depends on the type of analytics task being completed. For some projects and datasets, taking a simple random sample might do the trick. But in other scenarios, you can leverage one of our patent-pending sampling algorithms. Virtualitics’ flexible Smart Sampling application offers a number of different algorithms for data sampling, each using a variety of sophisticated techniques, including clustering, feature importance, and Kolmogorov-Smirnov (KS) tests, to help recommend samples for analysis.
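To make the KS test mentioned above more concrete, here is a minimal sketch in plain Python. It is not Virtualitics’ implementation. The function name ks_statistic and the synthetic column are assumptions made for illustration only. The two-sample KS statistic is the largest gap between two empirical distributions, so a value near 0 suggests the sample mirrors the full column and a value near 1 suggests it does not.

```python
import bisect
import random


def ks_statistic(sample, population):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    s, p = sorted(sample), sorted(population)
    gap = 0.0
    for x in set(s) | set(p):  # the empirical CDFs only change at observed values
        cdf_s = bisect.bisect_right(s, x) / len(s)
        cdf_p = bisect.bisect_right(p, x) / len(p)
        gap = max(gap, abs(cdf_s - cdf_p))
    return gap


rng = random.Random(0)

# Hypothetical column stored in sorted order, so row position carries information.
population = sorted(rng.expovariate(0.1) for _ in range(20_000))

head_sample = population[:500]               # the "first rows" shortcut
random_sample = rng.sample(population, 500)  # the classic random sample

head_ks = ks_statistic(head_sample, population)
random_ks = ks_statistic(random_sample, population)

print(f"first rows : KS statistic = {head_ks:.3f}")
print(f"random rows: KS statistic = {random_ks:.3f}")
```

In this toy setup the first-rows shortcut only ever sees the smallest values, so its KS statistic is far larger than the random sample’s.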
For example, to generate a sample whose points are spatially well distributed over the feature space, you could use Spatial Sampling. Alternatively, if you want your sample to preserve the relationships to a key performance indicator (KPI) found in your original dataset, and to contain a similar number of anomalous points, we recommend choosing our Validation Based Sampling. This algorithm iteratively tests many samples and selects the one that best matches the search criteria.
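The idea of testing many samples and keeping the best match can be sketched in a few lines of plain Python. This is an illustration of the general approach, not Virtualitics’ patent-pending algorithm. The names validation_based_sample and profile, the 3-sigma anomaly rule, and the scoring formula are all assumptions chosen to keep the example small.

```python
import random
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def profile(rows, lo, hi):
    """Return (feature-KPI correlation, share of KPI values outside [lo, hi])."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    outside = sum(1 for y in ys if y < lo or y > hi)
    return pearson(xs, ys), outside / len(ys)


def validation_based_sample(rows, size, n_candidates, rng):
    """Draw n_candidates random samples and return (best_sample, best_score)."""
    kpi = [y for _, y in rows]
    mu, sigma = mean(kpi), pstdev(kpi)
    lo, hi = mu - 3 * sigma, mu + 3 * sigma  # assumed anomaly rule: 3-sigma band
    target_corr, target_rate = profile(rows, lo, hi)

    best, best_score = None, float("inf")
    for _ in range(n_candidates):
        candidate = rng.sample(rows, size)
        corr, rate = profile(candidate, lo, hi)
        # Assumed mismatch score: how far the candidate drifts from the full data.
        score = abs(corr - target_corr) + abs(rate - target_rate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score


# Synthetic data: the KPI depends on one feature; about 1% of rows get a large spike.
rng = random.Random(42)
rows = []
for _ in range(5_000):
    x = rng.gauss(0, 1)
    y = 2 * x + rng.gauss(0, 1)
    if rng.random() < 0.01:
        y += 15
    rows.append((x, y))

best, best_score = validation_based_sample(rows, size=200, n_candidates=50, rng=rng)
print(f"selected {len(best)} rows, mismatch score = {best_score:.4f}")
```

Raising n_candidates trades extra compute for a sample that tracks the full dataset more closely, which is the trade-off any iterative selection of this kind has to manage.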
These are choices and methods that would take significant skill and time for data scientists to complete on their own. Our Smart Sampling technology means that more data scientists and data analysts are able to explore and find value in complex data, finally delivering the ROI that your data capture efforts deserve.
Anomaly Detection at scale
Anomaly detection can be a complex task and is made even more difficult when working with large datasets. However, data analysts and data scientists shouldn’t ignore anomalies just because they can be difficult to identify. Our Anomaly Detection application runs on your big datasets to identify anomalies and flag them for analysis. Our tools give you significant control over how you’d like to find anomalies—and more methods are planned for future releases. Users can leverage this application on its own, or with Filtering or the Smart Sampling application.