I’m thrilled to sit down with Rupert Marais, our in-house security specialist with a deep background in endpoint and device security, cybersecurity strategies, and network management. With years of experience in navigating the evolving landscape of cyber threats, Rupert has a unique perspective on how data and AI are transforming threat hunting. In this conversation, we dive into the critical role of a robust data platform in powering AI-driven security, the challenges of siloed systems, the power of data correlation in uncovering sophisticated attacks, and the strategic approaches organizations should adopt to stay ahead of adversaries.
Can you walk us through why a strong data platform is so essential for AI-driven threat hunting?
Absolutely. A strong data platform is the backbone of any effective AI-driven threat hunting effort. AI models are only as good as the data they’re trained on and operate with. If your data is fragmented, incomplete, or poorly structured, the AI will struggle to identify real threats or will produce false positives. Think of it like trying to solve a puzzle with half the pieces missing—you’re just guessing. A robust platform ensures that data from across your environment, whether it’s endpoint, cloud, or identity systems, is unified and accessible, giving AI the full picture it needs to detect anomalies and patterns that indicate malicious activity.
How do siloed systems impact the effectiveness of threat hunting efforts?
Siloed systems are a major roadblock. When data is trapped in separate buckets—like endpoint logs in one place and cloud activity in another—security teams and AI tools can’t see the connections between events. For example, a compromised device might show unusual behavior in endpoint logs, but without correlating that with cloud access logs, you might miss that the same attacker is escalating privileges in your AWS environment. This fragmentation slows down investigations and leaves gaps that sophisticated attackers exploit. It’s like trying to track a criminal with only half the surveillance footage—you’re always a step behind.
What kind of difference does a unified data platform make in spotting threats?
A unified data platform is a game-changer. When all your data is brought together and correlated, you start seeing patterns that were invisible before. For instance, a user downloading a large number of files might not raise a red flag on its own. But if that same identity is also creating public cloud storage buckets and cloning repositories to a personal device, the behavior becomes clearly malicious. Unifying data reduces noise and lets both analysts and AI focus on real threats rather than chasing disconnected signals. It’s about creating a single source of truth that reveals the full story.
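The pattern Rupert describes, where individually benign events become suspicious only in combination, can be sketched as a simple correlation by identity. This is an illustrative toy, not any vendor's implementation; the event fields and action names are assumptions chosen to mirror his example.

```python
from collections import defaultdict

# Hypothetical event records; field and action names are illustrative assumptions.
events = [
    {"identity": "jdoe",   "source": "endpoint", "action": "bulk_file_download"},
    {"identity": "jdoe",   "source": "cloud",    "action": "create_public_bucket"},
    {"identity": "jdoe",   "source": "endpoint", "action": "clone_repo_to_personal_device"},
    {"identity": "asmith", "source": "endpoint", "action": "bulk_file_download"},
]

# Signals that are weak alone but clearly malicious in combination.
EXFIL_PATTERN = {"bulk_file_download", "create_public_bucket",
                 "clone_repo_to_personal_device"}

def flag_exfil_candidates(events):
    """Group events by identity and flag identities matching the full pattern."""
    by_identity = defaultdict(set)
    for e in events:
        by_identity[e["identity"]].add(e["action"])
    return [ident for ident, actions in by_identity.items()
            if EXFIL_PATTERN <= actions]

print(flag_exfil_candidates(events))  # only 'jdoe' trips all three signals
```

Note that `asmith` also downloaded files but is never flagged: without unified data, each of jdoe's three actions would look just as innocuous.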
Why is correlating data across different systems so crucial for identifying sophisticated attacks?
Correlation is everything when it comes to sophisticated attacks because modern threats often span multiple systems. Attackers don’t stay in one place—they move laterally, using stolen credentials or short-lived tokens to jump from endpoints to cloud environments to internal databases. By correlating data from logs, configurations, and identity systems, you can spot this movement in real time. For example, endpoint logs might show a local compromise, but without tying that to IAM role changes in the cloud, you can’t grasp the full scope of the breach. Correlation turns scattered clues into a clear timeline of an attack, speeding up detection and response.
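Turning "scattered clues into a clear timeline" amounts to merging events from separate systems into one chronological view. The sketch below assumes simplified log records with ISO timestamps; real logs from endpoint, identity, and cloud systems would need normalization first.

```python
from datetime import datetime

# Illustrative log fragments from three systems (field names assumed).
endpoint_logs = [{"ts": "2024-05-01T09:12:00", "event": "suspicious process on host-7"}]
identity_logs = [{"ts": "2024-05-01T09:20:00", "event": "IAM role attached to svc-account"}]
cloud_logs    = [{"ts": "2024-05-01T09:25:00", "event": "database snapshot exported"}]

def build_timeline(*sources):
    """Merge events from multiple systems into one chronological attack timeline."""
    merged = [e for src in sources for e in src]
    return sorted(merged, key=lambda e: datetime.fromisoformat(e["ts"]))

for e in build_timeline(endpoint_logs, identity_logs, cloud_logs):
    print(e["ts"], "-", e["event"])
```

Read in isolation, each source shows one odd event; merged, the sequence from local compromise to IAM change to data export is the lateral movement Rupert describes.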
Can you share a real-world example where data correlation across platforms was key to understanding an attack?
Sure, take the Salesloft/Drift breach as a prime example. Attackers initially gained access through a compromised GitHub account, then leveraged OAuth tokens in Drift’s AWS environment to infiltrate connected customer environments via a trusted integration with Salesforce. On their own, logs from each platform might have looked benign or unrelated. It was only when forensic teams correlated activity across GitHub, identity, and cloud systems that the full attack path became clear. Without that cross-platform correlation, the breach’s scale and impact would’ve been much harder to uncover, letting the attackers linger undetected for longer.
How does the quality of data influence the performance of AI tools in threat hunting?
Data quality directly determines whether AI tools give you reliable insights or just wild guesses. High-quality, well-structured data means AI can produce deterministic answers—clear, actionable findings based on real evidence. Poor data, on the other hand, leads to probabilistic outputs, where the AI is essentially making educated guesses that often miss the mark. I’ve seen cases where improving data quality had a bigger impact on threat hunting outcomes than any tweak to the AI model itself. Garbage in, garbage out—it’s a cliché for a reason. If you want AI to perform, you’ve got to feed it clean, complete, and correlated data.
What’s your take on how organizations should approach data storage to support effective threat hunting?
Organizations need to be strategic about data storage. Not every piece of data needs to be instantly accessible—there’s a big difference between hot and cold storage. High-value telemetry, like identity changes or cloud configuration updates, should be in hot storage for quick querying during an incident. Less critical or historical data can sit in cold storage for forensic analysis later. This approach keeps response times fast without ballooning costs. Prioritizing the right data also means you’re not wasting resources on low-signal noise, so both analysts and AI can focus on what matters most when a threat emerges.
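The hot/cold split Rupert recommends can be expressed as a routing policy on telemetry type. The telemetry categories below are assumptions standing in for whatever an organization judges high-value; only the tiering idea itself comes from the discussion.

```python
# Tiering policy sketch: which telemetry types warrant hot (fast-query) storage.
# The category names are hypothetical examples.
HOT_TELEMETRY = {"identity_change", "cloud_config_update", "auth_failure"}

def storage_tier(record):
    """Route high-value telemetry to hot storage, everything else to cold."""
    return "hot" if record["type"] in HOT_TELEMETRY else "cold"

records = [
    {"type": "identity_change", "detail": "new admin role granted"},
    {"type": "dns_query",       "detail": "routine lookup"},
]
print([storage_tier(r) for r in records])  # ['hot', 'cold']
```

The design choice is that the policy lives in one place: as incident responders learn which signals matter during investigations, promoting a telemetry type to hot storage is a one-line change.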
How does a well-designed data pipeline enhance the capabilities of AI tools like large language models in security operations?
A well-designed data pipeline is like giving AI tools a clear roadmap. Large language models, for instance, thrive when they’re fed the right data with the right context. If the pipeline pre-transforms and correlates information, the AI doesn’t waste energy trying to figure out structure or relevance—it can focus on reasoning through potential threats. Without that, you’re either drowning the model in irrelevant details or starving it of critical facts, both of which lead to poor results. A good pipeline ensures the AI gets just enough context to be effective, much like how a good analyst needs curated information to make quick decisions during a crisis.
What’s your forecast for the future of AI and data platforms in cybersecurity?
I believe we’re heading toward a future where AI and data platforms become even more tightly integrated in cybersecurity. As threats grow faster and more complex, the organizations that will stay ahead are those that invest in unified, high-fidelity data platforms to power their AI tools. We’ll likely see more advancements in automated context engineering, where data pipelines dynamically adjust to feed AI the most relevant information in real time. This will make threat hunting not just reactive but truly proactive, turning uncertainty into actionable understanding. The gap between those who master their data and those who don’t will widen, and I think that’s where the real battleground of cybersecurity will be in the coming years.
