Without foundational governance, every AI deployment is a liability in disguise: Q&A with Jack Berkowitz of Securiti
We’ve heard that good data is the key to better LLMs and other AI systems. The big foundation models are trying to get more and more data to train on in order to improve their reasoning, accuracy, and abilities. But avoiding bad data is just as important in AI: using it can open you up to fines, lawsuits, and lost customers. We asked Jack Berkowitz, Chief Data Officer of Securiti, about the risks and rewards of strong data governance.
Ryan Donovan: Data for LLM training hasn’t always been gathered or cleared in the most IP-friendly way. What sort of blowback and breaches are we seeing now because of that?
Jack Berkowitz: We’re now seeing real-world consequences, from multimillion-dollar copyright lawsuits to forced takedowns of AI models. Organizations that trained models on unlicensed or misclassified data are now facing IP and copyright challenges. The risk is compounded when sensitive or proprietary data has been ingested without authorization, creating compliance violations that are difficult to contain once that data has been fed into a model. A lack of data governance not only undermines trust but also exposes organizations to legal and reputational risk. Without rigorous control over data sourcing and usage rights, the blowback is inevitable: lawsuits, potential fines, and exposure to data leaks.
RD: For data owners who may have had their data used in training without their consent, what rights and recourses do they have?
JB: Data owners can pursue action under existing laws like GDPR, CCPA, and CPRA, which already provide processes for contesting unauthorized data use, requesting transparency, and enforcing takedowns. For organizations, this means greater pressure to provide clear data tracking and audit trails.
As an example, in 2023, Meta was fined $1.3 billion by Ireland’s Data Protection Commission for unlawfully transferring EU user data to the U.S. without proper privacy safeguards under GDPR.
The thing is, the technology to prevent these problems already exists. Permissioning and consent management are well established. For most companies, however, consent is managed separately from the actual data used for training. Newer platforms like Securiti take a comprehensive approach, managing permissions alongside the data itself, so companies can stay compliant while still making progress with their data and AI programs.
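To make that idea concrete, here is a minimal sketch of what "permissions managed with the data" can look like in a training pipeline. The record fields, license values, and helper function are illustrative assumptions, not Securiti's actual API:

```python
# Minimal sketch (illustrative assumptions, not Securiti's API): keep consent
# and usage rights attached to each record so the training pipeline can filter
# on them directly, instead of consulting a separate consent system.
from dataclasses import dataclass, field

@dataclass
class Record:
    doc_id: str
    text: str
    consent_purposes: set = field(default_factory=set)  # purposes the data subject agreed to
    license: str = "unknown"                             # usage rights captured at ingestion

ALLOWED_LICENSES = {"internal", "licensed-for-ml", "cc-by"}

def eligible_for_training(r: Record) -> bool:
    """Admit a record only if model training is an explicitly consented purpose
    and its license permits that use."""
    return "model_training" in r.consent_purposes and r.license in ALLOWED_LICENSES

raw_records = [
    Record("a1", "support ticket ...", {"model_training", "analytics"}, "internal"),
    Record("b2", "scraped forum post ...", set(), "unknown"),
]
corpus = [r for r in raw_records if eligible_for_training(r)]
print([r.doc_id for r in corpus])   # -> ['a1']
```

Because the consent metadata travels with the record, a takedown or consent withdrawal can be enforced by the same filter that builds the corpus.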
RD: I’ve heard of watermarking or poisoning data (aka Nightshade) or setting honeypots against data scrapers. Are any of these effective defenses?
JB: These are not effective overall. Most are fairly easy to work around, and data poisoning can introduce just as many risks as it aims to prevent. Watermarks sound promising in theory, but models can often be trained to ignore them. In practice, none of these approaches offer reliable protection at scale.
RD: For organizations looking to use external data, whether in a published data set or sold by a broker, what can they do to verify that the data therein is safe to use for model training?
JB: For organizations using third-party data, it’s essential to verify that the data is scanned, managed, and controlled like any other data. This requires full visibility into data flows, authorizations, and permissions, along with documentation that confirms how the data can be used. Vendors should also provide legal guarantees, including monetary damages; if a vendor is not willing to do that, you should be suspicious.
Without this level of oversight, organizations risk introducing compliance gaps and legal liabilities into their models. Even well-intentioned firms have seen trained models pulled from production due to unclear sourcing.
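As a rough illustration of that oversight, the sketch below checks a vendor-supplied dataset manifest for the documentation and guarantees described above before the data is admitted into a pipeline. The field names and manifest shape are assumptions for the example, not a standard schema:

```python
# Illustrative sketch only (field names are assumptions): a pre-ingestion check
# that a third-party dataset ships with documented usage rights, provenance,
# an audit trail, and contractual guarantees before it can be used for training.
REQUIRED_FIELDS = {
    "license",           # usage rights granted by the vendor
    "provenance",        # where and how the data was collected
    "permitted_uses",    # e.g. ["model_training", "evaluation"]
    "audit_trail",       # link to the vendor's collection/consent records
    "indemnification",   # legal guarantee, including monetary damages
}

def vet_dataset(manifest: dict) -> list:
    """Return a list of problems; an empty list means the manifest passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in manifest]
    if "model_training" not in manifest.get("permitted_uses", []):
        problems.append("model training is not a permitted use")
    if not manifest.get("indemnification"):
        problems.append("vendor offers no indemnification")
    return problems

manifest = {
    "license": "commercial",
    "provenance": "vendor-collected web forms, 2022-2024",
    "permitted_uses": ["model_training"],
    "audit_trail": "https://vendor.example/audit/123",
    "indemnification": True,
}
print(vet_dataset(manifest) or "dataset cleared for ingestion")
```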
RD: For data that they do own, what are best practices for classifying and tracking data so as not to inadvertently leak sensitive internal data or PII?
JB: Start by discovering and classifying all data assets, including unstructured data, across cloud and on-premises systems. From there, enforce access controls and apply sanitization protocols to prevent sensitive data from entering AI pipelines unintentionally. Real-time monitoring of data flows and AI interactions is key to maintaining visibility, preventing leaks, and ensuring both security and compliance. With LLMs increasingly able to memorize sensitive prompts, the margin for error has disappeared.
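For a sense of what a sanitization step looks like in practice, here is a deliberately simple sketch that redacts a few obvious PII patterns before documents enter an AI pipeline and reports what it stripped for monitoring. The patterns and names are assumptions for illustration; a production system would rely on a proper classification or DLP engine rather than hand-rolled regexes:

```python
# Minimal illustrative sketch (patterns are assumptions): a regex-based
# sanitization pass that redacts obvious PII before text enters a pipeline
# and emits counts that can feed real-time monitoring.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def sanitize(text: str):
    """Redact matches and return the cleaned text plus per-type counts."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label} REDACTED]", text)
        counts[label] = n
    return text, counts

doc = "Contact Jane at jane.doe@example.com or 555-123-4567; SSN 123-45-6789."
clean, report = sanitize(doc)
print(clean)
print(report)   # counts of what was stripped, for the monitoring layer
```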