TL;DR: Your analysts want AI data tools. Your job is to evaluate whether those tools are safe to approve. This guide gives you a concrete framework: what questions to ask, what architectures exist, where the real risks are, and a comparison matrix of popular tools. No vendor fluff — just the evaluation criteria that matter.
The actual problem you're solving
Someone on the data team just asked to use an AI-powered analysis tool. Maybe it's ChatGPT, maybe it's a startup you've never heard of. They want to drag CSVs in and ask questions in plain English.
You need to answer: Can they use this with our data? Under what conditions?
This isn't a theoretical exercise. Analysts are already uploading CSVs to AI tools, often without asking. A 2025 survey by Cyberhaven found that 11% of the data employees paste into ChatGPT contains sensitive information. You're not preventing usage — you're trying to make safe usage the path of least resistance.
The five-question evaluation framework
Every AI data analysis tool, regardless of branding, falls somewhere on these five dimensions:
1. Data residency: Where does the data physically go?
This is the first and most important question. There are three patterns:
Pattern A: Full upload. The user's file is sent to the vendor's cloud infrastructure. The vendor's servers store it (at least temporarily) and run computations against it.
Pattern B: Server-proxied. The file stays on the user's machine, but data is sent to a server for processing in chunks or via API calls.
Pattern C: Client-side. The file never leaves the user's device. Computation happens in the browser (via WebAssembly) or in a local application.
What to ask the vendor:
- Where is my file stored after upload? Which cloud provider and region?
- Is data transmitted over the network at any point during analysis?
- Can I verify this claim independently (e.g., browser DevTools, network inspection)?
- Is there a self-hosted or on-premise option?
2. Model access: What does the AI see?
Even if data stays local, the AI component might have access to it. Three tiers:
Tier 1: Full data access. The AI model receives the entire dataset (or large portions of it) as context. This is how ChatGPT's Advanced Data Analysis works: the uploaded file sits in the model's sandbox, and its contents can be pulled into the model's context row by row.
Tier 2: Schema-only access. The AI receives column names, data types, and metadata — but never actual data values. It generates SQL or code that runs locally.
Tier 3: No AI. The user writes their own queries. No model involved.
What to ask the vendor:
- What exactly is included in the prompt sent to the AI model?
- Can you show me a sample prompt for a typical query?
- Is there an option to use local/on-device AI models?
- If schema metadata is sent, does it include sample values or just types?
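To make Tier 2 concrete, here is a minimal sketch (Python, standard library only) of what a schema-only prompt can look like: the tool reads column names, infers rough types from a small sample, and sends only that metadata to the model, never the values themselves. The prompt wording and the `infer_type` helper are illustrative, not any particular vendor's implementation.

```python
import csv
import io

def infer_type(values):
    """Crude type inference from sample values (illustrative only)."""
    for caster, name in ((int, "INTEGER"), (float, "DOUBLE")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            pass
    return "VARCHAR"

def schema_only_prompt(csv_text, table="uploaded", sample_rows=20):
    """Build an AI prompt containing column names and types, but no data values."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = [row for _, row in zip(range(sample_rows), reader)]
    cols = [(h, infer_type([r[i] for r in sample])) for i, h in enumerate(header)]
    schema = ", ".join(f"{name} {typ}" for name, typ in cols)
    return f"Table {table}({schema}). Write SQL to answer the user's question."

csv_text = "email,age,spend\nalice@example.com,34,120.50\nbob@example.com,29,80.00\n"
print(schema_only_prompt(csv_text))
# The prompt names the columns and types; no email address or number from the file appears.
```

When you ask a vendor for a sample prompt (question 2 above), this is the shape you want to see: structure in, values out.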
3. Data retention: How long is data kept, and by whom?
What to ask the vendor:
- What is the data retention policy for uploaded files?
- What is the retention policy for AI prompts and responses?
- Is data used to train or fine-tune AI models? (Check both the vendor's policy and the upstream AI provider's policy.)
- Can data be deleted on demand? Is deletion verifiable?
- What happens to data if the vendor is acquired or shuts down?
Red flags:
- "Data is retained for service improvement" without a clear opt-out
- Different retention policies for API usage vs. consumer product usage (common with OpenAI)
- No mention of upstream AI provider retention policies
4. Authentication and access control
What to ask the vendor:
- Does the tool support SSO (SAML, OIDC)?
- Is there role-based access control?
- Can you restrict which data sources or file types users can load?
- Is there an admin console with usage visibility?
- Does the tool support MFA?
For browser-based tools, also ask:
- Does the tool require an account to function?
- What data is associated with user accounts?
- Is there a "no account" mode for maximum privacy?
5. Audit trail and compliance documentation
What to ask the vendor:
- Is there an audit log of queries run and files accessed?
- Can logs be exported to your SIEM?
- Does the vendor have SOC 2 Type II certification?
- Is there a DPA (Data Processing Agreement) available?
- Can the vendor provide a BAA for HIPAA-covered data?
- Is there a published security whitepaper or architecture document?
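As a sketch of the kind of audit record worth asking for, here is a minimal JSON Lines entry builder in Python (stdlib only). The field names are illustrative, chosen so entries could be appended to a log file or shipped to a SIEM as-is; no real tool's log format is implied.

```python
import json
import time

def audit_record(user, action, resource, detail=""):
    """Build one audit entry as a JSON line."""
    return json.dumps({
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user": user,
        "action": action,      # e.g. "file_loaded", "query_run", "export"
        "resource": resource,  # file or table name, never raw data values
        "detail": detail,
    })

line = audit_record("analyst@corp.example", "query_run", "sales_q3.csv",
                    detail="SELECT region, SUM(revenue) ... GROUP BY region")
print(line)
```

Note what the record logs: who, when, which file, which query. It deliberately avoids logging data values, so the audit trail itself doesn't become a second copy of the sensitive data.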
Tool comparison matrix
Here's how popular AI data analysis tools compare across these dimensions. This is based on publicly available documentation as of early 2026.
| Tool | Data Residency | AI Sees | Retention | SSO | SOC 2 | BAA Available |
|---|---|---|---|---|---|---|
| ChatGPT ADA | OpenAI servers | Full data | Session + policy | Enterprise only | Yes (Enterprise) | Enterprise only |
| Julius AI | Julius servers | Full data | Per policy | No | No | No |
| Google Colab + Gemini | Google servers | Full data | Per Google policy | Google Workspace | Yes | Via Google Cloud |
| Jupyter + local LLM | Your machine | Configurable | You control | N/A | N/A | N/A |
| DuckDB CLI | Your machine | None (no AI) | You control | N/A | N/A | N/A |
| QueryVeil | Browser (WASM) | Schema only | None (browser memory) | No* | No* | N/A** |
*QueryVeil is an early-stage product. SSO and SOC 2 are on the roadmap for enterprise plans. **Because data never leaves the browser, a BAA may not be required — but consult your legal team.
Key takeaway from the matrix: The tools with the best enterprise compliance features (SSO, SOC 2, BAA) are the same ones that upload your data to their servers. The tools with the strongest data privacy architecture (client-side, schema-only) are often earlier-stage products without full enterprise certifications yet.
This is the actual tradeoff your team needs to evaluate.
Architecture patterns: a visual comparison
Pattern 1: Full cloud
```
User's file --> Upload --> Vendor's server --> AI model (full data) --> Results
                                |
                          Stored on disk
                     (retention policy applies)
```
Pattern 2: Schema-only with browser-local compute
```
User's file --> Browser memory (File API) --> DuckDB-WASM (queries)
                        |
                  Schema extract
                        |
                  AI model (schema only) --> SQL
                        |
                  DuckDB-WASM (runs SQL locally)
                        |
                  Results (in browser)
```
Pattern 3: Fully local
```
User's file --> Browser memory --> DuckDB-WASM --> Local AI (WebLLM/Ollama) --> SQL
                                        |
                                  DuckDB-WASM (runs SQL)
                                        |
                                  Results (in browser)
```
Pattern 3 is the only architecture where zero data leaves the device. Pattern 2 leaks schema metadata (column names and types) to an AI provider. Pattern 1 leaks everything.
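Pattern 2's core idea, SQL generated from schema alone and executed locally, can be demonstrated end to end in a few lines. This sketch uses Python's built-in sqlite3 as a stand-in for the browser's DuckDB-WASM engine, and hard-codes the query where the AI call would be; everything else is what actually runs on-device.

```python
import sqlite3

# Local engine (stand-in for DuckDB-WASM): data is loaded on-device only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("EU", 100.0), ("EU", 50.0), ("US", 200.0)])

# What the AI sees: the schema only. No rows leave the machine.
schema = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'orders'").fetchone()[0]

# What the AI would return for "revenue by region"
# (hard-coded here in place of a real model call).
generated_sql = "SELECT region, SUM(revenue) FROM orders GROUP BY region ORDER BY region"

# Execution happens locally; only results surface to the user.
results = conn.execute(generated_sql).fetchall()
print(schema)    # CREATE TABLE orders (region TEXT, revenue REAL)
print(results)   # [('EU', 150.0), ('US', 200.0)]
```

The point of the exercise: the model only ever needed `schema` to produce `generated_sql`. The rows stayed in the local engine from load to result.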
Risk assessment by data type
Not all data carries the same risk. Here's a practical classification:
| Data Classification | Examples | Acceptable Tool Pattern |
|---|---|---|
| Public | Open datasets, Kaggle data, published benchmarks | Any tool (Pattern 1, 2, or 3) |
| Internal | Revenue metrics, product analytics, operational data | Pattern 1 with DPA, or Pattern 2/3 |
| Confidential | Customer PII, financial records, HR data | Pattern 2 (schema-only) or Pattern 3 (fully local) |
| Restricted | Healthcare PHI, classified data, source code | Pattern 3 only (fully local, air-gapped) |
This gives your analysts clear guidance: "Use whatever you want for public data. For customer data, use a tool from the approved list that doesn't upload files."
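The mapping is simple enough to encode directly. Here is a sketch; the tier names and pattern numbers mirror the table above, while the function itself is illustrative, not a reference implementation.

```python
# Which architecture patterns are acceptable for each data tier (per the table above).
ALLOWED_PATTERNS = {
    "public":       {1, 2, 3},  # any tool
    "internal":     {1, 2, 3},  # pattern 1 only with a DPA in place
    "confidential": {2, 3},     # schema-only or fully local
    "restricted":   {3},        # fully local / air-gapped only
}

def tool_allowed(classification, tool_pattern, has_dpa=False):
    """Return True if a tool's architecture pattern is acceptable for this data tier."""
    tier = classification.lower()
    if tier == "internal" and tool_pattern == 1:
        return has_dpa  # full-upload tools need a signed DPA for internal data
    return tool_pattern in ALLOWED_PATTERNS[tier]

print(tool_allowed("Confidential", 1))            # False: full upload not acceptable
print(tool_allowed("Internal", 1, has_dpa=True))  # True: upload OK with a DPA
```

Encoding the policy this way also makes it auditable: the approved-tools list for each tier is a query against this table, not tribal knowledge.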
Common objections from analysts (and how to address them)
"But I need AI to help me analyze data faster."
You can have AI without uploading data. Schema-only AI generates SQL queries from column names and types. The AI doesn't need to read your customer records to write a GROUP BY query.
"The cloud tool is so much easier."
It's easier because the vendor absorbs the complexity (and your data). Browser-based tools like QueryVeil are approaching feature parity while keeping data local. The UX gap is closing.
"I only upload de-identified data."
De-identification is harder than people think. Research consistently shows that "anonymized" datasets can be re-identified with auxiliary data; Latanya Sweeney famously demonstrated that 87% of Americans can be uniquely identified from just ZIP code, birth date, and sex. If the dataset has more than a few columns, assume it's identifiable.
"Our vendor says they don't use data for training."
Verify this in their current terms of service, not their marketing page. Check the upstream AI provider's terms too. And remember: policies change. Today's privacy commitment may not survive the next board meeting.
Recommended policy template
Based on the framework above, here's a policy structure to consider:
- Classify your data using the four-tier model (Public / Internal / Confidential / Restricted).
- Approve tools per classification. Maintain a list of approved tools for each data tier.
- Default to local-first. For any data classified as Confidential or above, require tools where data stays on the user's device.
- Require vendor documentation. Before approving a new tool, collect: DPA, retention policy, architecture documentation, and AI provider terms.
- Audit quarterly. Review which tools are in use, what data is flowing through them, and whether the vendor landscape has changed.
- Provide a path of least resistance. If you block the easy tools without providing an approved alternative, analysts will find workarounds. Give them something that works.
The bottom line
AI data analysis tools are not going away. Your analysts will use them. The question is whether they use tools your security team has evaluated, or tools they found via a Google search.
Build an evaluation framework based on the five dimensions above. Classify your data. Match tool architectures to data sensitivity. And give your analysts approved options that are genuinely good enough to use.
QueryVeil is a browser-based AI data analyst where files never leave the device and the AI only sees schema. If you're evaluating tools for your security team, try the live demo to see the architecture in action — no signup or file upload required.