TL;DR: When you upload a CSV to most AI data tools, your file goes to their servers, the AI reads every row, and retention policies vary wildly. We traced what actually happens with several popular tools using browser DevTools and published documentation. The results might change how you work with sensitive data.
The experiment
I took a simple CSV file — 500 rows of fake e-commerce data — and uploaded it to five popular AI data analysis tools. For each tool, I opened browser DevTools before uploading, watched the Network tab, and documented exactly what happened.
Then I read each tool's terms of service, data processing agreements, and privacy policies to understand what happens after the upload.
This isn't about calling out specific companies. These are legitimate products built by good teams. The point is: most analysts don't know what happens to their data, and the defaults are rarely optimized for privacy.
Tool 1: ChatGPT (Advanced Data Analysis)
What I did: Uploaded orders.csv and asked "What's the average order value by region?"
What the Network tab showed:
A multipart form upload to https://chatgpt.com/backend-api/conversation. The entire CSV was in the request payload. The file was transmitted to OpenAI's servers.
What happened next:
ChatGPT spun up a sandboxed Python environment, wrote pandas code, executed it, and returned a table with the results. The model had full access to every row in the file.
Retention policy (from OpenAI's docs):
- ChatGPT consumer: Conversations may be used to improve models unless you opt out in settings. Files are retained "for the duration of the conversation" but the exact deletion timeline is ambiguous.
- ChatGPT Enterprise/Team: Data is not used for training. Files are deleted after the session.
- API: Data is retained for 30 days for abuse monitoring, not used for training.
Key finding: On the consumer plan, your CSV data may contribute to model training unless you manually opt out. Most people don't. And the distinction between "consumer" and "Enterprise" plans matters enormously here.
Tool 2: Julius AI
What I did: Uploaded the same CSV and asked the same question.
What the Network tab showed:
The file was uploaded via a POST request to Julius's servers. A WebSocket connection then streamed the analysis results back.
What happened next:
Julius ran Python code against the full dataset on their servers. The AI had complete access to all rows and columns.
Retention policy (from their docs):
Julius states that uploaded files are used only for the current analysis session. Their privacy policy says data may be stored "as long as necessary to provide the service." There's no SOC 2 certification listed publicly. No BAA available for healthcare data.
Key finding: The privacy policy language is vague enough to give a security team pause. "As long as necessary" could mean minutes or months.
Tool 3: Google Sheets + Gemini
What I did: Imported the CSV into Google Sheets and used the "Help me analyze" Gemini sidebar.
What the Network tab showed:
The file was already in Google's infrastructure once imported into Sheets. When I triggered Gemini, additional requests were sent to Google's AI APIs with spreadsheet content.
What happened next:
Gemini analyzed the data and returned insights. The model had access to the spreadsheet data within Google's ecosystem.
Retention policy (from Google's docs):
Google Workspace data policies apply. For consumer accounts, data may be used to improve services. For Workspace Enterprise accounts, Google states that customer data is not used for advertising or training AI models. Gemini in Workspace has a separate data processing framework.
Key finding: If you're on a Google Workspace enterprise plan with a proper DPA in place, this is relatively well-documented. On a personal Gmail account, the picture is murkier.
Tool 4: A Python notebook (Jupyter, local)
What I did: Opened Jupyter locally, loaded the CSV with pandas, wrote a query.
What the Network tab showed:
Network requests only to localhost:8888 (the local Jupyter server). The CSV was read from disk by the Python process running on my machine. No external network calls.
What happened next:
Pandas processed the data locally. No AI involved. I wrote the code myself.
Retention policy: Entirely in my control. The file stays on my disk. The notebook is a local file. Nothing is transmitted.
Key finding: Maximum privacy, maximum friction. Writing Python for every ad-hoc question is slow. This is the baseline against which we should compare other tools.
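To make the friction concrete, here is roughly what answering that one question takes by hand. This is a dependency-free sketch using only the standard library (pandas collapses it to a single `groupby`); the column names match the sample schema, and the function name is illustrative:

```python
import csv
from collections import defaultdict

def avg_order_value_by_region(path):
    """Average order value (quantity * unit_price) per region, computed locally."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += int(row["quantity"]) * float(row["unit_price"])
            counts[row["region"]] += 1
    return {region: totals[region] / counts[region] for region in totals}
```

Same question, same answer, zero network requests — but you have to write a version of this for every new question.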
Tool 5: A browser-based DuckDB-WASM tool
What I did: Opened a browser-based tool running DuckDB WebAssembly, dragged in the CSV, and asked a question in natural language.
What the Network tab showed:
The file load generated zero network requests. The CSV was read from disk via the File API into browser memory. When I asked a question, a small request was sent to an AI API containing only the table schema:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "Table: orders\nColumns: order_id (INTEGER), customer_name (VARCHAR), region (VARCHAR), product (VARCHAR), quantity (INTEGER), unit_price (DOUBLE), order_date (DATE)"
    },
    {
      "role": "user",
      "content": "What's the average order value by region?"
    }
  ]
}
```
No row data. No customer names. No revenue figures. Just column names and types.
What happened next:
The AI returned a SQL query. DuckDB-WASM executed it in the browser. Results rendered locally.
Retention policy: The AI provider (e.g., OpenRouter routing to Claude or GPT-4) retains the prompt per their policy — but the prompt only contains schema metadata. The actual data never left the browser.
Key finding: This architecture separates AI capability from data access. The AI helps write queries without ever seeing the data those queries run against.
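Building a schema-only prompt like the one above requires reading column names and inferring types, nothing more. Here is a minimal sketch — the function names and the crude type-inference rules are illustrative, not any particular tool's implementation (real tools lean on the query engine's own sniffer):

```python
import csv

def infer_type(values):
    """Crude type inference from sample values: INTEGER, DOUBLE, else VARCHAR."""
    try:
        for v in values:
            int(v)
        return "INTEGER"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "DOUBLE"
    except ValueError:
        return "VARCHAR"

def schema_prompt(path, table_name, sample_size=100):
    """Build the system-prompt schema line: column names and types, zero row values."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        sample = [row for _, row in zip(range(sample_size), reader)]
    columns = ", ".join(
        f"{name} ({infer_type([row[i] for row in sample])})"
        for i, name in enumerate(header)
    )
    return f"Table: {table_name}\nColumns: {columns}"
```

Note that `sample` is only used for type inference and is discarded before the prompt is assembled — no cell values survive into the string that goes over the network.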
Summary matrix
| Tool | File uploaded to server? | AI sees full data? | Retention clarity | Training data risk | Verifiable via DevTools? |
|---|---|---|---|---|---|
| ChatGPT ADA | Yes | Yes | Medium (plan-dependent) | Yes (consumer plan) | Yes |
| Julius AI | Yes | Yes | Low (vague policy) | Unclear | Yes |
| Google Sheets + Gemini | Yes (Google infra) | Yes | High (enterprise) / Low (consumer) | Plan-dependent | Partially |
| Jupyter (local) | No | No AI | N/A (you control) | No | Yes |
| Browser WASM + schema AI | No | Schema only | High (only schema sent) | Schema only | Yes |
How to check any tool yourself
You don't need to take my word — or any vendor's word — for it. Here's how to verify:
Step 1: Open browser DevTools before uploading
In both Chrome and Firefox: right-click the page > Inspect > Network tab. Clear the log so you start fresh.
Step 2: Upload your file
Watch the Network tab. Look for:
- POST requests with large payloads — that's your file being uploaded
- WebSocket connections — data might be streaming to a server
- Requests to third-party domains — your data might be going to an AI provider you didn't expect
Step 3: Ask a question
Watch for new requests. Check the request payload:
- Does it contain your actual data values? (Names, numbers, dates from your CSV)
- Or does it contain only metadata? (Column names, types, table structure)
Step 4: Check the response
Is the computation result coming from the server (meaning it ran on their infrastructure) or is JavaScript running a local query engine?
Step 5: Read the fine print
For any tool that sends data to a server:
- Find the privacy policy. Search for "retention," "training," and "third party."
- Find the terms of service. Search for "data," "license," and "use."
- If there's a DPA, read it. It often contradicts the marketing page.
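If eyeballing the Network tab feels error-prone, you can also export the capture (right-click inside the Network tab > "Save all as HAR") and scan it programmatically. A stdlib-only sketch that flags requests with payloads large enough to contain file data — the threshold is arbitrary, and the field names follow the HAR 1.2 format:

```python
import json
from urllib.parse import urlparse

def large_uploads(har_path, min_bytes=10_000):
    """Return (domain, bytes_sent) for requests whose body could contain file data."""
    with open(har_path) as f:
        har = json.load(f)
    findings = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        body_size = request.get("bodySize") or 0
        if body_size >= min_bytes:
            findings.append((urlparse(request["url"]).netloc, body_size))
    return findings
```

Any domain this surfaces that you don't recognize is worth a closer look in the raw HAR — the request body is right there.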
What this means for your workflow
If you're analyzing public datasets, benchmark data, or anything you'd publish on a blog — use whatever tool is most convenient. ChatGPT is genuinely great for this.
If you're analyzing data that contains:
- Customer PII (names, emails, phone numbers)
- Financial records (revenue, costs, margins)
- Healthcare data (any PHI)
- HR data (salaries, performance reviews)
- Anything under NDA
Then you should know exactly where that data goes when you drag it into a tool. Open DevTools. Check the network requests. Read the retention policy. And consider whether a local-first approach — where the data never leaves your machine — is the better default.
The best tool isn't the one with the best AI model. It's the one whose architecture matches your data sensitivity.
QueryVeil is built on the browser-local architecture described in Tool 5. Schema-only AI, DuckDB-WASM, no file upload. Open DevTools and verify. Live demo with sample data, no signup required.