You are choosing an AI service for working with corporate documents and see two camps: convenient cloud solutions like ChatGPT or Notion AI, and self-hosted options where everything is deployed on your own server. The difference in convenience is obvious. But where your documents physically end up is a question most businesses never ask until their first GDPR audit. Short answer: the cloud services covered here store your data on US servers by default, while a self-hosted deployment keeps it on your server only. For businesses in the EU, this can be the difference between compliance and violation.
⚡ TL;DR
- ☁️ OpenAI FileSearch: files are stored on OpenAI servers (US, Microsoft Azure) — no EU region by default
- 📓 Notion AI: data is processed through sub-processors (Anthropic, OpenAI) — servers outside your control
- 🏠 Self-hosted: all components on your server — no external access
- ⚖️ GDPR status: cloud requires DPA + SCCs + DPIA; self-hosted is compliant by default with an EU server
- 🏥 For healthcare and legal: cloud AI is legally unacceptable without special measures
- 👇 Below — a detailed breakdown of each option with real facts from provider documentation
📚 Table of Contents
- How cloud AI handles your documents
- Where OpenAI FileSearch physically stores your data
- Where Notion AI physically stores your data
- What self-hosted means and how it differs architecturally
- Comparison table: OpenAI vs Notion vs self-hosted
- Which businesses cannot legally use cloud AI
- Conclusion: when self-hosted is the only option
- FAQ
- Key takeaways
- Want to check your option?
How cloud AI handles your documents
When you upload a document to a cloud AI service, it is physically copied to the provider's servers. There it is split into chunks, indexed and stored to answer your queries. Your document is no longer only yours.
Cloud AI services are convenient. You sign up, upload a PDF, and get answers in seconds. But behind that convenience is a technical process most users never see.
Here is what happens to your document after uploading to a cloud service:
- ✔️ Transfer: the file is sent over the internet to the provider's servers — encrypted, but to third-party infrastructure
- ✔️ Parsing and chunking: the document is split into text fragments of a few hundred words each
- ✔️ Vectorization: each fragment is converted into a numerical vector and stored in the provider's vector database
- ✔️ Storage: both the original file and the vectors remain on the provider's servers — often with no clear retention period for free plans
- ✔️ Queries: every question you ask the AI is also sent to the provider's servers and may be stored in logs
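The chunking, vectorization and query steps above can be sketched in a few lines. This is a minimal illustration, not any provider's actual pipeline: `chunk_words`, `toy_embedding`, `build_index` and `search` are hypothetical names, and the hash-based "embedding" is a stand-in for a real embedding model. The point is that both the text fragments and their vectors live wherever this code runs.

```python
import hashlib
import math


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks


def toy_embedding(chunk: str, dims: int = 64) -> list[float]:
    """Hash each word into one of dims buckets and L2-normalize the counts.

    A toy replacement for a real embedding model, used only to show the data flow.
    """
    vec = [0.0] * dims
    for word in chunk.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_index(text: str, chunk_size: int = 200, overlap: int = 20):
    """Return (chunk, vector) pairs: this is what a vector store holds."""
    return [(c, toy_embedding(c)) for c in chunk_words(text, chunk_size, overlap)]


def search(index, query: str, top_k: int = 1) -> list[str]:
    """Rank stored chunks by similarity to the query and return the best ones."""
    q = toy_embedding(query)
    ranked = sorted(index, key=lambda item: cosine(item[1], q), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

With a cloud service, `build_index` and `search` run on the provider's infrastructure, so the chunks, the vectors and every query string are held there.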
For personal use, this is fine. But for corporate documents containing client personal data, medical records or material covered by attorney-client privilege, each of these steps is legally significant under GDPR. For a detailed comparison of popular AI document services, see our overview "5 AI Services for Document Work: Business Comparison" →
It is important to understand: the provider is not necessarily misusing your data. But because your documents physically reside on their servers, the provider is a data processor under GDPR, and the full chain of requirements applies: a DPA, a transfer risk assessment and, for high-risk processing, a DPIA.
Summary: cloud AI always means transferring your documents to a third party. The question is whether that third party is in the right jurisdiction and whether you have the required documentation.
Where OpenAI FileSearch physically stores your data
OpenAI FileSearch stores uploaded files and vector indexes on OpenAI servers in the US (Microsoft Azure infrastructure). EU region selection is not available for standard API customers. EU data residency is only possible for ChatGPT Enterprise customers — a separate product at a separate price.
OpenAI FileSearch is a built-in tool for searching uploaded documents within the Assistants API and Responses API. Technically it works like this: you upload a file, it is automatically chunked, vectorized and stored in a vector store on OpenAI's servers.
Key facts about data storage according to official OpenAI documentation:
- ✔️ Default location: OpenAI servers in the US, Microsoft Azure infrastructure. Standard API customers cannot select an EU region
- ✔️ File retention: vector stores created automatically through thread helpers expire by default 7 days after they were last active. But files in the library are retained until manually deleted or the account is closed
- ✔️ Cross-border transfers: according to OpenAI's EU Privacy Policy, OpenAI relies on Standard Contractual Clauses (SCCs) for data transfers outside the EU. After Schrems II, SCCs on their own may not be sufficient
- ✔️ Model training: for API and Enterprise customers, OpenAI officially does not use data for model training. For free and Plus users, it does by default unless disabled in settings
- ✔️ Staff access: OpenAI may review content for safety and service improvement. Full technical exclusion of access is not available
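To make the retention rule concrete, here is a small sketch of how a 7-day "last active" expiration behaves. The policy shape (`anchor` set to `last_active_at` plus a `days` count) follows our reading of the OpenAI vector store API reference and should be verified there; the helper names `expires_at` and `is_expired` are hypothetical and not part of any SDK.

```python
from datetime import datetime, timedelta, timezone

# Assumed shape of a vector store expiration policy (check the API reference).
DEFAULT_POLICY = {"anchor": "last_active_at", "days": 7}


def expires_at(last_active_at: datetime, policy: dict = DEFAULT_POLICY) -> datetime:
    """Return the moment a vector store would expire under the given policy."""
    if policy["anchor"] != "last_active_at":
        raise ValueError("only the last_active_at anchor is modeled here")
    return last_active_at + timedelta(days=policy["days"])


def is_expired(last_active_at: datetime, now: datetime,
               policy: dict = DEFAULT_POLICY) -> bool:
    """True once the policy window has fully elapsed since the last activity."""
    return now >= expires_at(last_active_at, policy)


last_used = datetime(2025, 1, 1, tzinfo=timezone.utc)
deadline = expires_at(last_used)  # 2025-01-08 00:00 UTC
```

Note what this does not cover: the expiry applies to the vector store, while the uploaded files themselves stay in place until someone deletes them. Every new query also resets the clock, so an actively used store never reaches its deadline.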
One nuance: ChatGPT Enterprise offers data residency — the ability to store data in an EU region. But this is a separate enterprise product priced at several thousand dollars per year, not standard API access. Most small and medium businesses use the standard API or ChatGPT Plus — with no region selection option.
Bottom line: if you use OpenAI FileSearch via the standard API for documents containing EU personal data, your data is stored in the US with no region choice. This requires a separate legal basis for cross-border transfer under GDPR Articles 44–49.
Summary: OpenAI FileSearch is a powerful tool, but GDPR-compliant enterprise use requires either an Enterprise plan or additional legal measures that most businesses simply do not implement.
Where Notion AI physically stores your data
Notion AI transfers your workspace content to sub-processors — Anthropic and OpenAI — to generate responses. Notion's servers are in the US (AWS). The Enterprise plan offers zero data retention at sub-processors, but not at Notion itself.
Notion is a popular corporate knowledge base platform. With the addition of Notion AI, businesses gained the ability to ask questions about their documents directly in the interface. But behind that convenience lies a more complex data processing chain.
What happens to your data in Notion AI according to official Notion documentation:
- ✔️ Sub-processors: Notion AI uses third-party LLM providers — including Anthropic and OpenAI. When you ask a question, relevant content from your workspace is sent to these providers to generate the response. The full list of sub-processors is available on the Notion AI security practices page
- ✔️ Notion server location: US, AWS infrastructure. Notion has signed SCCs for EU data transfers, but servers are physically in the US
- ✔️ Zero data retention at sub-processors: for Enterprise plans, sub-processors (Anthropic, OpenAI) do not retain data after processing the request. For standard plans, this is not guaranteed
- ✔️ Model training: Notion officially states it does not use customer data to train its own or third-party models
- ✔️ Encryption: data is encrypted in transit (TLS) and at rest (AES-256)
The core GDPR problem: even if Notion has a DPA and SCCs — your data still physically passes through multiple US companies (Notion → Anthropic or OpenAI). Each link in this chain is a potential liability point.
For businesses handling sensitive data this means that before using Notion AI you must sign a DPA with Notion, confirm your plan includes zero retention at sub-processors (i.e. Enterprise), conduct a DPIA and have a legal basis for transfer to the US. In practice, this is weeks of legal work.
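The four preconditions listed above can be tracked as a simple checklist. The sketch below is illustrative only: the class and field names are hypothetical, and it records what has been done rather than deciding whether the result is legally sufficient.

```python
from dataclasses import dataclass


@dataclass
class NotionAIReadiness:
    """One flag per precondition named in the text (field names are made up)."""
    dpa_signed: bool
    zero_retention_plan: bool      # i.e. an Enterprise plan
    dpia_done: bool
    us_transfer_legal_basis: bool

    def missing(self) -> list[str]:
        """Return the open items, in the order the text lists them."""
        labels = {
            "dpa_signed": "sign a DPA with Notion",
            "zero_retention_plan": "confirm zero retention at sub-processors",
            "dpia_done": "conduct a DPIA",
            "us_transfer_legal_basis": "establish a legal basis for transfer to the US",
        }
        return [label for name, label in labels.items() if not getattr(self, name)]


status = NotionAIReadiness(
    dpa_signed=True,
    zero_retention_plan=False,
    dpia_done=False,
    us_transfer_legal_basis=True,
)
open_items = status.missing()
```

An empty `missing()` list means only that the paperwork exists. Whether it holds up is still a question for legal counsel.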
Summary: Notion AI is convenient, but the sub-processor chain and US servers create a GDPR compliance burden that most small and medium businesses do not anticipate when signing up.
What self-hosted means and how it differs architecturally
Self-hosted AI means all system components (database, vector index, documents and optionally the AI model itself) are deployed on your server. Data goes nowhere — it always stays with you.
Imagine the difference between two scenarios. In the first, you hand your documents to a third-party storage facility: convenient, but they are no longer with you. In the second, you build your own archive room in your office: more responsibility, but full control.
A self-hosted AI document assistant works exactly on the second principle. Here is what the architecture consists of:
- ✔️ Your server (VPS): a rented or owned server in any region; for GDPR compliance, choose Germany, Austria, the Netherlands or another EU country
- ✔️ Database with vector search: PostgreSQL with pgvector extension — stores your documents and vector indexes locally on the server
- ✔️ AI model (two options):
  - Hybrid mode: the LLM is external (OpenAI, Mistral via API), but only anonymized text fragments without file names and metadata are sent to it
  - Closed loop: the LLM is local (Ollama with Llama or Mistral), and no request leaves your server
- ✔️ Chat interface: web widget or API, accessible only from allowed domains (origin filter)
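As an illustration of what "anonymized text fragments without file names and metadata" could look like in hybrid mode, here is a minimal sketch. It is not the AskYourDocs implementation: the `Chunk` fields, the regular expressions and `build_llm_payload` are assumptions made for the example. Regex redaction is also a crude approximation, and real anonymization usually needs entity recognition and review.

```python
import re
from dataclasses import dataclass

# Rough patterns for two common identifiers; deliberately simple.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


@dataclass
class Chunk:
    text: str
    file_name: str   # stays on your server
    page: int        # stays on your server


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def build_llm_payload(question: str, chunks: list[Chunk]) -> dict:
    """Return what would be sent to the external LLM: redacted text only."""
    return {
        "question": redact(question),
        # file_name and page are intentionally left out of the payload
        "context": [redact(c.text) for c in chunks],
    }


chunk = Chunk(
    text="Contact anna@example.com or +49 30 1234567 about the contract.",
    file_name="client_contract.pdf",
    page=3,
)
payload = build_llm_payload("Who is the contact person?", [chunk])
```

In closed-loop mode this step is unnecessary, because the prompt never leaves the server in the first place.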
From a GDPR perspective this architecture is fundamentally different: no external data processor, no cross-border transfer (with an EU server), no DPA required with an AI provider. This holds fully in closed-loop mode; in hybrid mode, anonymized fragments still reach the external LLM provider. Your company is both controller and de-facto processor, so the entire chain of responsibility stays with you.
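Whether a given deployment really is a closed loop can be checked mechanically by looking at where each component's endpoint points. The sketch below is an assumption-laden example: the endpoint names and URLs are invented for illustration (Ollama's default local port 11434 is the only real detail), and treating every DNS name as external is a deliberately conservative choice. It also includes a simple origin filter of the kind described for the chat widget.

```python
import ipaddress
from urllib.parse import urlparse


def is_local(url: str) -> bool:
    """True if the URL points at this machine or a private network address."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name: treated as external to stay on the safe side
    return ip.is_loopback or ip.is_private


def external_endpoints(endpoints: dict[str, str]) -> dict[str, str]:
    """Return the components whose traffic leaves your server."""
    return {name: url for name, url in endpoints.items() if not is_local(url)}


def is_origin_allowed(origin: str, allowed: set[str]) -> bool:
    """Exact match on scheme://host[:port], as sent in the Origin header."""
    return origin in allowed


closed_loop = {
    "database": "postgresql://127.0.0.1:5432/docs",
    "llm": "http://localhost:11434",          # local Ollama
}
hybrid = {
    "database": "postgresql://127.0.0.1:5432/docs",
    "llm": "https://api.openai.com/v1",       # external LLM API
}
```

An empty result from `external_endpoints` corresponds to the closed-loop option. Any entry it returns is a party that receives data and has to be considered in the GDPR assessment.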
Important: self-hosted does not mean "build it yourself". AskYourDocs is deployed turnkey in 5–7 business days — from server setup to document upload and chat widget configuration. After project handover we have no technical access to your database or documents — you receive full control along with administrator credentials. More about the implementation process — on our services page →
Summary: self-hosted AI is neither complex nor expensive. It is a different architecture where your data never leaves your perimeter.
Comparison table: OpenAI vs Notion vs self-hosted
The main difference is not in answer quality — it is in where your data physically lives and who has access to it.
| Parameter | OpenAI FileSearch | Notion AI | AskYourDocs (self-hosted) |
|---|---|---|---|
| Where documents are stored | OpenAI servers (US) | Notion servers (US, AWS) | Your server (EU or anywhere) |
| Third parties with data access | OpenAI, Microsoft | Notion, Anthropic, OpenAI | None |
| Data transfer outside EU | Yes (US) | Yes (US) | No (with EU server) |
| DPA required | Yes | Yes | No |
| Model training on your data | No (API/Enterprise) | No (officially) | No (technically impossible) |
| Closed loop (no internet) | Not possible | Not possible | Yes (with Ollama) |
| GDPR without additional measures | No | No | Yes (with EU server) |
| LLM provider choice | OpenAI only | Notion/OpenAI/Anthropic only | Any |
| Implementation cost | Pay-per-use API | From $16/month/user | From $500 one-time |
| Vendor lock-in | Full | Full | None |
A few important notes on the table:
- ✔️ OpenAI Enterprise offers EU data residency — but this is a separate enterprise product with individual pricing, unavailable to most small and medium businesses
- ✔️ Notion Enterprise offers zero retention at sub-processors — but data is still stored on Notion servers in the US
- ✔️ "No model training" from cloud providers is an official statement — but technical verification is impossible
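The cost row of the table can be turned into a rough break-even estimate. This is back-of-the-envelope arithmetic under stated assumptions: the $16 per user per month and $500 one-time figures are the "from" prices in the table, while `server_monthly` is a hypothetical VPS cost that the table does not state. The function name is made up for this example.

```python
import math


def months_to_break_even(users: int,
                         per_user_monthly: float = 16.0,
                         one_time: float = 500.0,
                         server_monthly: float = 0.0) -> float:
    """Months until cumulative per-seat fees match the self-hosted cost.

    server_monthly is an assumed running cost for your own server.
    """
    monthly_saving = users * per_user_monthly - server_monthly
    if monthly_saving <= 0:
        return math.inf  # the subscription never becomes the more expensive option
    return one_time / monthly_saving


five_users = months_to_break_even(5)                       # 500 / 80
with_server = months_to_break_even(5, server_monthly=30)   # 500 / 50
```

For a team of five, the one-time fee is matched after 6.25 months at the listed starting prices, or after 10 months if the server is assumed to cost $30 per month. Actual quotes on either side will shift the result.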
Summary: for businesses that take GDPR seriously, the table speaks for itself. Self-hosted solves the problem that cloud services only try to soften with contracts.