
Motivation
If you’re part of any enterprise that wants to use Large Language Models (LLMs) like ChatGPT, Mistral, Gemini, etc., you know there is always some concern around exposing sensitive organizational data to these LLMs. Databricks acquired MosaicML in 2023, and the resulting Mosaic AI platform helps tackle this exact problem: it allows users to use any LLM without compromising data governance policies. However, we still get a lot of questions from our clients around data privacy, and this post attempts to answer some of them.
Let’s start with a basic one.
Q1: What is Mosaic AI, and how is it different from an LLM?
An LLM (Large Language Model) is a reasoning engine trained on a large compilation of text (for example, all of Wikipedia).
Mosaic AI is an enterprise AI platform built by Databricks that operationalizes LLMs safely.
Think of it this way:
- LLM understands and generates language
- Mosaic AI controls how, when, and with what data the LLM is used
Mosaic AI is responsible for governance, orchestration, auditing, and security.
If we use a car analogy,
LLM = A car’s engine
Mosaic AI = The manufacturing plant, safety controls, and the switches on the dashboard
Q2: If I use ChatGPT via Mosaic AI, does OpenAI see my data?
Not unless you explicitly choose to share it. And don’t worry, there are plenty of warnings around this.
Anytime Mosaic AI uses models from providers like OpenAI, Google, etc., it does so using enterprise APIs which have strict guarantees like:
- Your data will never be used for training
- Your data is not retained
- Your data is not shared across customers
Also (more importantly), the LLM never has direct access to your databases, data lakes or schemas.
Q3: How does Mosaic AI answer business questions without exposing data?
We will explain this using an example.
Let’s say you have an employee table in Delta Lake with a date_of_birth column.
A user asks:
“What is John’s date of birth?”
What does not happen
- The employee table is not sent to ChatGPT
- The table DDL (schema information) is not sent
- The LLM does not query the database
What actually happens
- Intent interpretation is done by the LLM
The LLM is used only to understand the meaning of the question.
- Secure query execution is done by Databricks
Mosaic AI translates intent into a SQL query and executes it inside Databricks, where:
  - Unity Catalog enforces column-level and row-level security
  - Access is audited and logged
- Minimal response formatting by the LLM
Only the final value (for example, 1987-04-12) may be passed back to the LLM so it can respond naturally.
None of the following are visible to the LLM:
- The full table
- Other employees’ data
- The database schema
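The three-step flow above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in for illustration, not the actual Mosaic AI API: the in-memory "table", the function names (interpret_intent, run_governed_query, format_response), and the intent format are all assumptions.

```python
# A minimal sketch of the Q3 flow; all names and the in-memory "table"
# are hypothetical stand-ins, not the actual Mosaic AI API.

# Toy "Delta table" that stays entirely inside the platform boundary.
EMPLOYEES = {"John": {"date_of_birth": "1987-04-12"},
             "Mary": {"date_of_birth": "1990-11-03"}}

def interpret_intent(question: str) -> dict:
    # Stand-in for the LLM call: it sees only the question text,
    # never the table or its schema.
    name = question.split("What is ")[1].split("'s")[0]
    return {"name": name, "field": "date_of_birth"}

def run_governed_query(intent: dict) -> str:
    # Stand-in for Databricks executing the SQL: Unity Catalog
    # row/column security and audit logging would apply here.
    return EMPLOYEES[intent["name"]][intent["field"]]

def format_response(intent: dict, value: str) -> str:
    # Stand-in for the LLM phrasing the answer: it receives only
    # the single approved value, nothing else.
    field = intent["field"].replace("_", " ")
    return f"{intent['name']}'s {field} is {value}."

intent = interpret_intent("What is John's date of birth?")
answer = format_response(intent, run_governed_query(intent))
print(answer)  # John's date of birth is 1987-04-12.
```

Note how the LLM stand-ins touch only the question and the single returned value; the lookup itself happens behind the governed boundary.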
Q4: Doesn’t the LLM need the table schema (DDL) to work?
Short answer: no.
If you’re developing an end-user application directly on OpenAI’s API and trying to generate SQL from a natural-language query, you may need to disclose schema information. But here, we are talking about an enterprise-grade tool, designed with privacy and data governance at its core.
Mosaic AI uses one of two secure patterns:
Pattern A: Controlled metadata hints
If needed, Mosaic AI may provide a minimal, sanitized description, such as:
“There is an employee table with a date_of_birth field.”
This is not DDL, contains no data, and is fully controlled.
Pattern B: Tool-based execution (most secure)
The LLM is allowed to call a predefined tool, for example:
{
  "tool": "get_employee_dob",
  "arguments": { "name": "John" }
}
The LLM never sees schemas or SQL. It can only invoke approved tools.
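A small sketch of what such a tool boundary can look like on the platform side. The allow-list dispatcher and the get_employee_dob function are illustrative assumptions, not Mosaic AI internals; the point is that only pre-registered tools are ever invokable.

```python
# Illustrative sketch of tool-based execution; the tool and dispatcher
# are hypothetical, not the actual Mosaic AI API.
import json

def get_employee_dob(name: str) -> str:
    # Runs inside the governed platform; the LLM never sees this
    # function body, the underlying table, or any SQL.
    dobs = {"John": "1987-04-12"}
    return dobs[name]

# Only explicitly registered tools can ever be invoked.
ALLOWED_TOOLS = {"get_employee_dob": get_employee_dob}

def dispatch(tool_call_json: str) -> str:
    call = json.loads(tool_call_json)
    tool = ALLOWED_TOOLS.get(call["tool"])
    if tool is None:
        raise PermissionError(f"Tool {call['tool']!r} is not approved")
    return tool(**call["arguments"])

result = dispatch('{"tool": "get_employee_dob", "arguments": {"name": "John"}}')
print(result)  # 1987-04-12
```

Anything outside the allow-list, no matter how the prompt is phrased, is rejected before it can touch data.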
Q5: How does Mosaic AI guarantee that sensitive data is not leaked?
There are three independent guarantees here, and all of them would have to fail for a breach to occur.
1. Architectural isolation (strongest)
- LLM endpoints have no network access
- No credentials to Delta Lake
- No access to Unity Catalog
This is enforced by infrastructure, not policy.
2. Platform governance
- Tool-based execution (not free-form prompts) – Important!
- Unity Catalog enforces PII, column, and row security
- Full prompt and response logging
- Deterministic prompt construction
3. Contractual guarantees
- Enterprise API contracts prohibit training and retention
- Legal and compliance backstops exist even in misconfiguration scenarios (we mentioned guardrails earlier)
Q6: What would actually violate data privacy?
Very few things! Here are a few:
- Bypassing Mosaic AI and calling LLM APIs directly
- Manually embedding raw tables into prompts
- Using consumer ChatGPT with organizational data
Mosaic AI is designed specifically to prevent these patterns at scale.
Q7: Is Mosaic AI “business-specific AI” without training the model?
Yes.
The AI becomes business-specific not because the model is retrained, but because it’s carefully connected to your data. That happens through things like retrieving the right information at the right time (RAG), running queries in a controlled and governed way, and enforcing who is allowed to see what (Unity Catalog).
The language model itself doesn’t “remember” your data. It simply reasons over the small, approved pieces of information it’s given in the moment, and then lets go.
Going with the car analogy:
The engine never chooses the fuel.
The engine never opens the hood.
The engine (almost) never leaves the car.
Summary
Mosaic AI keeps organizational data inside your enterprise boundary and uses LLMs only as controlled, stateless reasoning components (never as data owners).
Pssst … If you liked this article, consider checking out some of our other Byte Sized Articles.




