Personal, Agentic Assistants: A Practical Blueprint for a Secure, Multi-User, Self-Hosted Chatbot
Build a self-hosted, end-to-end platform that gives each user a personal, agentic chatbot that can autonomously search through the files that the user explicitly allows it to access. Full control, 100% private, all the benefits of an LLM without the privacy leaks, token costs or external dependencies.
This post walks through how I've built a self-hosted, end-to-end platform that gives each user a personal, agentic chatbot that can autonomously search through only the files that the user explicitly allows it to access.
In other words: full control, 100% private, all the benefits of an LLM without the privacy leaks, token costs or external dependencies.
Intro
Over the past week, I challenged myself to build something that has been on my mind for a while:
how can I supercharge an LLM with my personal data without sacrificing privacy to big tech companies?
That led to this week's challenge:
build an agentic chatbot equipped with tools to access a user's personal notes securely, without compromising privacy.
As an extra challenge, I wanted the system to support multiple users. Not a shared assistant, but a private agent for every user, where each user has full control over which files their agent can read and reason about.
We'll build the system in the following steps:
- Architecture
- How do we create an agent and provide it with tools?
- Flow 1: User file management: What happens when we submit a file?
- Flow 2: How do we embed documents and store files?
- Flow 3: What happens when we chat with our agentic assistant?
- Demonstration
- Conclusion
1) Architecture
I've defined three main "flows" that the system must allow:
A) User file management
Users authenticate through the frontend, upload or delete files and assign each file to specific groups that determine which users' agents may access it.
B) Embedding and storing files
Uploaded files are chunked, embedded and stored in the database in a way that ensures only authorized users can retrieve or search those embeddings.
C) Chat
A user chats with their own agent. The agent is equipped with tools, including a semantic vector-search tool, and can only search documents the user has permission to access.
To support these flows, the system is composed of six key components:

App
A Python application that is the heart of the system. It exposes API endpoints for the front-end and listens for messages coming from the message queue.
Front-End
Normally I'd use Angular, but for this prototype I went with Streamlit. It was very fast and easy to build with. This ease of use, of course, came with the downside of not being able to do everything I wanted. I'm planning on replacing this component with my go-to Angular, but in my opinion Streamlit was very nice for prototyping.
Blob Storage
This container runs MinIO, an open-source, high-performance, distributed object storage system. It's definitely overkill for my prototype, but it was very easy to use and integrates well with Python, so I have no regrets.
(Vector) Database
Postgres handles all the relational data like document metadata, users, user groups and text chunks. Additionally, Postgres offers an extension (pgvector) that I use to store vector data like the embeddings we're aiming to create. This is very convenient for my use case, since I can run a vector search on a table while joining it to the users table, ensuring that each user can only see their own data (a query sketch follows the component overview).
Ollama
Ollama hosts two local models: one for embeddings and one for chat. The models are pretty light-weight but can be easily upgraded, depending on available hardware.
Message Queue
RabbitMQ makes the system responsive. Users don’t have to wait while large files are chunked and embedded. Instead, I return immediately and process the embedding in the background. It also gives me horizontal scalability: multiple workers can process files simultaneously.
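To make the access-control idea from the database component concrete, here is a minimal sketch of what such a query could look like. It assumes pgvector and illustrative table and column names (chunks, documents, document_groups, user_groups); the real schema may well differ.

```python
# Minimal sketch: pgvector similarity search restricted to documents the user may see.
# Table and column names are illustrative assumptions, not the actual schema.
import psycopg2

def search_chunks(conn, user_id: int, query_embedding: list[float], limit: int = 5):
    sql = """
        SELECT c.content,
               c.embedding <=> %s::vector AS distance   -- cosine distance, smallest first
        FROM chunks c
        JOIN documents d        ON d.id = c.document_id
        JOIN document_groups dg ON dg.document_id = d.id
        JOIN user_groups ug     ON ug.group_id = dg.group_id
        WHERE ug.user_id = %s                            -- only this user's groups
        ORDER BY distance
        LIMIT %s;
    """
    with conn.cursor() as cur:
        # str(list) produces '[0.1, 0.2, ...]', which pgvector accepts as vector text.
        cur.execute(sql, (str(query_embedding), user_id, limit))
        return cur.fetchall()
```

Because the filtering happens in the same SQL statement as the similarity search, there is no separate authorization step that could be skipped or get out of sync with the data.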
2) Building an agent with a toolbox
LangGraph makes it easy to define an agent: what steps it can take, how it should reason and which tools it's allowed to use. This agent can then autonomously inspect the available tools, read their descriptions and decide whether calling one of them will help answer the user's question.
The workflow is described as a graph. Think of this as the blueprint for the agent's behavior. In this prototype the graph is intentionally simple:

The LLM checks which tools are available and decides whether a tool call (like vector search) is necessary. The graph loops through the tool node and back to the LLM node until no more tools are needed and the agent has enough information to respond.
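A minimal sketch of such a graph is shown below. The chat model name and the tool body are assumptions; the real vector-search tool would run the access-controlled pgvector query described earlier.

```python
# Minimal LangGraph sketch: an LLM node and a tool node that loop until done.
from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def vector_search(query: str) -> str:
    """Semantically search the documents this user is allowed to access."""
    return "..."  # placeholder: the real tool queries pgvector, filtered by the user's groups

# Model name is an assumption; any tool-capable Ollama model works.
llm = ChatOllama(model="llama3.2").bind_tools([vector_search])

def call_llm(state: MessagesState):
    # The LLM reads the conversation so far and decides whether to call a tool.
    return {"messages": [llm.invoke(state["messages"])]}

graph = StateGraph(MessagesState)
graph.add_node("llm", call_llm)
graph.add_node("tools", ToolNode([vector_search]))
graph.add_edge(START, "llm")
graph.add_conditional_edges("llm", tools_condition)  # tool call -> "tools", otherwise end
graph.add_edge("tools", "llm")                       # loop back until no more tools are needed
agent = graph.compile()
```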
3) Flow 1: Submitting a file
This part describes what happens when a user submits one or more files. First a user has to log in to the front-end, receiving a token that is used to authenticate API calls.
After that they can upload files and assign those files to one or more groups. Any user in those groups will be allowed to access the file through their agent.

In the screenshot above, the user selects two files, a PDF and a Word document, and assigns them to two groups. Behind the scenes, this is how the system processes such an upload (a rough code sketch follows the list):

- The file and groups are sent to the API, which validates the user with the token.
- The file is saved in blob storage, which returns the storage location.
- The file's metadata and storage location are saved in the database, which returns the file_id.
- The file_id is published to the message queue.
- The request is completed; the user can continue using the front-end. Heavy processing (chunking, embedding) happens later in the background.
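Here is a rough sketch of that flow in code. The bucket name, queue name, table names and credentials are my assumptions to make the example self-contained; the actual app is structured differently around a web framework and token validation.

```python
# Sketch of the upload flow: blob storage -> metadata row -> publish file_id.
import io
import json
import uuid
import pika
from minio import Minio

minio_client = Minio("localhost:9000", access_key="...", secret_key="...", secure=False)

def handle_upload(db_conn, user_id: int, filename: str, data: bytes, group_ids: list[int]) -> int:
    # 1) Save the file in blob storage and remember where it landed.
    storage_location = f"{user_id}/{uuid.uuid4()}/{filename}"
    minio_client.put_object("documents", storage_location, io.BytesIO(data), len(data))

    # 2) Store metadata + storage location in Postgres, returning the file_id.
    with db_conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (owner_id, filename, storage_location) "
            "VALUES (%s, %s, %s) RETURNING id;",
            (user_id, filename, storage_location),
        )
        file_id = cur.fetchone()[0]
        for gid in group_ids:
            cur.execute(
                "INSERT INTO document_groups (document_id, group_id) VALUES (%s, %s);",
                (file_id, gid),
            )
    db_conn.commit()

    # 3) Publish only the file_id; chunking and embedding happen later in a worker.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="embed_file", durable=True)
    channel.basic_publish(exchange="", routing_key="embed_file", body=json.dumps({"file_id": file_id}))
    connection.close()
    return file_id
```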
This flow keeps the upload experience fast and responsive, even for large files.
4) Flow 2: Embedding and storing files
Once a document is submitted, the next step is to make it searchable. In order to do this we need to embed our documents. This means that we convert the text from the document into numerical vectors that can capture semantic meaning.
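To make that concrete, here is what a single embedding call to the local Ollama instance could look like. The model name (nomic-embed-text) is my assumption; any Ollama embedding model would work the same way.

```python
# One embedding call against Ollama's REST API: text in, a vector of floats out.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Gert prefers Python"},
)
vector = resp.json()["embedding"]   # a plain list of floats, e.g. 768 of them
print(len(vector), vector[:3])
```

Texts with similar meaning end up with vectors that are close together, which is what makes semantic search possible later on.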
In the previous flow we've submitted a message to the queue. This message only contains a file_id and thus is very small. This means that the system remains fast even when a user uploads dozens or hundreds of files.
The message queue also gives us two important benefits:
- it smooths out load by processing documents one by one instead of all at once
- it future-proofs our system by allowing horizontal scaling: multiple workers can listen to the same queue and process files in parallel (see the consumer sketch below).
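A minimal sketch of such a worker's consumer setup, assuming the embed_file queue name from the upload sketch; the callback body is described in the next list.

```python
# Sketch: several copies of this worker can run at once and share the same queue.
import pika

def process_file_message(ch, method, properties, body):
    # Chunk and embed the file referenced by the file_id in the message
    # (a fuller sketch of this handler appears further below), then acknowledge
    # so RabbitMQ can hand out the next message.
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="embed_file", durable=True)
channel.basic_qos(prefetch_count=1)   # one message per worker at a time spreads the load
channel.basic_consume(queue="embed_file", on_message_callback=process_file_message)
channel.start_consuming()
```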
Here's what happens when the embedding worker receives a message:

- Take a message from the queue; the message contains a file_id.
- Use the file_id to retrieve the document's metadata (filtering by user and allowed groups).
- Use the storage_location from the metadata to download the file.
- The file is read, its text is extracted and split into smaller chunks. Each chunk is embedded: it's sent to the local Ollama instance to generate an embedding.
- The chunks and their vectors are written to the database, alongside the file's access-control information (a simplified version of this handler follows).
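A simplified version of the handler, under the same assumptions as the earlier sketches (table names, bucket name, embedding model); the naive character-based splitter and extract_text stand-in are only there to keep the example self-contained.

```python
# Sketch of the embedding worker's handler: fetch metadata, download, chunk, embed, store.
import json
import requests
from minio import Minio

minio_client = Minio("localhost:9000", access_key="...", secret_key="...", secure=False)
CHUNK_SIZE = 1000  # characters per chunk; a naive splitter for illustration

def embed(text: str) -> list[float]:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},  # model name is an assumption
    )
    return resp.json()["embedding"]

def extract_text(raw: bytes) -> str:
    # Stand-in: real extraction would use e.g. pypdf or python-docx depending on file type.
    return raw.decode("utf-8", errors="ignore")

def handle_file_message(db_conn, body: bytes):
    file_id = json.loads(body)["file_id"]

    # Look up where the file lives.
    with db_conn.cursor() as cur:
        cur.execute("SELECT storage_location FROM documents WHERE id = %s;", (file_id,))
        storage_location = cur.fetchone()[0]

    # Download from MinIO and extract plain text.
    raw = minio_client.get_object("documents", storage_location).read()
    text = extract_text(raw)

    # Chunk, embed and store each chunk next to the document it came from,
    # so the access-control joins from earlier apply to the chunks as well.
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    with db_conn.cursor() as cur:
        for chunk in chunks:
            cur.execute(
                "INSERT INTO chunks (document_id, content, embedding) VALUES (%s, %s, %s::vector);",
                (file_id, chunk, str(embed(chunk))),
            )
    db_conn.commit()
```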
At this point, the document becomes fully searchable by the agent through vector search, but only for users who have been granted access.
5) Flow 3: Chatting with our Agent
With all components in place, we can start chatting with the agent.

When a user types a message, the system orchestrates several steps behind the scenes to deliver a fast and context-aware response:
- The user sends a prompt to the API and is authenticated since only authorized users can interact with their private agent.
- The app optionally retrieves previous messages so that the agent has a "memory" of the current conversation. This ensures that it can respond in the context of the ongoing conversation.
- The compiled LangGraph agent is invoked.
- The LLM (running in Ollama) reasons and optionally uses tools. If needed, it calls the vector-search tool that we've defined in the graph to find relevant document chunks the user is allowed to access. The agent then incorporates those findings into its reasoning and decides whether it has enough information to provide an adequate response.
- The agent's answer is generated incrementally and streamed back to the user for a smooth, real-time chat experience (a rough sketch of this flow follows the list).
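Here is a rough sketch of that chat step. It assumes the compiled agent from the LangGraph sketch earlier; load_history and the way the token-validated user_id reaches this function are illustrative, not the actual implementation.

```python
# Sketch: invoke the compiled LangGraph agent with history and stream tokens back.
from langchain_core.messages import HumanMessage

def load_history(user_id: int) -> list:
    # Stand-in: the real app loads the previous messages of this conversation.
    return []

def chat(user_id: int, prompt: str):
    # user_id is assumed to come from the token-validating API layer (not shown).
    history = load_history(user_id)   # gives the agent "memory" of the conversation

    # stream_mode="messages" yields LLM tokens as they are generated,
    # so the front-end can render the answer incrementally.
    for chunk, _meta in agent.stream(
        {"messages": history + [HumanMessage(prompt)]},
        stream_mode="messages",
    ):
        yield chunk.content
```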
At this point, the user is chatting with their own private, fully local agent that is equipped with the ability to semantically search through their personal notes.
6) Demonstration
Let's see what this looks like in practice.
I've uploaded a Word document with the following content:
Notes
On the 21st of November I spoke with a guy named “Gert Vektorman” that turned out to be a developer at a Groningen company called “super data solutions”. Turns out that he was very interested in implementing agentic RAG at his company. We’ve agreed to meet some time at the end of december.
Edit: I’ve asked Gert what his favorite programming language was; he like using Python
Edit: we’ve met and agreed to create a test implementation. We’ll call this project “project greenfield”

I'll go to the front-end and upload this file.

After uploading, I can see in the front-end that:
- the document is stored in the database
- it has been embedded
- my agent has access to it
Now, let's chat.

As you can see, the agent is able to respond with the information from our file. It's also surprisingly fast; this question was answered in a few seconds.
7) Conclusion
I love challenges that allow me to experiment with new tech and work across the whole stack, from the database to agent graphs, and from the front-end to the Docker images. Designing the system and choosing a working architecture is something I always enjoy. It allows me to convert goals into requirements, flows, architecture, components, code and eventually a working product.
This week’s challenge was exactly that: exploring and experimenting with private, multi-user, agentic RAG. I've built a working, expandable, reusable, scalable prototype that can be improved upon in the future. Most importantly, I've found that local, 100% private, agentic LLMs are possible.
Technical learnings
- Postgres + pgvector is powerful. Storing embeddings alongside relational metadata kept everything clean, consistent and easy to query since there was no need for an extra vector database.
- LangGraph makes it surprisingly easy to define an agent workflow, equip it with tools and let the agent decide when to use them
- Private, local, self-hosted agents are feasible. With Ollama running two lightweight models (one for chat, one for embeddings), everything runs on my MacBook with impressive speed
- Building a multi-tenant system with strict data isolation was a lot easier once the architecture was clean and responsibilities were separated across components
- Loose coupling makes it easier to replace and scale components
Next steps
This system is ready for upgrades:
- Incremental re-embedding for documents that change over time (so I can plug in my Obsidian vault seamlessly).
- Citations that point the user to the exact files/pages/chunks the LLM used to answer my question, improving trust and explainability.
- More tools for the agent — from structured summarizers to SQL access. Maybe even ontologies or user profiles?
- A richer frontend with better file management and user experience
I hope this article was as clear as I intended it to be, but if this is not the case, please let me know what I can clarify further. In the meantime, check out my other articles on all kinds of programming-related topics.
Happy coding!
— Mike
P.S.: Like what I'm doing? Follow me!