Personal, Agentic Assistants: A Practical Blueprint for a Secure, Multi-User, Self-Hosted Chatbot
Build a self-hosted, end-to-end platform that gives each user a personal, agentic chatbot that can autonomously search through the files that the user explicitly allows it to access. Full control, 100% private, all the benefits of an LLM without the privacy leaks, token costs or external dependencies.
This post walks through how I've built a self-hosted, end-to-end platform that gives each user a personal, agentic chatbot that can autonomously search through only the files that the user explicitly allows it to access.
In other words: full control, 100% private, all the benefits of an LLM without the privacy leaks, token costs or external dependencies.
Intro
Over the past week, I challenged myself to build something that has been on my mind for a while:
how can I supercharge an LLM with my personal data without sacrificing privacy to big tech companies?
That led to this week's challenge:
build an agentic chatbot equipped with tools to access a user's personal notes securely, without compromising privacy.
As an extra challenge, I wanted the system to support multiple users. Not a shared assistant, but a private agent for every user, where each user has full control over which files their agent can read and reason about.
We'll build the system in the following steps:
- Architecture
- How do we create an agent and provide it with tools?
- Flow 1: User file management: What happens when we submit a file?
- Flow 2: How do we embed documents and store files?
- Flow 3: What happens when we chat with our agentic assistant?
- Demonstration
- Conclusion
1) Architecture
I've defined three main "flows" that the system must allow:
A) User file management
Users authenticate through the frontend, upload or delete files and assign each file to specific groups that determine which users' agents may access it.
B) Embedding and storing files
Uploaded files are chunked, embedded and stored in the database in a way that ensures only authorized users can retrieve or search those embeddings.
C) Chat
A user chats with their own agent. The agent is equipped with tools, including a semantic vector-search tool, and can only search documents the user has permission to access.
To support these flows, the system is composed of six key components:

App
A Python application that is the heart of the system. It exposes API endpoints for the front-end and listens for messages coming from the message queue.
Front-End
Normally I'd use Angular, but for this prototype I went with Streamlit. It was very fast and easy to build with. This ease of use, of course, came with the downside of not being able to do everything I wanted. I'm planning on replacing this component with my go-to Angular, but in my opinion Streamlit was very nice for prototyping.
Blob Storage
This container runs MinIO, an open-source, high-performance, distributed object storage system. It's definitely overkill for my prototype, but it was very easy to use and integrates well with Python, so I have no regrets.
(Vector) Database
Postgres handles all the relational data like document metadata, users, user groups and text chunks. Additionally, Postgres offers an extension (pgvector) that I use to store vector data like the embeddings we're aiming to create. This is very convenient for my use case, since I can run a vector search on a table while joining it to the users table, ensuring that each user can only see their own data (a query sketch follows the component overview).
Ollama
Ollama hosts two local models: one for embeddings and one for chat. The models are pretty light-weight but can be easily upgraded, depending on available hardware.
Message Queue
RabbitMQ makes the system responsive. Users don’t have to wait while large files are chunked and embedded. Instead, I return immediately and process the embedding in the background. It also gives me horizontal scalability: multiple workers can process files simultaneously.
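To make the access-control idea from the database component concrete, here is a minimal sketch of what such a query could look like. It assumes pgvector and illustrative table and column names (chunks, documents, document_groups, user_groups); the real schema may well differ.

```python
# Minimal sketch: pgvector similarity search restricted to documents the user may see.
# Table and column names are illustrative assumptions, not the actual schema.
import psycopg2

def search_chunks(conn, user_id: int, query_embedding: list[float], limit: int = 5):
    sql = """
        SELECT c.content,
               c.embedding <=> %s::vector AS distance   -- cosine distance, smallest first
        FROM chunks c
        JOIN documents d        ON d.id = c.document_id
        JOIN document_groups dg ON dg.document_id = d.id
        JOIN user_groups ug     ON ug.group_id = dg.group_id
        WHERE ug.user_id = %s                            -- only this user's groups
        ORDER BY distance
        LIMIT %s;
    """
    with conn.cursor() as cur:
        # str(list) produces '[0.1, 0.2, ...]', which pgvector accepts as vector text.
        cur.execute(sql, (str(query_embedding), user_id, limit))
        return cur.fetchall()
```

Because the filtering happens in the same SQL statement as the similarity search, there is no separate authorization step that could be skipped or get out of sync with the data.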
2) Building an agent with a toolbox
LangGraph makes it easy to define an agent: what steps it can take, how it should reason and which tools it's allowed to use. This agent can then autonomously inspect the available tools, read their descriptions and decide whether calling one of them will help answer the user's question.
The workflow is described as a graph. Think of this as the blueprint for the agent's behavior. In this prototype the graph is intentionally simple:

The LLM checks which tools are available and decides whether a tool call (like vector search) is necessary. The graph loops through the tool node and back to the LLM node until no more tools are needed and the agent has enough information to respond.
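A minimal sketch of such a graph is shown below. The chat model name and the tool body are assumptions; the real vector-search tool would run the access-controlled pgvector query described earlier.

```python
# Minimal LangGraph sketch: an LLM node and a tool node that loop until done.
from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def vector_search(query: str) -> str:
    """Semantically search the documents this user is allowed to access."""
    return "..."  # placeholder: the real tool queries pgvector, filtered by the user's groups

# Model name is an assumption; any tool-capable Ollama model works.
llm = ChatOllama(model="llama3.2").bind_tools([vector_search])

def call_llm(state: MessagesState):
    # The LLM reads the conversation so far and decides whether to call a tool.
    return {"messages": [llm.invoke(state["messages"])]}

graph = StateGraph(MessagesState)
graph.add_node("llm", call_llm)
graph.add_node("tools", ToolNode([vector_search]))
graph.add_edge(START, "llm")
graph.add_conditional_edges("llm", tools_condition)  # tool call -> "tools", otherwise end
graph.add_edge("tools", "llm")                       # loop back until no more tools are needed
agent = graph.compile()
```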
3) Flow 1: Submitting a file
This part describes what happens when a user submits one or more files. First a user has to log in to the front-end, receiving a token that is used to authenticate API calls.
After that they can upload files and assign those files to one or more groups. Any user in those groups will be allowed to access the file through their agent.

In the screenshot above, the user selects two files, a PDF and a Word document, and assigns them to two groups. Behind the scenes, this is how the system processes such an upload (a rough code sketch follows the list):

- The file and groups are sent to the API, which validates the user with the token.
- The file is saved in blob storage, which returns the storage location.
- The file's metadata and storage location are saved in the database, which returns the file_id.
- The file_id is published to the message queue.
- The request is completed; the user can continue using the front-end. Heavy processing (chunking, embedding) happens later in the background.
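Here is a rough sketch of that flow in code. The bucket name, queue name, table names and credentials are my assumptions to make the example self-contained; the actual app is structured differently around a web framework and token validation.

```python
# Sketch of the upload flow: blob storage -> metadata row -> publish file_id.
import io
import json
import uuid
import pika
from minio import Minio

minio_client = Minio("localhost:9000", access_key="...", secret_key="...", secure=False)

def handle_upload(db_conn, user_id: int, filename: str, data: bytes, group_ids: list[int]) -> int:
    # 1) Save the file in blob storage and remember where it landed.
    storage_location = f"{user_id}/{uuid.uuid4()}/{filename}"
    minio_client.put_object("documents", storage_location, io.BytesIO(data), len(data))

    # 2) Store metadata + storage location in Postgres, returning the file_id.
    with db_conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (owner_id, filename, storage_location) "
            "VALUES (%s, %s, %s) RETURNING id;",
            (user_id, filename, storage_location),
        )
        file_id = cur.fetchone()[0]
        for gid in group_ids:
            cur.execute(
                "INSERT INTO document_groups (document_id, group_id) VALUES (%s, %s);",
                (file_id, gid),
            )
    db_conn.commit()

    # 3) Publish only the file_id; chunking and embedding happen later in a worker.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="embed_file", durable=True)
    channel.basic_publish(exchange="", routing_key="embed_file", body=json.dumps({"file_id": file_id}))
    connection.close()
    return file_id
```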
This flow keeps the upload experience fast and responsive, even for large files.
4) Flow 2: Embedding and storing files
Once a document is submitted, the next step is to make it searchable. In order to do this we need to embed our documents. This means that we convert the text from the document into numerical vectors that can capture semantic meaning.
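To make that concrete, here is what a single embedding call to the local Ollama instance could look like. The model name (nomic-embed-text) is my assumption; any Ollama embedding model would work the same way.

```python
# One embedding call against Ollama's REST API: text in, a vector of floats out.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Gert prefers Python"},
)
vector = resp.json()["embedding"]   # a plain list of floats, e.g. 768 of them
print(len(vector), vector[:3])
```

Texts with similar meaning end up with vectors that are close together, which is what makes semantic search possible later on.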
In the previous flow we've submitted a message to the queue. This message only contains a file_id and thus is very small. This means that the system remains fast even when a user uploads dozens or hundreds of files.
The message queue also gives us two important benefits:
- it smooths out load by processing documents one by one instead of all at once
- it future-proofs our system by allowing horizontal scaling: multiple workers can listen to the same queue and process files in parallel (see the consumer sketch below).
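A minimal sketch of such a worker's consumer setup, assuming the embed_file queue name from the upload sketch; the callback body is described in the next list.

```python
# Sketch: several copies of this worker can run at once and share the same queue.
import pika

def process_file_message(ch, method, properties, body):
    # Chunk and embed the file referenced by the file_id in the message
    # (a fuller sketch of this handler appears further below), then acknowledge
    # so RabbitMQ can hand out the next message.
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="embed_file", durable=True)
channel.basic_qos(prefetch_count=1)   # one message per worker at a time spreads the load
channel.basic_consume(queue="embed_file", on_message_callback=process_file_message)
channel.start_consuming()
```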
Here's what happens when the embedding worker receives a message:

- Take a message from the queue; the message contains a file_id.
- Use the file_id to retrieve the document's metadata (filtering by user and allowed groups).
- Use the storage_location from the metadata to download the file.
- The file is read, its text is extracted and split into smaller chunks. Each chunk is embedded: it's sent to the local Ollama instance to generate an embedding.
- The chunks and their vectors are written to the database, alongside the file's access-control information (a simplified version of this handler follows).
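A simplified version of the handler, under the same assumptions as the earlier sketches (table names, bucket name, embedding model); the naive character-based splitter and extract_text stand-in are only there to keep the example self-contained.

```python
# Sketch of the embedding worker's handler: fetch metadata, download, chunk, embed, store.
import json
import requests
from minio import Minio

minio_client = Minio("localhost:9000", access_key="...", secret_key="...", secure=False)
CHUNK_SIZE = 1000  # characters per chunk; a naive splitter for illustration

def embed(text: str) -> list[float]:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},  # model name is an assumption
    )
    return resp.json()["embedding"]

def extract_text(raw: bytes) -> str:
    # Stand-in: real extraction would use e.g. pypdf or python-docx depending on file type.
    return raw.decode("utf-8", errors="ignore")

def handle_file_message(db_conn, body: bytes):
    file_id = json.loads(body)["file_id"]

    # Look up where the file lives.
    with db_conn.cursor() as cur:
        cur.execute("SELECT storage_location FROM documents WHERE id = %s;", (file_id,))
        storage_location = cur.fetchone()[0]

    # Download from MinIO and extract plain text.
    raw = minio_client.get_object("documents", storage_location).read()
    text = extract_text(raw)

    # Chunk, embed and store each chunk next to the document it came from,
    # so the access-control joins from earlier apply to the chunks as well.
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    with db_conn.cursor() as cur:
        for chunk in chunks:
            cur.execute(
                "INSERT INTO chunks (document_id, content, embedding) VALUES (%s, %s, %s::vector);",
                (file_id, chunk, str(embed(chunk))),
            )
    db_conn.commit()
```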
At this point, the document becomes fully searchable by the agent through vector search, but only for users who have been granted access.
5) Flow 3: Chatting with our Agent
With all components in place, we can start chatting with the agent.

When a user types a message, the system orchestrates several steps behind the scenes to deliver a fast and context-aware response:
- The user sends a prompt to the API and is authenticated since only authorized users can interact with their private agent.
- The app optionally retrieves previous messages so that the agent has a "memory" of the current conversation. This ensures that it can respond in the context of the ongoing conversation.
- The compiled LangGraph agent is invoked.
- The LLM (running in Ollama) reasons and optionally uses tools. If needed, it calls the vector-search tool that we've defined in the graph to find relevant document chunks the user is allowed to access. The agent then incorporates those findings into its reasoning and decides whether it has enough information to provide an adequate response.
- The agent's answer is generated incrementally and streamed back to the user for a smooth, real-time chat experience (a rough sketch of this flow follows the list).
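Here is a rough sketch of that chat step. It assumes the compiled agent from the LangGraph sketch earlier; load_history and the way the token-validated user_id reaches this function are illustrative, not the actual implementation.

```python
# Sketch: invoke the compiled LangGraph agent with history and stream tokens back.
from langchain_core.messages import HumanMessage

def load_history(user_id: int) -> list:
    # Stand-in: the real app loads the previous messages of this conversation.
    return []

def chat(user_id: int, prompt: str):
    # user_id is assumed to come from the token-validating API layer (not shown).
    history = load_history(user_id)   # gives the agent "memory" of the conversation

    # stream_mode="messages" yields LLM tokens as they are generated,
    # so the front-end can render the answer incrementally.
    for chunk, _meta in agent.stream(
        {"messages": history + [HumanMessage(prompt)]},
        stream_mode="messages",
    ):
        yield chunk.content
```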
At this point, the user is chatting with their own private, fully local agent that is equipped with the ability to semantically search through their personal notes.
6) Demonstration
Let's see what this looks like in practice.
I've uploaded a Word document with the following content:
Notes
On the 21st of November I spoke with a guy named “Gert Vektorman” that turned out to be a developer at a Groningen company called “super data solutions”. Turns out that he was very interested in implementing agentic RAG at his company. We’ve agreed to meet some time at the end of december.
Edit: I’ve asked Gert what his favorite programming language was; he like using Python
Edit: we’ve met and agreed to create a test implementation. We’ll call this project “project greenfield”

I'll go to the front-end and upload this file.

After uploading, I can see in the front-end that:
- the document is stored in the database
- it has been embedded
- my agent has access to it
Now, let's chat.

As you can see, the agent is able to respond with the information from our file. It's also surprisingly fast; this question was answered in a few seconds.
7) Conclusion
I love challenges that allow me to experiment with new tech and work across the whole stack, from the database to agent graphs, and from the front-end to the Docker images. Designing the system and choosing a working architecture is something I always enjoy. It allows me to convert goals into requirements, flows, architecture, components, code and eventually a working product.
This week’s challenge was exactly that: exploring and experimenting with private, multi-user, agentic RAG. I've built a working, expandable, reusable, scalable prototype that can be improved upon in the future. Most importantly, I've found that local, 100% private, agentic LLMs are possible.
Technical learnings
- Postgres + pgvector is powerful. Storing embeddings alongside relational metadata kept everything clean, consistent and easy to query since there was no need for an extra vector database.
- LangGraph makes it surprisingly easy to define an agent workflow, equip it with tools and let the agent decide when to use them
- Private, local, self-hosted agents are feasible. With Ollama running two lightweight models (one for chat, one for embeddings), everything runs on my MacBook with impressive speed
- Building a multi-tenant system with strict data isolation was a lot easier once the architecture was clean and responsibilities were separated across components
- Loose coupling makes it easier to replace and scale components
Next steps
This system is ready for upgrades:
- Incremental re-embedding for documents that change over time (so I can plug in my Obsidian vault seamlessly).
- Citations that point the user to the exact files/pages/chunks the LLM used to answer my question, improving trust and explainability.
- More tools for the agent — from structured summarizers to SQL access. Maybe even ontologies or user profiles?
- A richer frontend with better file management and user experience
I hope this article was as clear as I intended it to be, but if this is not the case, please let me know what I can clarify further. In the meantime, check out my other articles on all kinds of programming-related topics.
Happy coding!
— Mike
P.S.: Like what I'm doing? Follow me!