The full details of this project can be found at the GitHub repository: https://github.com/jianyangg/local-llm.
your conversational assistant, powered by your documents
Uploading highly sensitive documents to the cloud carries inherent cybersecurity risks. Storing such data on-premises and utilising Large Language Models (LLMs) to synthesise knowledge from it can streamline workflows while mitigating the security concerns associated with sensitive documents.
Existing open-source RAG applications often lack multi-tenant support and rely on traditional document storage methods like vector databases, potentially limiting their performance. This project aims to address these limitations by developing a solution tailored for multi-tenant environments and exploring alternative approaches to document storage for improved performance.
Architecture overview: Neo4j and concurrent instances of Ollama running Meta’s Llama 3.1.
The initial two weeks focused on understanding Large Language Models (LLMs), Retrieval Augmented Generation (RAG), and data retrieval mechanisms, with vector databases as a key concept. Resources explored included LangGraph, Ollama, Meta’s Llama 3.1, Neo4j, and Docker. This foundational research built the essential knowledge for the practical work in subsequent weeks.
During these weeks, I built a basic proof-of-concept for a multi-tenant RAG application that runs entirely locally while serving multiple tenants concurrently. Here is a detailed breakdown of each aspect:
To support multi-tenancy, I ran multiple instances of a Large Language Model (LLM) on-premises. I evaluated options like HuggingFace’s Text Generation Inference and vLLM, ultimately choosing Ollama for its simplicity and ease of managing multiple LLM instances. Ollama’s concurrent inference feature aligned with multi-tenancy requirements, as it determines system capacity before initiating new LLM instances. Meta’s Llama 3.1 model (8 billion parameters) was selected, balancing performance with resource efficiency.
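To make the setup concrete, here is a minimal sketch of concurrent tenant requests against a single Ollama server. The host, model tag, and prompts are illustrative assumptions; in practice, Ollama’s OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS settings govern how much concurrency the server accepts before queueing requests.

```python
# Minimal sketch: two tenants querying one Ollama server concurrently.
# Start the server with concurrency enabled, e.g.:
#   OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve
from concurrent.futures import ThreadPoolExecutor

import ollama  # pip install ollama

client = ollama.Client(host="http://localhost:11434")  # default local endpoint

def answer(tenant_id: str, question: str) -> str:
    """Send one tenant's question to the shared Llama 3.1 8B model."""
    response = client.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": question}],
    )
    return f"[{tenant_id}] {response['message']['content']}"

# Illustrative prompts standing in for real tenant queries.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(answer, "tenant-a", "Summarise document X."),
        pool.submit(answer, "tenant-b", "List the key risks in document Y."),
    ]
    for future in futures:
        print(future.result())
```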
Document chunking was explored to improve retrieval efficiency. Fixed-length chunking was tested first but proved inefficient: chunks often mixed unrelated topics, which confused the retriever. After experimentation, I chose LLMSherpa for its superior OCR capabilities when parsing documents. LLMSherpa identifies text and tables within a PDF and organizes the content hierarchically by recognizing section headers, thus preserving essential context.
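As a sketch of what this chunking step might look like in code, assuming a locally hosted LLMSherpa parser (the nlm-ingestor URL and file name below are illustrative):

```python
# Minimal sketch of layout-aware chunking with LLMSherpa.
from llmsherpa.readers import LayoutPDFReader  # pip install llmsherpa

# Assumes an nlm-ingestor parser running locally; the URL is an example.
parser_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
reader = LayoutPDFReader(parser_url)

doc = reader.read_pdf("sensitive_report.pdf")  # hypothetical document

# Each chunk keeps its place in the section hierarchy, so related
# content stays together instead of being cut at arbitrary lengths.
for chunk in doc.chunks():
    # to_context_text() prepends the parent section headers to the chunk body.
    print(chunk.to_context_text())
```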
Tenant-specific data access required isolated vector indexes within a single database. Neo4j was selected for its combined knowledge graph and vector database capabilities, but because its community edition lacks built-in multi-tenancy support, I implemented a unique tenant identification system based on hashed credentials to keep each tenant’s data secure.
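The sketch below shows one way such isolation could look: hash the credentials into an opaque tenant tag, then give each tenant its own node label and vector index. The URI, credentials, and embedding dimension are illustrative assumptions, not the project’s exact values.

```python
# Minimal sketch: per-tenant labels and vector indexes in one Neo4j database.
import hashlib

from neo4j import GraphDatabase  # pip install neo4j

def tenant_tag(username: str, password: str) -> str:
    """Derive a stable, opaque tenant identifier from hashed credentials."""
    digest = hashlib.sha256(f"{username}:{password}".encode()).hexdigest()
    return digest[:12]  # a short prefix keeps label and index names readable

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

tag = tenant_tag("alice", "s3cret")
# Index and label names cannot be Cypher parameters, so they are interpolated;
# tag is a hex digest, which keeps the string safe to embed.
driver.execute_query(
    f"""
    CREATE VECTOR INDEX chunks_{tag} IF NOT EXISTS
    FOR (c:Chunk_{tag}) ON (c.embedding)
    OPTIONS {{indexConfig: {{
        `vector.dimensions`: 768,
        `vector.similarity_function`: 'cosine'
    }}}}
    """
)
```

Because every query is routed through the tenant’s own label and index, one tenant cannot retrieve another tenant’s chunks even though all data lives in a single community-edition database.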
Enhanced the Streamlit UI for front-end document uploads and containerized the application using Docker Compose for one-click deployment.
Implemented BERTopic for topic extraction, enabling topic-based retrieval alongside vector database search.
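A minimal sketch of that idea, with a toy corpus standing in for real document chunks (BERTopic needs a reasonably large corpus to find stable topics, so the tiny example below is purely illustrative):

```python
# Minimal sketch: tag chunks with BERTopic topics, then filter retrieval by topic.
from bertopic import BERTopic  # pip install bertopic

chunks = [
    "Quarterly revenue grew 12% on cloud sales.",
    "Operating margin improved due to lower costs.",
    "The forecast assumes stable exchange rates.",
    "The firewall policy blocks inbound traffic by default.",
    "VPN access requires two-factor authentication.",
    "Security patches are applied within 48 hours.",
] * 20  # repeated only to give the model enough documents to cluster

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(chunks)

# At query time, map the question to its nearest topics and restrict the
# vector search to chunks tagged with those topics.
similar_topics, _ = topic_model.find_topics("network security rules", top_n=2)
candidates = [c for c, t in zip(chunks, topics) if t in similar_topics]
```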
Integrated bounding boxes to allow users to verify answers against original document sources.
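The project’s exact rendering code isn’t shown here, but PyMuPDF offers one way to draw such boxes; the file name and coordinates below are illustrative:

```python
# Sketch: outline a retrieved chunk's bounding box on its source page.
import fitz  # pip install pymupdf

doc = fitz.open("sensitive_report.pdf")  # hypothetical document
page = doc[0]

# Bounding box of the retrieved chunk in PDF points: (x0, y0, x1, y1).
bbox = fitz.Rect(72, 144, 520, 210)
page.draw_rect(bbox, color=(1, 0, 0), width=1.5)  # red outline around the source

doc.save("sensitive_report_annotated.pdf")
```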
Developed an actor-critic feedback loop in which a rephraser agent and a critic agent iteratively improve the quality of generated answers. The exact flow is shown in the architecture diagram above.
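Since the project explored LangGraph, the loop can be sketched as a small graph; the node logic is stubbed out here, and the retry budget of three attempts is an assumption for the example.

```python
# Minimal LangGraph sketch of the actor-critic loop: generate -> critique,
# then either accept or rephrase the question and try again.
from typing import TypedDict

from langgraph.graph import END, StateGraph  # pip install langgraph

class LoopState(TypedDict):
    question: str
    answer: str
    verdict: str
    attempts: int

def generate(state: LoopState) -> dict:
    # Placeholder: the real node retrieves context and asks the LLM to answer.
    return {"answer": f"draft answer to: {state['question']}",
            "attempts": state["attempts"] + 1}

def critic(state: LoopState) -> dict:
    # Placeholder: the real critic prompts the LLM to accept or reject the draft.
    return {"verdict": "accept" if len(state["answer"]) > 20 else "revise"}

def rephrase(state: LoopState) -> dict:
    # Placeholder: the real rephraser rewrites the question before retrying.
    return {"question": f"(rephrased) {state['question']}"}

def route(state: LoopState) -> str:
    # Stop on acceptance or once the assumed budget of 3 attempts is spent.
    if state["verdict"] == "accept" or state["attempts"] >= 3:
        return END
    return "rephrase"

graph = StateGraph(LoopState)
graph.add_node("generate", generate)
graph.add_node("critic", critic)
graph.add_node("rephrase", rephrase)
graph.set_entry_point("generate")
graph.add_edge("generate", "critic")
graph.add_conditional_edges("critic", route)
graph.add_edge("rephrase", "generate")

app = graph.compile()
result = app.invoke(
    {"question": "What are the key risks?", "answer": "", "verdict": "", "attempts": 0}
)
print(result["answer"])
```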
Finalized documentation, created presentation slides, and made the project repository accessible on GitHub: https://github.com/jianyangg/local-llm
The actor-critic workflow, though accurate, slows down execution. Exploring hardware acceleration options could improve performance.
Current OCR-based chunking could be enhanced with ensemble or semantic methods for better retrieval and answer generation.
Investigating flexible document chunk limits may prevent exclusion of relevant information in large documents.
Adopting a more objective evaluation framework such as Ragas could make performance assessment more rigorous.
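As a rough illustration of what that could look like (the sample data is invented, and Ragas needs a judge LLM configured, which defaults to OpenAI unless overridden):

```python
# Sketch: scoring one RAG exchange with Ragas' faithfulness and relevancy metrics.
from datasets import Dataset  # pip install ragas datasets

from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

samples = Dataset.from_dict({
    "question": ["What model powers the assistant?"],
    "answer": ["It runs Meta's Llama 3.1 8B locally via Ollama."],
    "contexts": [["The system serves Llama 3.1 (8B) through Ollama on-premises."]],
})

scores = evaluate(samples, metrics=[faithfulness, answer_relevancy])
print(scores)
```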
Further research is needed to improve Cypher command generation for a more accurate knowledge graph structure.
The project demonstrates a viable multi-tenant RAG prototype whose actor-critic framework and topic modeling enhance retrieval and answer quality. Future improvements will focus on refining parsing and chunking methods for better results.
I am deeply grateful for the opportunity to work at Defence Science & Technology Agency (Enterprise Digital Services). Special mention to my internship mentors Yong Han Ching and Benjamin Lau Chueng Kiat for their unwavering guidance, support, and encouragement throughout my internship. Your expertise and insights have been instrumental in this journey, and I am truly appreciative of the time and effort you invested in my development.
I am also thankful to the entire team at Enterprise Digital Services for fostering a welcoming and collaborative environment. It was truly enjoyable working there.