How to choose the right LLM for your business?
Interview with Nedžad Alibegović, Cloud Services (Backend) Engineer
Everything matters when selecting a large language model (LLM). In addition to proprietary LLMs such as OpenAI's GPT models, the tech world has seen a rapid rise in open-source LLMs that run locally. These models offer full control, complete customization, and training on your own data the way you want.
But are they the best choice for your business or product? In September's issue of Klika's newsletter Techtonic, we sat down with Klika's Cloud Services (Backend) Engineer Nedžad Alibegović to discuss the pros and cons of local LLMs, best practices, and his experience.
What are the advantages and disadvantages of using local LLMs instead of proprietary models such as ChatGPT?
By far, the most significant advantages of using a local model are privacy and data security. When running models locally, we ensure that the data and the models' outputs stay within our infrastructure. We don't even need to put on our tinfoil hats and speculate about whether proprietary models use user data for further training (the answer is yes and no: ChatGPT Enterprise, for example, doesn't use user data for training, and regular consumer plans allow opting out). In many cases, using a third-party data processor is simply not possible due to regulatory compliance requirements such as GDPR, HIPAA, or other user agreements.
Another important aspect of running LLMs locally is that the device we're running on doesn't need internet access. This is beneficial when maximum security is required, but more commonly, it allows embedded or portable devices to run LLMs where internet connectivity is unreliable.
Additionally, running LLMs in-house rather than consuming an API offers resiliency in case of any service issue with an upstream provider. Combine this with the fact that open-source models come in multiple parameter sizes and with the general industry push to include NPUs (Neural Processing Units) on most modern SoCs (Systems on a Chip), and it's easy to imagine a future with truly smart devices all around us, supercharged with open-source generative models.
While proprietary models offer services to customize them for specific use cases (e.g., fine-tuning, RAG), doing so in practice may be difficult. With open-source models, there are usually many more options for customizing and optimizing them for our specific use case. Do we need domain-specific knowledge? Fine-tuning. Does the model need up-to-date information for a given dataset? RAG. Do we need to run the model on a less powerful device? Quantization.
What about the setup and running fees?
While proprietary models may provide all these options, they're not guaranteed. In contrast, with open-source models, we can customize and integrate to our heart's content at a predictable, fixed cost. Running things locally does require a fixed up-front investment (be it buying the hardware or renting a server). That trade-off is more attractive to larger organizations, especially during the initial exploration phases, where training a model or fine-tuning a pre-trained one may incur significant costs, or when dealing with query volumes so high that the pay-as-you-go model no longer makes sense.
As with all managed vs. unmanaged service debates, the same applies to LLMs: the privacy, security, and freedom we gain come at a price. Part of that price is the infrastructure required to run these full-fledged models without lowering their precision.
For example, running the full-fledged Llama 3.1 405B model with FP16 precision would require more than 800 GB of VRAM (that's a lot of GPUs) to load the model into memory! This doesn't make sense for personal use or small organizations, where the pay-as-you-go model is much more attractive. It allows them to have the same capabilities as larger organizations without a large initial investment.
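As a back-of-the-envelope check of that figure, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The sketch below is a rough estimate only; it ignores the KV cache, activations, and framework overhead, which add further headroom on top:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough memory needed just to hold the model weights, in GB."""
    # params_billion * 1e9 params * bytes_per_param bytes / 1e9 bytes-per-GB
    return params_billion * bytes_per_param

# Llama 3.1 405B at FP16 (2 bytes per parameter):
fp16 = weight_memory_gb(405, 2)    # 810 GB, matching the "more than 800 GB" above
# The same model quantized to 4 bits (0.5 bytes per parameter):
int4 = weight_memory_gb(405, 0.5)  # 202.5 GB
```

The same arithmetic shows why quantization matters so much for local deployments: dropping from FP16 to 4-bit cuts the weight footprint by 4x.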
Maintenance and scalability are also factors here, as having on-premises LLM deployments requires staff to maintain everything and significant investments in actual hardware to scale.
What could be considered the optimal approach when training local LLMs? Would you recommend training or optimizing the pre-trained model?
While it's possible to train your LLM from the ground up, optimizing a pre-trained model for a business problem is much more common. We usually use three methods to optimize a pre-trained model: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning.
These three methods don't necessarily have to happen in the simple linear order listed above, but more often than not, they do. If we need to optimize for response accuracy, whether by providing up-to-date information or domain-specific and proprietary business knowledge, RAG is the way to go.
On the other hand, if we need the LLM's output to more consistently follow a specific structure, format, or style, or if reasoning is not applied consistently, we can turn to fine-tuning. The initial prompt also heavily affects both of these optimizations. Generally, prompt engineering is the best place to start: rewrite the prompt, adding context, instructions, or examples, until we get the desired result or at least start moving in the right direction.
Imagine I am a business or product owner deciding which model to use. What questions should I answer to guide me in the right direction?
The fundamentals should be taken care of from the start:
What is the primary task the model needs to accomplish? Are you doing text generation, text analysis, summarization, or sentiment analysis?
What is the level of accuracy you require? Do you want something more predictable and reliable or creative and prone to hallucinations? Is there a need for fast responses or more complex problem-solving?
Do you need domain-specific knowledge? Do you need access to up-to-date information at all times?
Answering the questions above should give you better insight into whether you need a model with strong reasoning capabilities or one with better creative capabilities. It should also tell you how much the model should be locked down and whether you need a high parameter count or something more efficient will do the job.
Finally, these questions should tell us whether we can use a pre-trained model as is or whether we need to augment it with domain-specific knowledge or provide it with access to additional datasets.
After getting the fundamentals right, we can start thinking about privacy and security (how sensitive is the data, do we need to comply with any regulations or other agreements, is it safe to share this data with third parties?), scalability and performance (how many requests will be handled, are there any response-time requirements, how quickly must we scale in case of increased usage?), and budget and maintenance (what's the initial funding, what are the long-term costs, who maintains the infrastructure in case of local deployments, is there vendor lock-in with a specific model, and what about integrations with other services and customizability?).
Regarding usual practices, for which products would you advise using local LLMs right away?
Any product handling highly sensitive data should use local LLMs, as should any healthcare-related applications, financial services, or services that handle personal user data. I would also consider local LLMs in products that need high availability, not necessarily as the primary LLM but as a fallback in case an upstream proprietary model goes down for some reason.
What are your experiences with the speed and ease of integration of local LLMs?
When integrating a local LLM into a product stack, we have many more customization options than with proprietary models, as we aren't limited by their respective APIs. Depending on the requirements, running open-source models locally on our own hardware may not even be necessary, as many cloud providers offer managed solutions for running open-source models (Amazon Bedrock, for example), where we can enjoy the benefits of open-source models with the pay-as-you-go pricing structure of proprietary ones.
Many local LLM runners (one of the most popular being Ollama) also provide OpenAI-compatible APIs, so migrating from ChatGPT is as easy as changing the request URL. Additionally, changing models when running locally is as easy as pushing a button, allowing us to quickly iterate and test different models for a given product stack.
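As an illustration of that migration path, the sketch below builds an OpenAI-style chat completion request. Switching from OpenAI's hosted API to a local Ollama instance changes only the base URL and the model name; the request shape stays the same (the port 11434 and `/v1/chat/completions` path are Ollama's defaults for its OpenAI-compatible API, and the `llama3.1:8b` model tag is one example):

```python
import json

OPENAI_URL = "https://api.openai.com/v1/chat/completions"
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint

def build_chat_request(base_url: str, model: str, user_message: str):
    """Build an OpenAI-style chat completion request as (URL, JSON body)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return base_url, json.dumps(body)

# The request body is identical for both providers; only the URL and model tag change.
url, payload = build_chat_request(OLLAMA_URL, "llama3.1:8b", "Hello!")
```

In practice, most OpenAI client SDKs expose a configurable base URL, so the same one-line change applies there as well.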
At this point, can local and open-source LLMs be considered a threat to giants such as OpenAI? If not, what goals should local LLMs prioritize next to become viable competitors?
The answer is: it depends. When looking at use cases where LLMs are integrated into existing products for specific tasks, I would say it's close, and there's a genuine case to be made for open-source LLMs, especially when fine-tuning them on product-specific data.
On the other hand, proprietary models still have the edge regarding intelligence and reasoning in general. This also comes down to who's backing these models, what data they have access to, and, mainly, how many resources those entities have.
While any organization can compete in training its own LLMs on a small scale, training a 'generalist' billion-parameter model requires vast amounts of computing power and training data, some of which giants like OpenAI, Google, and Meta have exclusive access to.
I believe one of the biggest hurdles open-source LLMs will face is access to the kind of training data proprietary models have. While many open-source LLMs are open-weight as well as open-source, the training data is in many cases not available to the public; it is often copyrighted and licensed exclusively to the entity that trained the LLM.
However, if someone decides to use a proprietary model such as ChatGPT, which one would you recommend and why?
ChatGPT has become synonymous with LLMs. It's the go-to option for everyone and everything. It's multimodal (it can see, hear, and speak), its API features fine-tuning and RAG capabilities, and OpenAI recently introduced the o1 models that perform complex reasoning.
Another interesting model is Anthropic's Claude; its writing style is slightly less robotic than ChatGPT's. It trades blows with ChatGPT-4 in several benchmarks. Code generation seems to be a bit better on Claude. It features a larger context window of 200K tokens (ChatGPT 4o is 128K tokens).
A downside of Anthropic's API is that it doesn't natively support RAG and fine-tuning like OpenAI's offerings (but it offers its model on Amazon Bedrock, where RAG and fine-tuning are supported).
You had hands-on experience with a local LLM when your team implemented one in a POC. Could you walk us through the procedure: what purpose the local LLM served, what challenges were met or overcome, and how?
Much like with the earlier question about what to answer before choosing a model, we went through the same questionnaire and identified the fundamentals we needed to cover. For our specific use case, we needed to extract metadata from user-provided CSV files, so the required level of accuracy was very high, with consistently reproducible outputs and quick responses.
These files contain user data, so we also had to respect privacy requirements. Following these guidelines, we opted to use Llama 3.1 8B, running locally to keep our data private and protected. One of the first hurdles we encountered was hallucinations and inconsistent outputs; we fixed these by rewriting our system prompt, lowering the model's temperature, and adding a seed to get reproducible outputs.
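Concretely, those determinism fixes map to two generation options. A hypothetical sketch (Ollama and most local runners expose `temperature` and `seed` as generation parameters; the values here are illustrative, not what the team necessarily used):

```python
def deterministic_options(seed: int = 42) -> dict:
    """Generation options that make a local model's output reproducible."""
    return {
        "temperature": 0,  # greedy decoding: removes sampling randomness
        "seed": seed,      # fixed RNG seed: identical prompt -> identical output
    }

opts = deterministic_options()
```

With temperature at 0 and a pinned seed, repeated runs over the same CSV produce the same metadata, which is what makes the extraction pipeline testable.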
After that, we went through multiple iterations of the system prompt to fine-tune the output. Another issue we encountered was context-length limits: it's easy to overwhelm the 8K-token limit of Llama 3.1 8B with a large CSV file.
We are currently truncating files to fit this limit, but we're actively working on a RAG approach that lets the LLM retrieve information from the entire file rather than just a truncated portion. In the future, we're also looking into fine-tuning, training on files that users upload to our platform.
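The truncation step can be sketched as follows, using the common rough heuristic of about four characters per token for English text. This is an assumption for illustration: an exact count would come from running the model's actual tokenizer, so a real deployment should leave a safety margin below the context limit:

```python
def truncate_to_token_budget(text: str, max_tokens: int = 8000,
                             chars_per_token: int = 4) -> str:
    """Crudely truncate text to fit within a model's context window.

    Uses the ~4 chars/token heuristic; for exact counts, run the
    model's own tokenizer instead.
    """
    return text[: max_tokens * chars_per_token]

# A 100,000-character CSV gets cut to roughly 8,000 tokens' worth of text.
snippet = truncate_to_token_budget("a" * 100_000)
```

A RAG pipeline replaces this blunt cut by chunking the file, embedding the chunks, and retrieving only the most relevant ones into the prompt.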
How do you see the future perspective of local/open-source versus proprietary regarding costs and privacy amid current sociopolitical changes?
With the industry-wide push for AI and the technological advances that let portable devices run on-device models, the spotlight will shift to compact, more energy-efficient models. While some of these will be proprietary, I think this push will also accelerate the release of open-source models.
We already have impressive open-source models that can run on consumer hardware; if the big players join in to create these smaller models, we will see a renaissance of sorts for small LLMs. This, in turn, would allow more ubiquitous integration of LLMs for specific tasks throughout existing product stacks.