Prompt Engineering for Smaller LLMs: Tips for Developers
Master prompt engineering techniques for smaller language models to achieve accurate results in on-device and server-side AI applications.

The effectiveness of a large language model (LLM) hinges on the quality of instructions provided. Prompt engineering is the art and science of crafting prompts to elicit the most accurate and useful responses from an LLM. This process is critical when integrating LLM-based features into web applications, ensuring optimal performance and user satisfaction.
Prompt engineering is inherently iterative. If you've worked with various LLMs, you've likely noticed that refining prompts significantly improves output quality. This principle applies across models of all sizes, but it becomes especially crucial when dealing with smaller LLMs that have limited computational power and knowledge bases compared to their larger counterparts.
Chat interfaces powered by large LLMs, such as Gemini or ChatGPT, often deliver satisfactory results with minimal prompting effort. However, smaller, non-fine-tuned LLMs require a more tailored approach to achieve comparable results due to their reduced capacity and narrower knowledge pools.
What Are "Smaller LLMs"?
Defining the size of an LLM can be complex, as model parameters are not always publicly disclosed by developers. In this article, "smaller LLMs" refer to models with fewer than 30 billion parameters. As of today, models ranging from a few million to a few billion parameters can be executed in web browsers or on consumer-grade devices, making them ideal for specific use cases.
Where Are Smaller LLMs Used?
Smaller LLMs are particularly valuable in scenarios where computational resources, privacy, or latency are concerns. Below are two primary use cases:
On-Device/In-Browser Generative AI
Smaller LLMs are well-suited for on-device or in-browser applications due to their lower resource demands. For example:
- Gemma 2B with MediaPipe's LLM Inference API supports CPU-only devices, enabling lightweight AI features.
- Phi-2 with Transformers.js allows browser-based inference, ideal for privacy-sensitive applications.
Running smaller models on user devices ensures reasonable download sizes and compatibility with memory and CPU/GPU constraints, making them perfect for web-based AI features.
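For example, a minimal in-browser setup with Transformers.js might look like the following sketch; the model name and generation options are illustrative, so check the library's documentation for what's currently supported:

```javascript
// Minimal in-browser text generation with Transformers.js (inside an ES module).
// The model name and options below are illustrative, not a recommendation.
import { pipeline } from '@xenova/transformers';

// Download and cache the model in the browser, then build a text-generation pipeline.
const generator = await pipeline('text-generation', 'Xenova/phi-2');

const prompt = 'Rate this product review from 1 to 5: "Absolutely love the fit!"';
const output = await generator(prompt, { max_new_tokens: 64 });

console.log(output[0].generated_text);
```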
Custom Server-Side Generative AI
Small open-weight models, such as Gemma 2B, Gemma 9B, or Gemma 27B, can be deployed on custom servers. These models are flexible, allowing developers to fine-tune them for specific tasks, enhancing performance for niche applications.
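From a web application, calling a self-hosted model is typically a plain HTTP request. The endpoint URL, payload, and response shape below are hypothetical and depend entirely on the serving stack you choose:

```javascript
// Hypothetical request to a self-hosted small-model endpoint.
// The URL, payload, and response shape depend on your serving stack.
async function rateReview(review) {
  const response = await fetch('https://llm.internal.example.com/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      prompt: `Rate this product review from 1 to 5: ${review}`,
      maxTokens: 32,
    }),
  });
  const { text } = await response.json();
  return text;
}
```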
Getting Started with Prompt Engineering
To achieve optimal results with smaller LLMs, prompts must be detailed, specific, and carefully structured. Unlike larger models, smaller LLMs often struggle with vague or underspecified prompts, leading to inaccurate or poorly formatted outputs.
Simple Prompt Example
Consider a basic prompt for rating a product based on a user review. A minimal version might look something like this (the wording here is illustrative):
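```
Based on this product review, give a rating as an integer from 1 to 5.

Review: <review text>
```

With a prompt along these lines, a larger and a smaller model respond quite differently: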
| Input (review) | Larger LLM (Gemini 1.5) | Smaller LLM (Gemma 2B) |
| --- | --- | --- |
| Absolutely love the fit! Distributes weight well and surprisingly comfortable even on all-day treks. Would recommend. | 5 | 4 out of 5 stars |
| The straps are flimsy, and they started digging into my shoulders under heavy loads. | 1 | 2/5 |
The larger LLM (Gemini 1.5) correctly outputs an integer, while the smaller LLM (Gemma 2B) produces an incorrect format ("4 out of 5 stars" or "2/5") and less accurate ratings. This highlights the need for refined prompts when working with smaller models.
Improved Prompt with Few-Shot and Chain-of-Thought
To address these issues, use techniques like few-shot prompting (providing examples) and chain-of-thought prompting (guiding the model to reason through the task). Additionally, include explicit format instructions and emphasize the full rating range.
Here's one way such an improved prompt might be structured. The exact wording is illustrative; the key ingredients are the explicit output format, the reminder to use the full 1-to-5 range, and a couple of worked examples:
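```
Analyze the product review, then give a rating as an integer between 1 and 5.
1 is the worst rating and 5 is the best. Use the full range of ratings.
Respond in exactly this format:
Analysis: <one-sentence analysis of the review>
Rating: <integer from 1 to 5>

Review: The zipper broke after two days and the fabric is already fraying.
Analysis: The review only mentions serious quality problems, indicating strong dissatisfaction.
Rating: 1

Review: Holds everything I need for a weekend trip and looks great. Very happy with it.
Analysis: The review is entirely positive about capacity and appearance, indicating strong satisfaction.
Rating: 5

Review: <review text>
```

With a prompt like this, Gemma 2B produces both a short analysis and a correctly formatted rating: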
| Input (review) | Output with Smaller LLM (Gemma 2B) |
| --- | --- |
| Absolutely love the fit! Distributes weight well and surprisingly comfortable even on all-day treks. Would recommend. | Analysis: The review is highly positive, emphasizing fit and comfort, indicating great satisfaction.<br>Rating: 5 |
| The straps are flimsy, and they started digging into my shoulders under heavy loads. | Analysis: The review highlights a significant comfort issue with the straps, unsuitable for heavy loads.<br>Rating: 1 |
This refined prompt, leveraging few-shot examples and chain-of-thought reasoning, produces accurate ratings and consistent formats.
Manual Output Parsing
Smaller LLMs often produce less consistently formatted output, and with chain-of-thought prompting the response deliberately contains both the analysis and the rating. Developers therefore need to parse the output themselves to extract the value they actually need.
Here's a minimal JavaScript sketch of such a parser, assuming the prompt instructs the model to end its response with a "Rating:" line:
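```javascript
// Extract the numeric rating from a chain-of-thought style response,
// e.g. "Analysis: ...\nRating: 5". Returns null when no valid rating is found.
function parseRating(output) {
  // Match "Rating:" (case-insensitive) followed by the first integer after it.
  const match = output.match(/rating:\s*(\d+)/i);
  if (!match) {
    return null; // Let the caller decide how to handle an unparseable response.
  }
  const rating = Number.parseInt(match[1], 10);
  // Reject values outside the expected 1 to 5 range.
  return rating >= 1 && rating <= 5 ? rating : null;
}

// Usage:
// parseRating('Analysis: Highly positive review.\nRating: 5'); // 5
// parseRating('I would say 4 out of 5 stars');                 // null
```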
A parser like this returns a predictable value when the format matches and a clear signal (null) when it doesn't, so the application can retry or fall back gracefully.
API Differences
Cloud APIs for larger LLMs, such as the Gemini API or OpenAI, often include advanced features like system instructions or JSON mode. In contrast, in-browser APIs for smaller LLMs, like MediaPipe LLM Inference or Transformers.js, are leaner and may lack these capabilities. Developers must account for these differences when designing prompts.
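When a feature such as system instructions isn't available, you can often approximate it by folding the instruction into the prompt text itself. The helper below is a hypothetical sketch rather than part of any particular API:

```javascript
// Hypothetical helper: emulate a "system instruction" by prepending it to the
// prompt, for runtimes that only accept a single block of prompt text.
function buildPrompt(systemInstruction, examples, review) {
  return [
    systemInstruction,   // e.g. "You rate product reviews on a scale of 1 to 5."
    ...examples,         // worked few-shot examples as plain text
    `Review: ${review}`,
    'Analysis:',
  ].join('\n\n');
}
```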
Token Limits
Smaller LLMs typically have lower input token limits (e.g., Gemma: 8K tokens; Gemini 1.5 Pro: 1M+ tokens). Detailed prompts, including examples, consume more tokens, increasing the risk of hitting limits.
Use a token estimation function to stay within bounds. A rough character-based check is often enough to flag prompts that are clearly too long; the sketch below assumes about four characters per token (a heuristic), uses illustrative limit constants, and should be replaced by your inference API's own token counter if it provides one:
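```javascript
// Rough token estimate: English text often averages about four characters per
// token. This is a heuristic, not an exact count.
function estimateTokenCount(text) {
  return Math.ceil(text.length / 4);
}

const MAX_INPUT_TOKENS = 8192;      // Illustrative limit for a Gemma-sized model.
const RESERVED_OUTPUT_TOKENS = 256; // Leave headroom for the model's response.

function fitsTokenBudget(prompt) {
  return estimateTokenCount(prompt) + RESERVED_OUTPUT_TOKENS <= MAX_INPUT_TOKENS;
}
```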
Time Estimates
Prompt engineering for smaller LLMs requires more time for design, testing, and validation due to API differences, token constraints, and output inconsistencies. Factor this into project timelines to ensure robust implementation.
Prompt Engineering vs. Fine-Tuning
For web developers, prompt engineering is often more practical than fine-tuning, especially for smaller LLMs.
When to Fine-Tune
- High Accuracy Needs: Fine-tuning optimizes model parameters for specific tasks.
- Labeled Data Available: Requires well-curated, task-specific data.
- Repetitive Tasks: Fine-tuning is efficient for repeated, consistent use cases.
When to Use Prompt Engineering
- Rapid Prototyping: Quickly test ideas without training overhead.
- Limited Data: Suitable when labeled data is scarce.
- Dynamic Use Cases: Ideal for frequently changing requirements.
- Resource Constraints: Avoids the need for training infrastructure.
- Fast Deployment: Enables quick integration into applications.
Key Takeaways
- Detailed Prompts for Smaller LLMs: Craft specific, structured prompts to compensate for limited capabilities.
- Leverage Few-Shot and Chain-of-Thought: Improve accuracy with examples and reasoning steps.
- Plan for Manual Parsing: Handle inconsistent outputs with robust parsing logic.
- Account for API and Token Limits: Adapt to leaner APIs and smaller token windows.
- Allocate Extra Testing Time: Ensure prompts and outputs meet requirements.
- Consider Fine-Tuning for Production: Use for specialized, high-accuracy tasks.
As a Software Engineer at Fab Web Studio, I've found these tips transform how I integrate AI features. Experiment with smaller LLMs on your projects—they're efficient and privacy-friendly. Reach out via fabwebstudio.com for custom web solutions!