Prompt Engineering for Smaller LLMs: Tips for Developers
Master prompt engineering techniques for smaller language models to achieve accurate results in on-device and server-side AI applications.

The effectiveness of a large language model (LLM) hinges on the quality of instructions provided. Prompt engineering is the art and science of crafting prompts to elicit the most accurate and useful responses from an LLM. This process is critical when integrating LLM-based features into web applications, ensuring optimal performance and user satisfaction.
Prompt engineering is inherently iterative. If you've worked with various LLMs, you've likely noticed that refining prompts significantly improves output quality. This principle applies across models of all sizes, but it becomes especially crucial when dealing with smaller LLMs that have limited computational power and knowledge bases compared to their larger counterparts.
Chat interfaces powered by large LLMs, such as Gemini or ChatGPT, often deliver satisfactory results with minimal prompting effort. However, smaller, non-fine-tuned LLMs require a more tailored approach to achieve comparable results due to their reduced capacity and narrower knowledge pools.
What Are "Smaller LLMs"?
Defining the size of an LLM can be complex, as model parameters are not always publicly disclosed by developers. In this article, "smaller LLMs" refer to models with fewer than 30 billion parameters. As of today, models ranging from a few million to a few billion parameters can be executed in web browsers or on consumer-grade devices, making them ideal for specific use cases.
Where Are Smaller LLMs Used?
Smaller LLMs are particularly valuable in scenarios where computational resources, privacy, or latency are concerns. Below are two primary use cases:
On-Device/In-Browser Generative AI
Smaller LLMs are well-suited for on-device or in-browser applications due to their lower resource demands. For example:
- Gemma 2B with MediaPipe's LLM Inference API supports CPU-only devices, enabling lightweight AI features.
- Phi-2 with Transformers.js allows browser-based inference, ideal for privacy-sensitive applications.
Running smaller models on user devices ensures reasonable download sizes and compatibility with memory and CPU/GPU constraints, making them perfect for web-based AI features.
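For example, a minimal in-browser setup with Transformers.js might look like the following sketch; the model name and generation options are illustrative, so check the library's documentation for what's currently supported:

```javascript
// Minimal in-browser text generation with Transformers.js (inside an ES module).
// The model name and options below are illustrative, not a recommendation.
import { pipeline } from '@xenova/transformers';

// Download and cache the model in the browser, then build a text-generation pipeline.
const generator = await pipeline('text-generation', 'Xenova/phi-2');

const prompt = 'Rate this product review from 1 to 5: "Absolutely love the fit!"';
const output = await generator(prompt, { max_new_tokens: 64 });

console.log(output[0].generated_text);
```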
Custom Server-Side Generative AI
Small open-weight models, such as Gemma 2B, Gemma 9B, or Gemma 27B, can be deployed on custom servers. These models are flexible, allowing developers to fine-tune them for specific tasks, enhancing performance for niche applications.
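From a web application, calling a self-hosted model is typically a plain HTTP request. The endpoint URL, payload, and response shape below are hypothetical and depend entirely on the serving stack you choose:

```javascript
// Hypothetical request to a self-hosted small-model endpoint.
// The URL, payload, and response shape depend on your serving stack.
async function rateReview(review) {
  const response = await fetch('https://llm.internal.example.com/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      prompt: `Rate this product review from 1 to 5: ${review}`,
      maxTokens: 32,
    }),
  });
  const { text } = await response.json();
  return text;
}
```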
Getting Started with Prompt Engineering
To achieve optimal results with smaller LLMs, prompts must be detailed, specific, and carefully structured. Unlike larger models, smaller LLMs often struggle with vague or underspecified prompts, leading to inaccurate or poorly formatted outputs.
Simple Prompt Example
Consider a basic prompt for rating a product based on a user review. A minimal version might look something like this (the wording here is illustrative):
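```
Based on this product review, give a rating as an integer from 1 to 5.

Review: <review text>
```

With a prompt along these lines, a larger and a smaller model respond quite differently: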
| Input (review) | Larger LLM (Gemini 1.5) | Smaller LLM (Gemma 2B) |
| --- | --- | --- |
| Absolutely love the fit! Distributes weight well and surprisingly comfortable even on all-day treks. Would recommend. | 5 | 4 out of 5 stars |
| The straps are flimsy, and they started digging into my shoulders under heavy loads. | 1 | 2/5 |
The larger LLM (Gemini 1.5) correctly outputs an integer, while the smaller LLM (Gemma 2B) produces an incorrect format ("4 out of 5 stars" or "2/5") and less accurate ratings. This highlights the need for refined prompts when working with smaller models.
Improved Prompt with Few-Shot and Chain-of-Thought
To address these issues, use techniques like few-shot prompting (providing examples) and chain-of-thought prompting (guiding the model to reason through the task). Additionally, include explicit format instructions and emphasize the full rating range.
Here's one way such an improved prompt might be structured. The exact wording is illustrative; the key ingredients are the explicit output format, the reminder to use the full 1-to-5 range, and a couple of worked examples:
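```
Analyze the product review, then give a rating as an integer between 1 and 5.
1 is the worst rating and 5 is the best. Use the full range of ratings.
Respond in exactly this format:
Analysis: <one-sentence analysis of the review>
Rating: <integer from 1 to 5>

Review: The zipper broke after two days and the fabric is already fraying.
Analysis: The review only mentions serious quality problems, indicating strong dissatisfaction.
Rating: 1

Review: Holds everything I need for a weekend trip and looks great. Very happy with it.
Analysis: The review is entirely positive about capacity and appearance, indicating strong satisfaction.
Rating: 5

Review: <review text>
```

With a prompt like this, Gemma 2B produces both a short analysis and a correctly formatted rating: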
| Input (review) | Output with Smaller LLM (Gemma 2B) |
| --- | --- |
| Absolutely love the fit! Distributes weight well and surprisingly comfortable even on all-day treks. Would recommend. | Analysis: The review is highly positive, emphasizing fit and comfort, indicating great satisfaction.<br>Rating: 5 |
| The straps are flimsy, and they started digging into my shoulders under heavy loads. | Analysis: The review highlights a significant comfort issue with the straps, unsuitable for heavy loads.<br>Rating: 1 |
This refined prompt, leveraging few-shot examples and chain-of-thought reasoning, produces accurate ratings and consistent formats.
Manual Output Parsing
Smaller LLMs often produce less consistently formatted output, and with chain-of-thought prompting the response deliberately contains both the analysis and the rating. Developers therefore need to parse the output themselves to extract the value they actually need.
Here's a minimal JavaScript sketch of such a parser, assuming the prompt instructs the model to end its response with a "Rating:" line:
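```javascript
// Extract the numeric rating from a chain-of-thought style response,
// e.g. "Analysis: ...\nRating: 5". Returns null when no valid rating is found.
function parseRating(output) {
  // Match "Rating:" (case-insensitive) followed by the first integer after it.
  const match = output.match(/rating:\s*(\d+)/i);
  if (!match) {
    return null; // Let the caller decide how to handle an unparseable response.
  }
  const rating = Number.parseInt(match[1], 10);
  // Reject values outside the expected 1 to 5 range.
  return rating >= 1 && rating <= 5 ? rating : null;
}

// Usage:
// parseRating('Analysis: Highly positive review.\nRating: 5'); // 5
// parseRating('I would say 4 out of 5 stars');                 // null
```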
A parser like this returns a predictable value when the format matches and a clear signal (null) when it doesn't, so the application can retry or fall back gracefully.
API Differences
Cloud APIs for larger LLMs, such as the Gemini API or OpenAI, often include advanced features like system instructions or JSON mode. In contrast, in-browser APIs for smaller LLMs, like MediaPipe LLM Inference or Transformers.js, are leaner and may lack these capabilities. Developers must account for these differences when designing prompts.
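When a feature such as system instructions isn't available, you can often approximate it by folding the instruction into the prompt text itself. The helper below is a hypothetical sketch rather than part of any particular API:

```javascript
// Hypothetical helper: emulate a "system instruction" by prepending it to the
// prompt, for runtimes that only accept a single block of prompt text.
function buildPrompt(systemInstruction, examples, review) {
  return [
    systemInstruction,   // e.g. "You rate product reviews on a scale of 1 to 5."
    ...examples,         // worked few-shot examples as plain text
    `Review: ${review}`,
    'Analysis:',
  ].join('\n\n');
}
```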
Token Limits
Smaller LLMs typically have lower input token limits (e.g., Gemma: 8K tokens; Gemini 1.5 Pro: 1M+ tokens). Detailed prompts, including examples, consume more tokens, increasing the risk of hitting limits.
Use a token estimation function to stay within bounds. A rough character-based check is often enough to flag prompts that are clearly too long; the sketch below assumes about four characters per token (a heuristic), uses illustrative limit constants, and should be replaced by your inference API's own token counter if it provides one:
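```javascript
// Rough token estimate: English text often averages about four characters per
// token. This is a heuristic, not an exact count.
function estimateTokenCount(text) {
  return Math.ceil(text.length / 4);
}

const MAX_INPUT_TOKENS = 8192;      // Illustrative limit for a Gemma-sized model.
const RESERVED_OUTPUT_TOKENS = 256; // Leave headroom for the model's response.

function fitsTokenBudget(prompt) {
  return estimateTokenCount(prompt) + RESERVED_OUTPUT_TOKENS <= MAX_INPUT_TOKENS;
}
```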
Time Estimates
Prompt engineering for smaller LLMs requires more time for design, testing, and validation due to API differences, token constraints, and output inconsistencies. Factor this into project timelines to ensure robust implementation.
Prompt Engineering vs. Fine-Tuning
For web developers, prompt engineering is often more practical than fine-tuning, especially for smaller LLMs.
When to Fine-Tune
- High Accuracy Needs: Fine-tuning optimizes model parameters for specific tasks.
- Labeled Data Available: Requires well-curated, task-specific data.
- Repetitive Tasks: Fine-tuning is efficient for repeated, consistent use cases.
When to Use Prompt Engineering
- Rapid Prototyping: Quickly test ideas without training overhead.
- Limited Data: Suitable when labeled data is scarce.
- Dynamic Use Cases: Ideal for frequently changing requirements.
- Resource Constraints: Avoids the need for training infrastructure.
- Fast Deployment: Enables quick integration into applications.
Key Takeaways
- Detailed Prompts for Smaller LLMs: Craft specific, structured prompts to compensate for limited capabilities.
- Leverage Few-Shot and Chain-of-Thought: Improve accuracy with examples and reasoning steps.
- Plan for Manual Parsing: Handle inconsistent outputs with robust parsing logic.
- Account for API and Token Limits: Adapt to leaner APIs and smaller token windows.
- Allocate Extra Testing Time: Ensure prompts and outputs meet requirements.
- Consider Fine-Tuning for Production: Use for specialized, high-accuracy tasks.
As a Software Engineer at Fab Web Studio, I've found these tips transform how I integrate AI features. Experiment with smaller LLMs on your projects—they're efficient and privacy-friendly. Reach out via fabwebstudio.com for custom web solutions!