Core prompt learning techniques

Basic techniques

Prompt engineering involves understanding the fundamental behavior of LLMs to construct prompts effectively. Prompts consist of different components: instructions, primary content, examples, cues, and supporting content (also known as additional context or knowledge). Instructions tell the model what to do; primary content is the main text being processed; examples demonstrate the desired behavior; cues act as a jumpstart for the model’s output; and supporting content provides additional information that influences the output, such as knowledge to consult before answering. By strategically combining these elements, you can design prompts that elicit the desired responses from the model.
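
To make these components concrete, here is a small hypothetical prompt with each component labeled in brackets (the bracketed labels are annotations only and would not appear in a real prompt):

[Instructions] Classify the support ticket below and reply in the ticket's language.
[Supporting content] The company sells home appliances; tickets typically concern orders, refunds, or repairs.
[Example] Ticket: "My dishwasher stopped working" -> Category: Repairs
[Primary content] Ticket: "I'd like my money back for order #1234"
[Cue] Category: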

This section covers basic techniques for mastering the art of prompting.

Zero-shot scenarios

When a task is assigned to a model through a prompt without any specific example of the desired output, it’s called zero-shot prompting. Basic scenarios might include:

  • Proper text completion: for example, writing an email or a medical record

  • Topic extraction: for example, classifying customers’ emails

  • Translation and sentiment analysis: for example, labeling a tweet as positive or negative, or translating users’ reviews into a single language

  • Style impersonation: for example, Shakespeare, Hemingway, or any other famous author or personality whose style the model may have learned during training

The basic theory

Given the huge amount of data that LLMs are trained on, these models can perform an increasing number of “zero-shot” tasks. The model already knows how to act in those cases: it acquired the ability to perform these tasks during its extensive training. However, the more clearly the prompt expresses the task, the more effective and accurate the output will be.

A few examples

A basic example of a zero-shot prompt might look like this:

Extract sentiment from the following text delimited by triple backticks.
```Language models have revolutionized the way we interact with technology, empowering us to
generate creative content, explore new ideas, and enhance our communication. LLMs offer immense
potential for unlocking innovation and improving various aspects of our lives, opening up
exciting possibilities for the future.```

An output would be:

The sentiment of the provided text is positive.

Trying topic extraction instead, the prompt might be:

Determine at most three topics that are being discussed in the following text, delimited by
triple backticks.

```Language models have revolutionized the way we interact with technology, empowering us to
generate creative content, explore new ideas, and enhance our communication. LLMs offer immense
potential for unlocking innovation and improving various aspects of our lives, opening up
exciting possibilities for the future.```

Here’s a possible resulting output:

The three topics discussed in the text are:
1. The impact of language models on technology interaction: Language models have revolutionized
the way we interact with technology, enabling us to generate creative content, explore new
ideas, and enhance communication.
2. The potential of LLMs for innovation: LLMs offer immense potential for unlocking innovation
and improving various aspects of our lives. They open up exciting possibilities for the future.
3. Benefits and improvements in various aspects of life: The text highlights how LLMs can
enhance communication, generate creative content, and explore new ideas, thereby improving
different aspects of our lives.

Iterative refining

Prompt engineering is a matter of refining. Trying to improve the preceding result, you might want to explicitly list the sentiment the model should output and to limit the output to the sentiment only. For example, a slightly improved prompt might look like the following:

Extract sentiment (positive, neutral, negative, unknown) from the following text, delimited by
triple backticks.
```Language models have revolutionized the way we interact with technology, empowering us to
generate creative content, explore new ideas, and enhance our communication. LLMs offer immense
potential for unlocking innovation and improving various aspects of our lives, opening up
exciting possibilities for the future.```
Return only one word indicating the sentiment.

This would result in the following output:

Positive

Likewise, for topic extraction, you might want only one or two words per topic, with topics separated by commas:

Determine at most three topics that are being discussed in the following text, delimited by
triple backticks.
Format the response as a list of topics of at most two words each, separated by commas.
```Language models have revolutionized the way we interact with technology, empowering us to
generate creative content, explore new ideas, and enhance our communication. LLMs offer immense
potential for unlocking innovation and improving various aspects of our lives, opening up
exciting possibilities for the future.```

The result would look like:

Language models, Interaction with technology, LLM potential.

Few-shot scenarios

Zero-shot capabilities are impressive but face important limitations when tackling complex tasks. This is where few-shot prompting comes in handy. Few-shot prompting allows for in-context learning by providing demonstrations within the prompt to guide the model’s performance.

A few-shot prompt consists of several examples, or shots, which condition the model to generate responses in subsequent instances. While a single example may suffice for basic tasks, more challenging scenarios call for increasing numbers of demonstrations.

When using the Chat Completion API, few-shot learning examples can be included in the system message or, more often, in the messages array as user/assistant interactions following the initial system message.
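
As a minimal sketch, assuming the OpenAI Python SDK (v1) and the tweet-sentiment task shown later in this section, a few-shot call might look like the following; the model name and the examples are illustrative:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # keep the classification deterministic
    messages=[
        # Initial system message describing the task
        {"role": "system",
         "content": "You extract sentiment (positive/negative) from tweets."},
        # Few-shot examples as user/assistant pairs
        {"role": "user", "content": 'Tweet: "I hate it when I have no wifi"'},
        {"role": "assistant", "content": "Negative"},
        {"role": "user", "content": 'Tweet: "Loved that movie"'},
        {"role": "assistant", "content": "Positive"},
        # The new input to classify
        {"role": "user", "content": 'Tweet: "Great car!!!"'},
    ],
)
print(response.choices[0].message.content)  # e.g., Positive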

The basic theory

The concept of few-shot (or in-context) learning emerged as an alternative to fine-tuning models on task-specific datasets. Fine-tuning requires the availability of a base model; OpenAI’s available base models are GPT-3.5-turbo, davinci, curie, babbage, and ada, but not the latest GPT-4 and GPT-4-turbo models. Fine-tuning also requires a lot of well-formatted and validated data. Few-shot learning, which developed as LLM sizes grew significantly, offers advantages over fine-tuning: it reduces the amount of data required and mitigates the risk of overfitting that is typical of any machine learning solution.

This approach focuses on priming the model for inference within specific conversations or contexts. It has demonstrated competitive performance compared to fine-tuned models in tasks like translation, question answering, word unscrambling, and sentence construction. However, the inner workings of in-context learning and the contributions of different aspects of shots to task performance remain less understood.

Recent research has shown that ground truth demonstrations are not essential: randomly replacing correct labels has minimal impact on classification and multiple-choice tasks. Instead, other aspects of the demonstrations, such as the label space, the input text distribution, and the sequence format, play crucial roles in driving performance. For instance, the following two prompts for sentiment analysis (the first with correct labels, the second with completely wrong labels) offer similar performance:

Tweet: "I hate it when I have no wifi"
Sentiment: Negative
Tweet: "Loved that movie"
Sentiment: Positive
Tweet: "Great car!!!"
Sentiment: Positive

Tweet: {new tweet}
Sentiment:

And:

Tweet: "I hate it when I have no wifi"
Sentiment: Positive
Tweet: "Loved that movie"
Sentiment: Negative
Tweet: "Great car!!!"
Sentiment: Negative

Tweet: {new tweet}
Sentiment:

In-context learning may struggle with tasks whose input-label correspondence was not captured during training. This suggests that the intrinsic ability to perform a task is acquired during training, with demonstrations (or shots) primarily serving as a task locator.

A few examples

One of the most famous examples of the effectiveness of few-shot prompting comes from a paper by Brown et al. (2020), where the task is to correctly use a new word in a sentence:

A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses
the word whatpu is: We were traveling in Africa and we saw these very cute whatpus. To do a
"farduddle" means to jump up and down really fast. An example of a sentence that uses the word
farduddle is:

The model would correctly output something like:

We were so excited that we couldn't help but farduddle when our favorite team won the
championship.

A very good use case for few-shot learning is writing something in a given style, such as code documentation or social media posts. Whenever there is a pattern or a format, and explaining it is more complex than showing it, it’s worth trying few-shot prompting. For instance, the following prompt would generate consistent documentation:

/// <summary>
/// Returns detailed information about a specific event
/// </summary>
/// <param name="org">Name of the organization</param>
/// <param name="year">Year of the event</param>
/// <param name="tournamentId">Internal ID of the event</param>
/// <returns>Detailed info about the event</returns>
/// <remarks>Sample events to try are: 2892, 2331, 1028. Try also 8392.
/// If any of the parameters are invalid, the endpoint returns 404.</remarks>
public IActionResult Tournament(string org, int? year, string eventId)

public IActionResult Team(string org, string teamId)

Depending on the model version and the API (Completion or Chat Completion), the system might tend to explain the code and the documentation instead of producing new documentation for the presented method (IActionResult Team). That’s why a better and more precise prompt would be the following, which includes specific instructions:

Your task is to generate documentation in a consistent style.
/// <summary>
/// Returns detailed information about a specific event
/// </summary>
/// <param name="org">Name of the organization</param>
/// <param name="year">Year of the event</param>
/// <param name="tournamentId">Internal ID of the event</param>
/// <returns>Detailed info about the event</returns>
/// <remarks>Sample events to try are: 2892, 2331, 1028. Try also 8392.
/// If any of the parameters are invalid, the endpoint returns 404.</remarks>
public IActionResult Tournament(string org, int? year, string eventId)

public IActionResult Team(string org, string teamId)

Iterative refining

The last prompt is likely to generate a response similar to the following:

/// <summary>
/// Returns detailed information about a specific event.
/// </summary>
/// <param name="org">The name of the organization.</param>
/// <param name="year">The year of the event.</param>
/// <param name="tournamentId">The internal ID of the event.</param>
/// <returns>Detailed information about the event.</returns>
/// <remarks>
/// Sample event IDs to try are: 2892, 2331, 1028. You can also try 8392.
/// If any of the parameters are invalid, the endpoint returns a 404 error.
/// </remarks>
public IActionResult Event(string org, int? year, string eventId) {
// Implementation code here
}
/// <summary>
/// Retrieves information about a specific team.
/// </summary>
/// <param name="org">The name of the organization.</param>
/// <param name="teamId">The ID of the team.</param>
/// <returns>Information about the team.</returns>
public IActionResult Team(string org, string teamId) {
// Implementation code here
}
The Event method is an action that returns detailed information about a specific event. It takes
three parameters: org (representing the name of the organization), year (representing the year
of the event, nullable), and eventId (representing the internal ID of the event). This method
should be called to retrieve detailed information about an event.
The Team method is another action that retrieves information about a specific team. It takes two
parameters: org (representing the name of the organization) and teamId (representing the ID of
the team). This method should be called to fetch information about a team.

As is easily observable from the previous examples, all these generative models want to write text. This is reasonable because they were trained to predict the next token. It is very likely, then, that a good prompt needs to incorporate something to restrict the output.

Following up with the last example, a better prompt might be:

Your task is to generate code comments in a style consistent with the reference. The response
must include only the comments for the new method, without any additional explanation or
comment.

Style reference:

/// <summary>
/// Returns detailed information about a specific event
/// </summary>
/// <param name="org">Name of the organization</param>
/// <param name="year">Year of the event</param>
/// <param name="tournamentId">Internal ID of the event</param>
/// <returns>Detailed info about the event</returns>
/// <remarks>Sample events to try are: 2892, 2331, 1028. Try also 8392.
/// If any of the parameters are invalid, the endpoint returns 404.</remarks>
public IActionResult Tournament(string org, int? year, string eventId)

New method:

public IActionResult Team(string org, string teamId)

This might still output some additional explanation:

/// <summary>
/// Returns detailed information about a specific team.
/// </summary>
/// <param name="org">Name of the organization</param>
/// <param name="teamId">ID of the team</param>
/// <returns>Detailed info about the team</returns>

Please note that this is the code comment for the new Team method. It retrieves detailed
information about a specific team. The method takes two parameters: org, which represents the
name of the organization, and teamId, which is the ID of the team. The method returns detailed
information about the team.

At this point, to prevent the model from producing additional text, you might need a different strategy: asking the model to check if certain conditions on the output are satisfied. This can be done by appending this line to the former prompt:

Check if the output contains additional text and, if so, remove it.

Asking the model to check whether certain conditions in the input are satisfied is a very useful technique. It can also be exploited for more standard tasks, such as form or JSON/XML/HTML validation.

In this case, you also tried to validate the output text. This is more of a trick than a technique, because the model doesn’t really produce the full output and then validate it. Still, it works as a guardrail. A better way to achieve the same result would have been to add one more API call with the former prompt or, as explored later in this book, to involve a framework like Microsoft Guidance or Guardrails AI.
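
A minimal sketch of the two-call approach, again assuming the OpenAI Python SDK; the prompts and the helper name are illustrative:

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

documentation_prompt = "..."  # the few-shot prompt shown above

# First call: generate the comments with the few-shot prompt
draft = ask(documentation_prompt)

# Second call: a guardrail that strips anything that is not a /// comment
cleaned = ask(
    "Remove any text that is not a /// comment from the following output, "
    "and return only the comments:\n" + draft
)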

Considering this, it’s important to stress that these models work better when they are told what they need to do instead of what they must avoid.

Chain-of-thought scenarios

While standard few-shot prompting is effective for many tasks, it is not without limitations—particularly when it comes to more intricate reasoning tasks, such as mathematical and logical problems, as well as tasks that require the execution of multiple sequential steps.

When few-shot prompting proves insufficient, it may indicate the need for fine-tuning models (if these are an option, which they aren’t for GPT-4 and GPT-4-turbo) or for exploring advanced prompting techniques. One such technique is chain-of-thought (CoT) prompting. You use CoT prompting to track all the steps (thoughts) the model performs to reach the solution.

As presented in the work of Wei et al. (2022), this technique gives the model time to think, enhancing reasoning abilities by incorporating intermediate reasoning steps. When used in conjunction with few-shot prompting, it leads to improved performance on intricate tasks that demand prior reasoning for accurate responses.

The basic theory

Anyone young enough to remember their days as a student will know that during exams, the brain stops functioning. Most of the time, one tries to answer the professor’s question, almost guessing the solution, without really thinking about it. LLMs do the same thing. Sometimes they continue the sentence without thinking; they just make a guess.

The basic idea behind CoT is to instruct the model to take time to think, allowing it to reconstruct the reasoning steps during response generation. This simple idea significantly enhances the model’s ability to perform complex tasks. This makes sense when you consider how language models produce responses. They attempt to predict the next token, so the more intermediate “reasoning” tokens they generate, the more the model needs to establish coherence among them and, ideally, provide a correct response.

A few examples

Following the professor-student example, the first two attempts to improve the model’s output might be the classical “make sure the answer is correct” and “let’s think step by step” approaches. For instance, consider the following easy problem:

I bought 20 pens, gave 12 pens to my son, 3 to my daughter, 1 to my mom. I then bought 4 more
pens and lost 1.
Output the number of pens I have now.

Regarding logical problems, temperature and model version definitely matter. A lower temperature keeps the output more deterministic, while the latest models tend to be more accurate, having been trained on larger datasets that include logical problems.

With a temperature of 0.2, a GPT-3.5-turbo model will provide an answer like the following, whereas a GPT-4 model will probably say, “You currently have 7 pens”:

I have 8 pens now.
Explanation: Initially, I had 20 pens. I gave away 12 + 3 + 1 = 16 pens. So I had 20 - 16 = 4
pens left. Then I bought 4 more pens, so I had 4 + 4 = 8 pens. But I lost 1 pen, so I have
8 - 1 = 7 pens. Therefore, the output is 8 pens.

The answer is wrong. Worse, the answer is incoherent. It says 8, while it correctly calculates 7.

Let’s try the following prompt to see if it provides more reasonable output:

I bought 20 pens, gave 12 pens to my son, 3 to my daughter, 1 to my mom. I then bought 4 more
pens and lost 1.
Output the number of pens I have now. Make sure the answer and the explanation are correct and
coherent before answering.

Here, the output is as follows:

After buying 20 pens, I had 20 pens. After giving 12 pens to my son, I had 8 pens. After giving
3 pens to my daughter, I had 5 pens. After giving 1 pen to my mom, I had 4 pens. After buying 4
more pens, I had 8 pens. After losing 1 pen, I had 7 pens.
Therefore, I have 7 pens now.

That’s correct.

An equivalent prompt might have been:

I bought 20 pens, gave 12 pens to my son, 3 to my daughter, 1 to my mom. I then bought 4 more
pens and lost 1.
Output the number of pens I have now. Let's think step by step.

And the output:

1. Start with 20 pens.
2. Gave 12 pens to my son, so I have 20 - 12 = 8 pens left.
3. Gave 3 pens to my daughter, so I have 8 - 3 = 5 pens left.
4. Gave 1 pen to my mom, so I have 5 - 1 = 4 pens left.
5. Bought 4 more pens, so I have 4 + 4 = 8 pens now.
6. Lost 1 pen, so I have 8 - 1 = 7 pens now.
Therefore, I have 7 pens now.

This is called a zero-shot chain-of-thought prompt; the magic sentence, “Let’s think step by step,” forces the model to take time to think and to write down all the steps.

At this point, to get the final answer, you could ask the model to produce a structured output or make one more API call with a simple prompt like, “Extract only the final answer from this text”:

I bought 20 pens, gave 12 pens to my son, 3 to my daughter, 1 to my mom. I then bought 4 more
pens and lost 1.
Output the number of pens I have now. Let's think step by step. Output a JSON with:
explanation (string) and result (int).

The result would look like:

{"explanation":"Initially, I had 20 pens. After giving 12 to my son, I had 8 left. Then, I gave
3 to my daughter, leaving me with 5. Giving 1 to my mom left me with 4 pens. Buying 4 more pens
gave me a total of 8 pens. Unfortunately, I lost 1 pen, leaving me with a final total of 7
pens.","result":7}

Possible extensions

Combining the few-shot technique with the chain-of-thought approach can give the model some examples of step-by-step reasoning to emulate. This is called few-shot chain-of-thought. For instance:

Which is the more convenient way to reach the destination, balancing costs and time?
Option 1: Take a 20-minute walk, then a 15-minute bus ride (2 dollars), and finally a 5-minute
taxi ride (15 dollars).
Option 2: Take a 30-minute bike ride, then a 10-minute subway ride (2 dollars), and finally a
5-minute walk.

Option 1 will take 20 + 15 + 5 = 40 minutes. Option 1 will cost 17 dollars.
Option 2 will take 30 + 10 + 5 = 45 minutes. Option 2 will cost 2 dollars.
Since Option 1 takes 40 minutes and Option 2 takes 45 minutes, Option 1 is quicker, but Option 2
is cheaper by far. Option 2 is better.

Which is the better way to get to the office?
Option 1: 40-minute train ride (5 dollars), then a 15-minute walk
Option 2: 10-minute taxi ride (15 dollars), then a 10-minute subway ride (2 dollars), and a
2-minute walk

An extension of this basic prompting technique is Auto-CoT. It leverages the few-shot CoT approach, using a prompt to generate more samples (shots) of reasoning, which are then concatenated into a final prompt. Essentially, the idea is to auto-generate a few-shot CoT prompt.
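
A minimal sketch of the idea, assuming the OpenAI Python SDK: zero-shot CoT generates a reasoning chain for each sample question, and the chains are concatenated into a final few-shot CoT prompt (the sample questions are placeholders):

from openai import OpenAI

client = OpenAI()

def zero_shot_cot(question: str) -> str:
    # Generate a reasoning chain with the zero-shot CoT trigger sentence
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": question + "\nLet's think step by step."}],
    )
    return reply.choices[0].message.content

sample_questions = ["...", "..."]  # questions similar to the target one

# Concatenate the auto-generated demonstrations into a few-shot CoT prompt
shots = "\n\n".join(f"Q: {q}\nA: {zero_shot_cot(q)}" for q in sample_questions)
final_prompt = shots + "\n\nQ: {target question}\nA:"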

Beyond chain-of-thought prompting, there is one more sophisticated idea: tree of thoughts. This technique can be implemented in essentially two ways. The first is through a single prompt, like the following:

Consider a scenario where three experts approach this question.
Each expert will contribute one step of their thought process and share it with the group.
Subsequently, all experts will proceed to the next step.
If any expert realizes they have made a mistake at any stage, they will exit the process.
The question is the following: {question}

A more sophisticated approach to tree of thoughts requires writing some more code: different prompts run (possibly at different temperatures) and produce distinct reasoning paths. These paths are then evaluated by another model instance with a scoring/voting prompt, which excludes the wrong ones. At the end, some mechanism votes (by coherence or majority) for the correct answer.
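
The following is a minimal sketch of the majority-vote variant just described, assuming the OpenAI Python SDK; the prompts, model name, and parameters are illustrative:

from collections import Counter
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, temperature: float = 0.0) -> str:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

def solve_by_voting(question: str, n_paths: int = 5) -> str:
    # Sample diverse reasoning paths at a higher temperature
    paths = [ask(question + "\nLet's think step by step.", temperature=0.8)
             for _ in range(n_paths)]
    # A second model instance reduces each path to its final answer
    answers = [ask("Extract only the final answer from this text:\n" + p)
               for p in paths]
    # Majority vote picks the most frequent answer
    return Counter(a.strip() for a in answers).most_common(1)[0][0]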

A few more emerging but relatively easy-to-implement prompting techniques are analogical prompting (from Google DeepMind), which asks the model to recall a similar problem before solving the current one, and step-back prompting, which prompts the model to step back from the specific instance and contemplate the general principle at hand.
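
For instance, a step-back prompt for the pens problem above might look like the following (the wording is illustrative):

First, step back and state the general principle behind this kind of problem (for example,
tracking a running total through additions and subtractions). Then apply that principle, step
by step, to answer the question below.
{question}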