Core prompt learning techniques

Prompt learning techniques play a crucial role in so-called “conversational programming,” the new paradigm of AI and software development that is now taking off. These techniques involve the strategic design of prompts, which are then used to draw out desired responses from large language models (LLMs).

Prompt engineering is the creative sum of all these techniques. It provides developers with the tools to guide, customize, and optimize the behavior of language models in conversational programming scenarios. Resulting prompts are in fact instrumental in guiding and tailoring responses to business needs, improving language understanding, and managing context.

Prompts are not magic, though. Quite the reverse. Getting them right is more a matter of trial and error than pure wizardry. Hence, at some point, you may end up with prompts that only partially address very specific domain requests. This is where the need for fine-tuning emerges.

What is prompt engineering?

As a developer, you use prompts as instructional input for the LLM. Prompts convey your intent and guide the model toward generating appropriate and contextually relevant responses that fulfill specific business needs. Prompts act as cues that inform the model about the desired outcome, the context in which it should operate, and the type of response expected. More technically, the prompt is the point from which the LLM begins to predict and then output new tokens.

Prompts at a glance

Let’s try some prompts with a particular LLM—specifically, GPT-3.5-turbo. Be aware, though, that LLMs are not deterministic tools, meaning that the response they give for the same input may be different every time.

A very basic prompt

The hello-world of prompt engineering—easily testable online on Bing Chat, ChatGPT, or something similar—can be as simple as what’s shown here:

During the week I

This prompt might result in something like the following output:

During the week, I typically follow a structured routine.

Overall, the answer makes sense: The model tries to provide a continuation of the string, given the understood context.

Let’s try something a bit more specific:

Complete the following sentence, as if you were Shakespeare.
During the week I

The subsequent output might be similar to:

During the week, I doth engage in myriad tasks and endeavors, as the sun traverseth the sky with
unwavering pace.

So far so good.

A more complex prompt

One relatively complex prompt might be the following:

'Unacceptable risk AI systems are systems considered a threat to people and will be banned. They
include:
-Cognitive behavioral manipulation of people or specific vulnerable groups: for example,
voice-activated toys that encourage dangerous behavior in children
-Social scoring: classifying people based on behavior, socio-economic status, or personal
characteristics
-Real-time and remote biometric identification systems, such as facial recognition
Some exceptions may be allowed: For instance, "post" remote biometric identification systems
where identification occurs after a significant delay will be allowed to prosecute serious
crimes but only after court approval.'
Given the above, extract only the forbidden AI applications and output them as json.

The model might now output the following JSON string:

{
  "Forbidden AI Applications":[
    {
      "Application":"Cognitive behavioral manipulation of people or specific vulnerable groups",
      "Example": "Voice-activated toys that encourage dangerous behavior in children"
    },
    {
      "Application":"Social scoring",
      "Example":"Classifying on behavior, socio-economic status or personal characteristics"
    },
    {
      "Application":"Real-time and remote biometric identification systems",
      "Example":"Facial recognition"
    }
  ]
}

Encouraged by these first experiments, let’s try to outline some general rules for prompts.

General rules for prompts

A prompt can include context, instructions, input data, and optionally the structure of the desired output (also in the form of explicit examples). Depending on the task, you might need all four pieces or only a couple of them—most likely, instructions and input data.
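
For instance, a made-up prompt touching all four pieces might read as follows, where the four lines provide, in order, context, instructions, input data, and the desired output structure:

You are a customer-support agent for an online bookstore.
Classify the ticket below as "shipping", "billing", or "other".
Ticket: "I was charged twice for my last order."
Answer with a single lowercase word.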

Designing a prompt is an iterative process. Not surprisingly, the first reply you get from a model might be quite unreasonable. Don’t give up; just try again, but be more precise in what you provide, whether it’s plain instructions, input data, or context.

Two key points for a good prompt are specificity and descriptiveness.

  • Specificity means designing prompts to leave as little room for interpretation as possible. By providing explicit instructions and restricting the operational space, developers can guide the language model to generate more accurate and desired outputs.

  • Descriptiveness plays a significant role in effective prompt engineering. By using analogies and vivid descriptions, developers can provide clear instructions to the model. Analogies serve as valuable tools for conveying complex tasks and concepts, enabling the model to grasp the desired output with improved context and understanding.

General tips for prompting

A more technical tip is to use delimiters to clearly indicate distinct parts of the prompt. This helps the model focus on the relevant parts of the prompt. Usually, backticks or backslashes work well. For instance:

Extract sentiment from the following text delimited by triple backticks: '''Great choice!'''

When the first attempt fails, two simple design strategies might help:

  • Doubling down on instructions is useful to reinforce clarity and consistency in the model’s responses. Repetition techniques, such as providing instructions both before and after the primary content or using instruction-cue combinations, strengthen the model’s understanding of the task at hand (an example follows this list).

  • Changing the order of the information presented to the model. The order of information presented to the language model is significant. Whether instructions precede the content (summarize the following) or follow it (summarize the preceding) can lead to different results. Additionally, the order of few-shot examples (which will be covered shortly) can also introduce variations in the model’s behavior. This concept is known as recency bias.
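
As an illustration of the repetition technique, a made-up prompt might restate the core instruction on both sides of the content:

Summarize the following text in one sentence.
<<<Prompt engineering provides developers with the tools to guide, customize, and optimize the
behavior of language models in conversational programming scenarios.>>>
Remember: the summary must be exactly one sentence.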

One last thing to consider is an exit strategy for the model in case it fails to respond adequately. The prompt should instruct the model with an alternative path—in other words, an out. For instance, when asking a question about some documents, including a directive such as "write 'not found' if you can't find the answer within the document" or "check if the conditions are satisfied before answering" allows the model to gracefully handle situations in which the desired information is unavailable. This helps to avoid the generation of false or inaccurate responses.
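
A made-up example of such a prompt might be:

Answer the question using only the document delimited by triple backticks.
If the answer is not in the document, write "not found".
'''<document text goes here>'''
Question: What is the refund policy?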

Alternative ways to alter output

When aiming to align the output of an LLM more closely with the desired outcome, there are several options to consider. One approach involves modifying the prompt itself, following best practices and iteratively improving results. Another involves working with inner parameters (also called hyperparameters) of the model.

Beyond the purely prompt-based conversational approach, there are a few screws to tighten—comparable to the old-but-gold hyperparameters in the classic machine learning approach. These include the number of tokens, temperature, top_p (or nucleus) sampling, frequency penalties, presence penalties, and stop sequences.

Temperature versus top_p

Temperature (T) is a parameter that influences the level of creativity (or “randomness”) in the text generated by an LLM. The usual range of acceptable values is 0 to 2, but it depends on the specific model. When the temperature value is high (say, 0.8), the output becomes more diverse and imaginative. Conversely, a lower temperature (say, 0.1) makes the output more focused and deterministic.

Temperature affects the probability distribution of potential tokens at each step of the generation process. In practice, when choosing the next token, a model with a temperature of 0 will always choose the most probable one, while a model with a higher temperature will choose a token more or less randomly. A temperature of 0, therefore, would make the model entirely deterministic.
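
As a purely illustrative sketch (the numbers are invented, and a real model works over its whole vocabulary), temperature rescales the model's scores before they are turned into probabilities:

using System;
using System.Collections.Generic;
using System.Linq;

// Invented scores (logits) for three candidate next tokens.
var logits = new Dictionary<string, double> { { "routine", 2.0 }, { "usually", 1.2 }, { "fly", 0.1 } };
double temperature = 0.8;   // try 0.2 and 1.8 to see the distribution sharpen or flatten

// Softmax with temperature: probability = exp(logit / T) / sum. As T approaches 0, the most
// probable token dominates (greedy, deterministic); as T grows, the tokens even out.
var weights = logits.ToDictionary(kv => kv.Key, kv => Math.Exp(kv.Value / temperature));
double total = weights.Values.Sum();
foreach (var kv in weights)
    Console.WriteLine($"{kv.Key}: {kv.Value / total:P1}");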

An alternative technique called top_p sampling (or nucleus sampling) is also useful for altering the default behavior of the LLM when generating the next token. With top_p sampling, instead of considering all possible tokens, the LLM focuses only on a subset of tokens (known as the nucleus) whose cumulative probability mass adds up to a specified threshold called top_p.

With top_p, the range of acceptable values is from 0 to 1. If, for instance, the top_p value is set to 0.3, the language model will consider only tokens that contribute to the top 30% of the probability mass for the next token. This means lower top_p values limit the vocabulary of the model.
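
A rough sketch of the selection step, again with invented probabilities, might look like this:

using System;
using System.Collections.Generic;
using System.Linq;

// Invented probabilities for candidate next tokens.
var probabilities = new Dictionary<string, double>
{
    { "France", 0.62 }, { "Europe", 0.21 }, { "the", 0.09 }, { "a", 0.05 }, { "Italy", 0.03 }
};
double topP = 0.3;

// Keep the most probable tokens until their cumulative probability reaches top_p.
var nucleus = new List<string>();
double cumulative = 0;
foreach (var kv in probabilities.OrderByDescending(kv => kv.Value))
{
    nucleus.Add(kv.Key);
    cumulative += kv.Value;
    if (cumulative >= topP) break;
}
Console.WriteLine(string.Join(", ", nucleus));   // with top_p = 0.3, only "France" survives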

Both temperature and top_p sampling serve as powerful tools for controlling an LLM’s behavior, enabling different levels of creativity and control. They can be used independently or in combination, although it is generally recommended to alter only one at a time.

Adjusting these parameters makes them versatile for a wide range of applications. For instance, for a creative-writing task, a temperature of 1.3 (while also playing around with top_p) could be good, while for a code-generation task, 0.4 (again, while also experimenting with top_p) would be better.

Consider this prompt:

Rephrase the following text:
<<<When aiming to align the output of a large language model (LLM) more closely with the desired
outcome, there are several options to consider. One approach involves modifying the prompt
itself, while another involves working with hyperparameters of the model>>>

With a temperature of 2, the prompt returns something like the following (which is grammatically incorrect):

When attempting to have a large language model (LLM) output that reflects the desired outcome,
there are numerous alternatives to ponder. adapting the prompt itself is called one approach
that credits sustainability. Grounding on hyperparameters programs appertains settings amidst
the indicators simultaneously participates confidentiality impact later goals aren abbreviated
instructions. Otherwise instructions brings alternative ramification which doesn omit lengthy
resolution reaching goals which aren susceptible to evaluations constantly... likelihood
advancement adjustment exemplark.

With a temperature of 0, it returns the following:

To achieve better alignment between the output of a language model (LLM) and the desired
outcome, there are various options to explore. One option is to adjust the prompt, while another
involves tweaking the model's hyperparameters.

In the middle, with a temperature of 0.8, the result looks like this:

To achieve a greater degree of alignment between the output of a language model (LLM) and the
intended result, there are various approaches to consider. One tactic involves adjusting the
prompt directly, whereas an alternative involves manipulating the model's hyperparameters.

Frequency and presence penalties

Another set of parameters comprises the frequency and presence penalties. These add a penalty when calculating the probability of the next token. This results in a recalculation of each probability, which ultimately affects which token is chosen.

The frequency penalty is applied to tokens that have already been mentioned in the preceding text (including the prompt). It is scaled based on the number of times the token has appeared. For example, a token that has appeared five times receives a higher penalty than a token that has appeared only once, further reducing its likelihood of appearing again. The presence penalty, on the other hand, applies a penalty to tokens regardless of their frequency. Once a token has appeared at least once before, it will be subject to the penalty. The range of acceptable values for both is from –2 to 2.
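
Conceptually, following the formula published in OpenAI's API documentation, each candidate token's logit is adjusted before sampling roughly as in this sketch (the numbers are invented):

using System;

// adjusted = logit - count * frequencyPenalty - (count > 0 ? 1 : 0) * presencePenalty
double AdjustLogit(double logit, int countSoFar, double frequencyPenalty, double presencePenalty) =>
    logit - countSoFar * frequencyPenalty - (countSoFar > 0 ? 1 : 0) * presencePenalty;

// A token already generated five times is penalized far more than one generated once.
Console.WriteLine(AdjustLogit(2.0, 5, 0.5, 0.5));   // 2.0 - 2.5 - 0.5 = -1.0
Console.WriteLine(AdjustLogit(2.0, 1, 0.5, 0.5));   // 2.0 - 0.5 - 0.5 =  1.0
Console.WriteLine(AdjustLogit(2.0, 0, 0.5, 0.5));   // no penalty: 2.0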

These parameter settings are valuable for eliminating (or promoting, in the case of negative values) repetitive elements from generated outputs. For instance, consider this prompt:

Rephrase the following text:
<<<When aiming to align the output of a large language model (LLM) more closely with the desired
outcome, there are several options to consider. One approach involves modifying the prompt
itself, while another involves working with hyperparameters of the model>>>

With a frequency penalty of 2, it returns something like:

To enhance the accuracy of a large language model's (LLM) output to meet the desired result,
there are various strategies to explore. One method involves adjusting the prompt itself,
whereas another entails manipulating the model's hyperparameters.

While with a frequency penalty of 0, it returns something like:

There are various options to consider when attempting to better align the output of a language
model (LLM) with the desired outcome. One option is to modify the prompt, while another is to
adjust the model's hyperparameters.

Max tokens and stop sequences

The max tokens parameter specifies the maximum number of tokens that can be generated by the model, while the stop sequence parameter instructs the language model to halt the generation of further content. Stop sequences are in fact an additional mechanism for controlling the length of the model’s output.

Consider the following prompt:

Paris is the capital of

The model will likely generate France. If a full stop (.) is designated as the stop sequence, the model will cease generating text when it reaches the end of the first sentence, regardless of the specified token limit.

A more complex example can be built with a few-shot approach, which wraps each sentiment in a pair of angled brackets (<< >>). Consider the following prompt:

Extract sentiment from the following tweets:
Tweet: I love this match!
Sentiment: <<positive>>
Tweet: Not sure I completely agree with you
Sentiment: <<neutral>>
Tweet: Amazing movie!!!
Sentiment:

Including the angled brackets instructs the model to stop generating tokens after extracting the sentiment.

By using stop sequences strategically within prompts, developers can ensure that the model generates text up to a specific point, preventing it from producing unnecessary or undesired information. This technique proves particularly useful in scenarios where precise and limited-length responses are desired, such as when generating short summaries or single-sentence outputs.
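
Jumping ahead to the .NET SDK set up later in this chapter, the same effect could presumably be obtained by configuring the options along these lines, where fewShotPrompt holds the few-shot prompt shown above and AOAI_chat_DEPLOYMENTID is defined in the setup section:

var options = new ChatCompletionsOptions
{
    DeploymentName = AOAI_chat_DEPLOYMENTID,
    MaxTokens = 10,
    Temperature = 0f,
    StopSequences = { ">>" }   // stop right after the closing angled brackets
};
options.Messages.Add(new ChatRequestUserMessage(fewShotPrompt));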

Setting up for code execution

Now that you’ve learned the basic theoretical background of prompting, let’s bridge the gap between theory and practical implementation. This section transitions from discussing the intricacies of prompt engineering to the hands-on aspect of writing code. By translating insights into executable instructions, you’ll explore the tangible outcomes of prompt manipulation.

In this section, you’ll focus on OpenAI models, like GPT-4, GPT-3.5-turbo, and their predecessors. (Other chapters might use different models.) For these examples, .NET and C# will be used mainly, but Python will also be used at some point.

Getting access to OpenAI APIs

To access OpenAI APIs, there are multiple options available. You can leverage the REST APIs from OpenAI or Azure OpenAI, the Azure OpenAI .NET or Python SDK, or the OpenAI Python package.

In general, the Azure OpenAI Service enables Azure customers to use those advanced language AI models, while still benefiting from the security and enterprise features offered by Microsoft Azure, such as private networking, regional availability, and responsible AI content filtering.

At first, directly accessing OpenAI could be the easiest choice. However, when it comes to enterprise implementations, Azure OpenAI is the more suitable option due to its alignment with the Azure platform and its enterprise-grade features.

To get started with Azure OpenAI, your Azure subscription must include access to Azure OpenAI, and you must set up an Azure OpenAI Service resource with a deployed model.

If you choose to use OpenAI directly, you can create an API key on the developer site (https://platform.openai.com/).

In terms of technical differences, OpenAI uses the model keyword argument to specify the desired model, whereas Azure OpenAI employs the deployment_id keyword argument to identify the specific model deployment to use.

Chat Completion API versus Completion API

OpenAI APIs offer two different approaches for generating responses from language models: the Chat Completion API and the Completion API. Both are available in two modes: a standard form, which returns the complete output once ready, and a streaming version, which streams the response token by token.

The Chat Completion API is designed for chat-like interactions, where message history is concatenated with the latest user message in JSON format, allowing for controlled completions. In contrast, the Completion API provides completions for a single prompt and takes a single string as input.

The back-end models used for the two APIs differ:

  • The Chat Completion API supports GPT-4-turbo, GPT-4, GPT-4-0314, GPT-4-32k, GPT-4-32k-0314, GPT-3.5-turbo, and GPT-3.5-turbo-0301.

  • The Completion API includes older (but still good for some use cases) models, such as text-davinci-003, text-davinci-002, text-curie-001, text-babbage-001, and text-ada-001.

One advantage of the Chat Completion API is the role selection feature, which enables users to assign roles to different entities in the conversation, such as user, assistant, and, most importantly, system. The first system message provides the model with the main context and instructions “set in stone.” This helps in maintaining consistent context throughout the interaction. Moreover, the system message helps set the behavior of the assistant. For example, you can modify the personality or tone of the assistant or give specific instructions on how it should respond. Additionally, the Chat Completion API allows for longer conversational context to be appended, enabling a more dynamic conversation flow. In contrast, the Completion API does not include the role selection or conversation formatting features. It takes a single prompt as input and generates a response accordingly.
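
In JSON terms, the message history sent to the Chat Completion API might look like the following (an invented exchange):

[
  {"role": "system", "content": "You are a polite assistant that always answers in one sentence."},
  {"role": "user", "content": "What is the capital of France?"},
  {"role": "assistant", "content": "The capital of France is Paris."},
  {"role": "user", "content": "And of Italy?"}
]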

Both APIs provide finish_reasons in the response to indicate the completion status. Possible finish_reasons values include stop (complete message or a message terminated by a stop sequence), length (incomplete output due to token limits), function_call (model calling a function), content_filter (omitted content due to content filters), and null (response still in progress).
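
With the .NET SDK used later in this chapter, the finish reason is exposed on each choice of the returned object; a check might look like this sketch, where completions is the ChatCompletions object retrieved from the client:

if (completions.Choices[0].FinishReason == CompletionsFinishReason.TokenLimitReached)
{
    // The output was truncated: consider raising MaxTokens or shortening the prompt.
}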

Although OpenAI recommends the Chat Completion API for most use cases, the raw Completion API sometimes offers more potential for creative structuring of requests, allowing users to construct their own JSON format or other formats. The JSON output can be forced in the Chat Completion API by using the JSON mode with the response_format parameter set to json_object.
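
At the REST level, JSON mode is enabled by adding the response_format parameter to the Chat Completion request body, roughly as in this illustrative fragment (note that, at the time of writing, the API also expects the word "JSON" to appear somewhere in the messages):

{
  "response_format": { "type": "json_object" },
  "messages": [
    {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
    {"role": "user", "content": "List the forbidden AI applications above as JSON."}
  ]
}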

To summarize, the Chat Completion API is a higher-level API that generates an internal prompt, calls some lower-level API, and is suited for chat-like interactions with role selection and conversation formatting. In contrast, the Completion API is focused on generating completions for individual prompts.

It’s worth mentioning that the two APIs are to some extent interchangeable. That is, a user can force the format of a Chat Completion response to reflect the format of a Completion response by constructing a request using a single user message. For instance, one can translate from English to Italian with the following Completion prompt:

Translate the following English text to Italian: "{input}"

An equivalent Chat Completion prompt would be:

[{"role": "user", "content": 'Translate the following English text to Italian: "{input}"'}]

Similarly, a user can use the Completion API to mimic a conversation between a user and an assistant by appropriately formatting the input.
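
For example, a single Completion prompt emulating a short exchange might be formatted like this (the layout is purely conventional):

The following is a conversation between a user and a helpful assistant.
User: Translate "good morning" to Italian.
Assistant: Buongiorno.
User: And "good evening"?
Assistant: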

Setting things up in C#

You can now set things up to use the Azure OpenAI API in Visual Studio Code through interactive .NET notebooks, which you will find in the source code that comes with this book. The model used is GPT-3.5-turbo. You set up the necessary NuGet package—in this case, Azure.AI.OpenAI—with the following line:

#r "nuget: Azure.AI.OpenAI, 1.0.0-beta.12"

Then, moving on with the C# code:

using System;
using Azure.AI.OpenAI;
var AOAI_ENDPOINT = Environment.GetEnvironmentVariable("AOAI_ENDPOINT");
var AOAI_KEY = Environment.GetEnvironmentVariable("AOAI_KEY");
var AOAI_DEPLOYMENTID = Environment.GetEnvironmentVariable("AOAI_DEPLOYMENTID");
var AOAI_chat_DEPLOYMENTID = Environment.GetEnvironmentVariable("AOAI_chat_DEPLOYMENTID");
var endpoint = new Uri(AOAI_ENDPOINT);
var credentials = new Azure.AzureKeyCredential(AOAI_KEY);
var openAIClient = new OpenAIClient(endpoint, credentials);
var completionOptions = new ChatCompletionsOptions
{
    DeploymentName=AOAI_DEPLOYMENTID,
    MaxTokens=500,
    Temperature=0.7f,
    FrequencyPenalty=0f,
    PresencePenalty=0f,
    NucleusSamplingFactor=1,
    StopSequences={}
};

var prompt =
    @"rephrase the following text: <<<When aiming to align the output of a language model (LLM)
more closely with the desired outcome, there are several options to consider. One approach
involves modifying the prompt itself, while another involves working with hyperparameters of the
model>>>";

completionOptions.Messages.Add(new ChatRequestUserMessage(prompt));
var response = await openAIClient.GetChatCompletionsAsync(completionOptions);
var completions = response.Value;
completions.Choices[0].Message.Content.Display();

After running this code, one possible output displayed in the notebook is as follows:

There are various ways to bring the output of a language model (LLM) closer to the intended
result. One method is to adjust the prompt, while another involves tweaking the model's
hyperparameters.

Note that the previous code uses the Chat Completion version of the API. A similar result could have been obtained through the following code, which uses the Completion API and an older model:

var completionOptions = new CompletionsOptions
{
   DeploymentName=AOAI_DEPLOYMENTID,
   Prompts={prompt},
   MaxTokens=500,
   Temperature=0.2f,
   FrequencyPenalty=0.0f,
   PresencePenalty=0.0f,
   NucleusSamplingFactor=1,
   StopSequences={"."}
};
Completions response = await openAIClient.GetCompletionsAsync(completionOptions);
response.Choices.First().Text.Display();
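
As mentioned earlier, both APIs also come in a streaming flavor. With the same SDK version, a streaming chat call presumably looks like the following sketch, where chatOptions stands for the ChatCompletionsOptions instance built in the chat example above; content updates are printed as they arrive:

// Streaming variant (sketch): print each piece of content as soon as it is received.
await foreach (StreamingChatCompletionsUpdate update in
    openAIClient.GetChatCompletionsStreaming(chatOptions))
{
    if (!string.IsNullOrEmpty(update.ContentUpdate))
    {
        Console.Write(update.ContentUpdate);
    }
}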

Setting things up in Python

If you prefer working with Python, put the following equivalent code in a Jupyter Notebook:

import os
import openai
from openai import AzureOpenAI
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

client = AzureOpenAI(
  azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
  api_key=os.getenv("AZURE_OPENAI_KEY"),
  api_version="2023-09-01-preview"
)
deployment_name = os.getenv("AOAI_DEPLOYMENTID")
context = [ {'role': 'user',
             'content': "rephrase the following text: 'When aiming to align the "
                        "output of a language model (LLM) more closely with the desired outcome, "
                        "there are several options to consider: one approach involves modifying "
                        "the prompt itself, while another involves working with hyperparameters "
                        "of the model.'"} ]
response = client.chat.completions.create(
                       model=deployment_name,
                       messages=context,
                       temperature=0.7)
response.choices[0].message.content

This is based on the OpenAI Python SDK v1.6.0, which can be installed via pip install openai.