Basic Usage

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CHEAPESTINFERENCE_API_KEY"],
    base_url="https://api.cheapestinference.ai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

print(response.choices[0].message.content)

Message Roles

Chat messages support three roles:

System Messages

Set the assistant’s behavior and context:
messages = [
    {
        "role": "system",
        "content": "You are a professional translator. Translate all user messages to Spanish."
    }
]

User Messages

Messages from the user:
messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

Assistant Messages

Previous responses from the assistant (for conversation history):
messages = [
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "What about 3+3?"}
]

Parameters

Temperature

Control randomness (0.0 to 2.0):
# More focused and deterministic
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a professional email"}],
    temperature=0.3
)

# More creative and diverse
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a creative story"}],
    temperature=0.9
)

Max Tokens

Limit response length:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain quantum physics"}],
    max_tokens=100  # Short response
)

Top P (Nucleus Sampling)

Alternative to temperature (0.0 to 1.0):
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Give me ideas"}],
    top_p=0.9  # Consider top 90% probability mass
)

Stop Sequences

Stop generation at specific tokens:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "List 5 fruits"}],
    stop=["\n\n", "6."]  # Stop at double newline or "6."
)

Frequency Penalty

Reduce repetition (-2.0 to 2.0):
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a poem"}],
    frequency_penalty=0.5  # Discourage repetition
)

Presence Penalty

Encourage new topics (-2.0 to 2.0):
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Brainstorm ideas"}],
    presence_penalty=0.6  # Encourage diversity
)

Completion Window

Time window for completing the request. Can be null, a duration string (e.g., "1s", "24h", "7d"), or "now" for immediate processing. Example:
"24h"

Webhook URL

Optional webhook URL to receive completion notifications. Must be a valid HTTPS URL. Example:
"https://example.com/webhook"

Notification Email

Optional email address to receive completion notifications.
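
These fields are not part of the standard chat completion signature in the OpenAI SDK. One way to attach them, assuming the API accepts them as top-level request fields with the names shown below (check the API reference for the exact names), is the SDK's extra_body argument:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Summarize this report"}],
    extra_body={
        "completion_window": "24h",                    # assumed field name
        "webhook_url": "https://example.com/webhook",  # assumed field name
        "notification_email": "alerts@example.com",    # assumed field name
    },
)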

Streaming

Stream responses token-by-token for better UX:
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Multi-turn Conversations

Maintain conversation context:
messages = [
    {"role": "system", "content": "You are a helpful math tutor."}
]

# First turn
messages.append({"role": "user", "content": "What is 5 + 3?"})
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages
)
messages.append({"role": "assistant", "content": response.choices[0].message.content})

# Second turn
messages.append({"role": "user", "content": "What about 10 + 7?"})
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages
)

Response Format

Standard Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1729728000,
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}
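
With the Python SDK, the same fields are available as attributes on the response object:
# Access the assistant's reply, the finish reason, and token usage
print(response.choices[0].message.content)  # "The capital of France is Paris."
print(response.choices[0].finish_reason)    # "stop"
print(response.usage.total_tokens)          # 33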

Streaming Response

{"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729728000,"model":"meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"finish_reason":null}]}
{"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729728000,"model":"meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

Error Handling

Handle API errors gracefully:
import os
from openai import OpenAI, APIError, RateLimitError

client = OpenAI(
    api_key=os.environ["CHEAPESTINFERENCE_API_KEY"],
    base_url="https://api.cheapestinference.ai/v1",
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except RateLimitError:
    print("Rate limit exceeded. Please try again later.")
except APIError as e:
    print(f"API error: {e.message}")

Best Practices

System Prompts

  • Set clear instructions in the system message
  • Define the assistant’s role and constraints
  • Include examples if needed

Conversation Management

  • Truncate old messages to stay within the context window (see the sketch below)
  • Keep important context at the beginning
  • Use summarization for very long conversations

Parameter Tuning

  • Lower temperature (0.3-0.5) for factual tasks
  • Higher temperature (0.7-1.0) for creative tasks
  • Use max_tokens to control costs

Streaming UX

  • Always stream for user-facing applications
  • Show typing indicators during generation
  • Allow users to stop generation
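
Below is a minimal sketch of the history truncation mentioned under Conversation Management. It keeps the system message plus only the most recent turns; the 10-message cap is arbitrary:
def truncate_history(messages, max_messages=10):
    # Keep any system messages plus the most recent turns
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

messages = truncate_history(messages)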

Examples

Customer Support Bot

messages = [
    {
        "role": "system",
        "content": """You are a customer support agent for TechCorp.
        - Be helpful and professional
        - If you don't know, say so and offer to escalate
        - Keep responses concise"""
    },
    {"role": "user", "content": "How do I reset my password?"}
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages,
    temperature=0.3
)

Code Assistant

messages = [
    {
        "role": "system",
        "content": "You are an expert Python programmer. Provide code examples and explanations."
    },
    {"role": "user", "content": "How do I read a CSV file in Python?"}
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages,
    temperature=0.2
)

Creative Writing

messages = [
    {
        "role": "system",
        "content": "You are a creative writer specializing in science fiction."
    },
    {"role": "user", "content": "Write the opening paragraph of a story about time travel"}
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages,
    temperature=0.9,
    max_tokens=200
)