Basic Usage

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CHEAPESTINFERENCE_API_KEY"],
    base_url="https://api.cheapestinference.ai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

print(response.choices[0].message.content)

Message Roles

Chat messages support three roles:

System Messages

Set the assistant’s behavior and context:
messages = [
    {
        "role": "system",
        "content": "You are a professional translator. Translate all user messages to Spanish."
    }
]

User Messages

Messages from the user:
messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

Assistant Messages

Previous responses from the assistant (for conversation history):
messages = [
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "What about 3+3?"}
]

Parameters

Temperature

Control randomness (0.0 to 2.0):
# More focused and deterministic
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a professional email"}],
    temperature=0.3
)

# More creative and diverse
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a creative story"}],
    temperature=0.9
)

Max Tokens

Limit response length:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain quantum physics"}],
    max_tokens=100  # Short response
)

Top P (Nucleus Sampling)

Alternative to temperature (0.0 to 1.0):
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Give me ideas"}],
    top_p=0.9  # Consider top 90% probability mass
)

Stop Sequences

Stop generation at specific tokens:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "List 5 fruits"}],
    stop=["\n\n", "6."]  # Stop at double newline or "6."
)

Frequency Penalty

Reduce repetition (-2.0 to 2.0):
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a poem"}],
    frequency_penalty=0.5  # Discourage repetition
)

Presence Penalty

Encourage new topics (-2.0 to 2.0):
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Brainstorm ideas"}],
    presence_penalty=0.6  # Encourage diversity
)

Completion Window

Time window for completing the request. Can be null, a duration string (e.g., "1s", "24h", "7d"), or "now" for immediate processing. Example:
"24h"

Webhook URL

Optional webhook URL to receive completion notifications. Must be a valid HTTPS URL. Example:
"https://example.com/webhook"

Notification Email

Optional email address to receive completion notifications.
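
These fields are not part of the standard chat completion signature in the OpenAI SDK. One way to attach them, assuming the API accepts them as top-level request fields with the names shown below (check the API reference for the exact names), is the SDK's extra_body argument:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Summarize this report"}],
    extra_body={
        "completion_window": "24h",                    # assumed field name
        "webhook_url": "https://example.com/webhook",  # assumed field name
        "notification_email": "alerts@example.com",    # assumed field name
    },
)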

Streaming

Stream responses token-by-token for better UX:
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Multi-turn Conversations

Maintain conversation context:
messages = [
    {"role": "system", "content": "You are a helpful math tutor."}
]

# First turn
messages.append({"role": "user", "content": "What is 5 + 3?"})
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages
)
messages.append({"role": "assistant", "content": response.choices[0].message.content})

# Second turn
messages.append({"role": "user", "content": "What about 10 + 7?"})
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages
)

Response Format

Standard Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1729728000,
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}
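
With the Python SDK, the same fields are available as attributes on the response object:
# Access the assistant's reply, the finish reason, and token usage
print(response.choices[0].message.content)  # "The capital of France is Paris."
print(response.choices[0].finish_reason)    # "stop"
print(response.usage.total_tokens)          # 33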

Streaming Response

{"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729728000,"model":"meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"finish_reason":null}]}
{"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1729728000,"model":"meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

Error Handling

Handle API errors gracefully:
import os
from openai import OpenAI, APIError, RateLimitError

client = OpenAI(
    api_key=os.environ["CHEAPESTINFERENCE_API_KEY"],
    base_url="https://api.cheapestinference.ai/v1",
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except RateLimitError:
    print("Rate limit exceeded. Please try again later.")
except APIError as e:
    print(f"API error: {e.message}")

Best Practices

System Prompts

  • Set clear instructions in the system message
  • Define the assistant’s role and constraints
  • Include examples if needed

Conversation Management

  • Truncate old messages to stay within the context window (see the sketch below)
  • Keep important context at the beginning
  • Use summarization for very long conversations

Parameter Tuning

  • Lower temperature (0.3-0.5) for factual tasks
  • Higher temperature (0.7-1.0) for creative tasks
  • Use max_tokens to control costs

Streaming UX

  • Always stream for user-facing applications
  • Show typing indicators during generation
  • Allow users to stop generation
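
Below is a minimal sketch of the history truncation mentioned under Conversation Management. It keeps the system message plus only the most recent turns; the 10-message cap is arbitrary:
def truncate_history(messages, max_messages=10):
    # Keep any system messages plus the most recent turns
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

messages = truncate_history(messages)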

Examples

Customer Support Bot

messages = [
    {
        "role": "system",
        "content": """You are a customer support agent for TechCorp.
        - Be helpful and professional
        - If you don't know, say so and offer to escalate
        - Keep responses concise"""
    },
    {"role": "user", "content": "How do I reset my password?"}
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages,
    temperature=0.3
)

Code Assistant

messages = [
    {
        "role": "system",
        "content": "You are an expert Python programmer. Provide code examples and explanations."
    },
    {"role": "user", "content": "How do I read a CSV file in Python?"}
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages,
    temperature=0.2
)

Creative Writing

messages = [
    {
        "role": "system",
        "content": "You are a creative writer specializing in science fiction."
    },
    {"role": "user", "content": "Write the opening paragraph of a story about time travel"}
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=messages,
    temperature=0.9,
    max_tokens=200
)