
Unlocking the Power of OpenAI's Text-Embedding-3 Technology

Chapter 1: Introduction to Text-Embedding-3

If you are developing a sophisticated bot or automation with OpenAI, evaluating the similarity between various text segments is likely essential. Text embeddings facilitate this process.

By the conclusion of this article, you will gain insights into using Text-Embedding-3, witness its performance enhancements, and engage in a hands-on project that utilizes text embeddings.

Before we dive in! 🦸🏻‍♀️

If you enjoy this topic and wish to support my work, please clap for this article 50 times. Your encouragement means a lot! 👏 Also, consider following me on Medium and subscribing for my latest articles. 🫶

Section 1.1: Overview of Text-Embedding-3

OpenAI has unveiled a new embedding model known as Text-Embedding-3, which features two distinct variants:

  • text-embedding-3-small: A compact and efficient option.
  • text-embedding-3-large: A more powerful, larger model.

The text-embedding-3-small variant raises the average score on the multilingual retrieval benchmark (MIRACL) from 31.4% to 44.0% compared with the previous model, text-embedding-ada-002, and improves the average score on the English-task benchmark (MTEB) from 61.0% to 62.3%.

Additionally, the text-embedding-3-large model can create embeddings with up to 3072 dimensions, achieving average scores of 54.9% on MIRACL and 64.6% on MTEB, a marked improvement over both. Notably, the smaller model is also five times cheaper per token than text-embedding-ada-002.

Section 1.2: Pricing of Text-Embedding-3

The costs associated with Text-Embedding-3 are as follows:

  • For text-embedding-3-small, the fee is $0.00002 per 1K tokens.
  • For text-embedding-3-large, it is $0.00013 per 1K tokens.
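
To put those rates in perspective, here is a quick back-of-the-envelope cost calculation (the 10-million-token corpus size is a made-up example):

# Rough cost estimate for embedding a corpus (illustrative numbers only)
price_per_1k = {
    "text-embedding-3-small": 0.00002,
    "text-embedding-3-large": 0.00013,
}

corpus_tokens = 10_000_000  # hypothetical corpus of 10M tokens
for model, price in price_per_1k.items():
    cost = corpus_tokens / 1000 * price
    print(f"{model}: ${cost:.2f}")

# text-embedding-3-small: $0.20
# text-embedding-3-large: $1.30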

Chapter 2: Implementing Text-Embedding-3

The video titled "Intro to OpenAI GPT-4 Text-Embedding-3 (New-Update)" provides a comprehensive introduction to the new embedding features and their applications.

To get started with Text-Embedding-3, let’s delve into coding.

Section 2.1: Setting Up Your Environment

First, install the OpenAI package from your terminal:

pip install openai

On macOS or Linux, you may need pip3 instead:

pip3 install openai

Once the necessary libraries are installed, we will import the required modules:

import os

import numpy as np
from openai import OpenAI

Next, set your OpenAI API key and create a client:

os.environ["OPENAI_API_KEY"] = "Your OpenAPI Key"

client = OpenAI()

Now, let's create a function that accepts input text and the embedding model. This function will ensure the text is properly formatted for the embedding model by removing newline characters, and subsequently, it will generate an embedding vector.

def convert_text_to_embedding(input_text, embedding_model="text-embedding-3-small"):
    # Replace newline characters so the input is a single clean line
    processed_text = input_text.replace("\n", " ")
    # Generate the embedding vector for the processed text using the specified model
    embedding_vector = client.embeddings.create(
        input=[processed_text], model=embedding_model
    ).data[0].embedding
    return embedding_vector

Let’s convert the text "Good Morning" into an embedding vector:

text = "Good Morning"

print(len(convert_text_to_embedding(text)))

This will yield an embedding of 1536 dimensions, the default size for text-embedding-3-small. For those less familiar with the technical specifics, let's break things down further.
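
Before moving on, note that the text-embedding-3 models also accept an optional dimensions parameter that returns a shortened embedding when you don't need the full vector; a minimal sketch:

# Request a smaller 256-dimensional vector (text-embedding-3 models only)
short_vector = client.embeddings.create(
    input=["Good Morning"],
    model="text-embedding-3-small",
    dimensions=256,
).data[0].embedding

print(len(short_vector))  # 256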

Section 2.2: Calculating Cosine Similarity

We will utilize text embedding technology to assess the similarity between two sentences by creating a cosine_similarity function.

What is Cosine Similarity?

Cosine similarity quantifies the similarity of two vectors in an inner product space as the cosine of the angle between them.
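
For two vectors A and B, it is the dot product divided by the product of their magnitudes:

cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

A score of 1 means the vectors point in the same direction, 0 means they are unrelated (orthogonal), and -1 means they point in opposite directions.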

Here’s how to create a function that calculates this similarity:

def calculate_cosine_similarity(vector_a, vector_b):
    # Compute the dot product of the two vectors
    dot_prod = np.dot(vector_a, vector_b)
    # Calculate the magnitude (norm) of each vector
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    # Divide the dot product by the product of the magnitudes
    cos_similarity = dot_prod / (norm_a * norm_b)
    return cos_similarity
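
As an aside, OpenAI embeddings are returned normalized to length 1, so for these particular vectors the cosine similarity reduces to a plain dot product:

def calculate_cosine_similarity_fast(vector_a, vector_b):
    # OpenAI embeddings are unit-length, so dividing by the norms
    # (both equal to 1) is unnecessary
    return np.dot(vector_a, vector_b)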

With the convert_text_to_embedding function defined above, we can now evaluate the similarity between two given texts.

embed_vector1 = np.array(convert_text_to_embedding("Text1"))
embed_vector2 = np.array(convert_text_to_embedding("Text2"))

# Calculate similarity
similarity_score = calculate_cosine_similarity(embed_vector1, embed_vector2)
print(f"Cosine similarity: {similarity_score}")

Let’s test this function using two sentences: "HotDog is good" and "I love chocolate."

Given that both phrases pertain to food, we should expect some level of similarity.
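
Here's that test, using the functions defined above:

embed_vector1 = np.array(convert_text_to_embedding("HotDog is good"))
embed_vector2 = np.array(convert_text_to_embedding("I love chocolate"))

similarity_score = calculate_cosine_similarity(embed_vector1, embed_vector2)
print(f"Cosine similarity: {similarity_score}")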

Cosine similarity: 0.21515305543885868

The similarity is modest, which makes sense: both sentences relate to food, but a hot dog and chocolate convey distinct meanings.

For another example, consider "Monkey" and "Banana":

Cosine similarity: 0.41164899669893257

Despite being quite different things, the relatively high score reflects the strong contextual association between monkeys and bananas.

Section 2.3: Comparing Different Embedding Models

Next, we will compare the performance of different embedding models to illustrate their capabilities in calculating sentence similarity.

def get_embedding_small(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def get_embedding_large(text, model="text-embedding-3-large"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def get_embedding_ada(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding
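
Since these three wrappers differ only in their default model argument, a single parameterized helper would do the same job; a small refactoring sketch:

def get_embedding(text, model):
    # Identical logic, with the model name supplied by the caller
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding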

Let's calculate and compare the similarity of the outputs from each model using the words "banana" and "monkey."

text1 = "banana"

text2 = "monkey"

vector1 = np.array(get_embedding_small(text1))

vector2 = np.array(get_embedding_small(text2))

print("text-embedding-3-small")

similarity = calculate_cosine_similarity(vector1, vector2)

print(f"Cosine similarity: {similarity}")

# Repeat for other models...
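
Rather than literally repeating that block for each model, you can loop over the three helpers (a compact equivalent of the comparison above):

models = {
    "text-embedding-3-small": get_embedding_small,
    "text-embedding-3-large": get_embedding_large,
    "text-embedding-ada-002": get_embedding_ada,
}

for name, embed in models.items():
    v1 = np.array(embed(text1))
    v2 = np.array(embed(text2))
    print(name)
    print(f"Cosine similarity: {calculate_cosine_similarity(v1, v2)}")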

Next, I prompted ChatGPT to generate two sentences that are semantically similar:

  1. "The sun set slowly in the west, painting the sky with shades of orange and pink."
  2. "As the day ended, the horizon was ablaze with a beautiful mixture of orange and pink hues."

The cosine similarity scores should reflect their similarity:

text-embedding-3-small
Cosine similarity: 0.8428169121751229

text-embedding-3-large
Cosine similarity: 0.860045753726878

text-embedding-ada-002
Cosine similarity: 0.9588945157630332

The text-embedding-ada-002 model reports the highest number here, but as the next example shows, it assigns high scores even to unrelated sentences, so the text-embedding-3 scores are more representative of actual semantic similarity.

Next, consider these two sentences that are entirely dissimilar:

  1. "A cat lounged lazily in the warm afternoon sun, purring contentedly."
  2. "Heavy rain clouds gathered ominously, signalling an approaching thunderstorm."

The results are as follows:

text-embedding-3-small
Cosine similarity: 0.014516211086674564

text-embedding-3-large
Cosine similarity: 0.035299185035971375

text-embedding-ada-002
Cosine similarity: 0.7360667138087893

The lower scores for the text-embedding-3 models indicate their ability to recognize dissimilarity, while the text-embedding-ada-002 model tends to produce higher values.

In conclusion, across all tests, text-embedding-3 demonstrates superior performance in accurately capturing semantic similarity and dissimilarity.

Chapter 3: Future Prospects

Looking ahead, the embedding models discussed in this article will facilitate more advanced natural language processing, and the accuracy of language-related and text generation tasks seems likely to improve significantly as these models advance.

More Ideas On My Page

🧙‍♂️ I am an AI application expert! If you're interested in collaborating on a project, please reach out or book a one-on-one consulting call with me.

📚 Feel free to explore my other articles:

The video titled "OpenAI's Text-Embedding-3 in 7 Minutes" offers a quick overview of the Text-Embedding-3 features and how they can be applied.
