Unlocking the Power of OpenAI's Text-Embedding-3 Technology
Chapter 1: Introduction to Text-Embedding-3
If you are developing a sophisticated bot or automation with OpenAI, evaluating the similarity between various text segments is likely essential. Text embeddings facilitate this process.
By the conclusion of this article, you will gain insights into using Text-Embedding-3, witness its performance enhancements, and engage in a hands-on project that utilizes text embeddings.
Before we dive in! 🦸🏻♀️
If you enjoy this topic and wish to support my work, please clap for this article 50 times. Your encouragement means a lot! 👏 Also, consider following me on Medium and subscribing for my latest articles. 🫶
Section 1.1: Overview of Text-Embedding-3
OpenAI has unveiled a new embedding model known as Text-Embedding-3, which features two distinct variants:
- text-embedding-3-small: A compact and efficient option.
- text-embedding-3-large: A more powerful, larger model.
Compared to the previous model, text-embedding-ada-002, the text-embedding-3-small variant raises the average score on the multilingual retrieval benchmark (MIRACL) from 31.4% to 44.0%, and the average score on the English-task benchmark (MTEB) from 61.0% to 62.3%.
Additionally, the text-embedding-3-large model can create embeddings with up to 3072 dimensions, achieving average scores of 54.9% on MIRACL and 64.6% on MTEB, a marked improvement over its predecessor. Notably, text-embedding-3-small is also about five times cheaper than text-embedding-ada-002.
Section 1.2: Pricing of Text-Embedding-3
The costs associated with Text-Embedding-3 are as follows:
- For text-embedding-3-small, the fee is approximately $0.00002 per 1K tokens.
- For text-embedding-3-large, it is about $0.00013 per 1K tokens.
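To put those numbers in perspective, here is a quick back-of-the-envelope estimate (the workload size below is an assumption chosen purely for illustration):
tokens = 1_000_000  # hypothetical workload: one million tokens of text
price_small_per_1k = 0.00002  # text-embedding-3-small, $ per 1K tokens
price_large_per_1k = 0.00013  # text-embedding-3-large, $ per 1K tokens
print(f"small: ${tokens / 1000 * price_small_per_1k:.2f}")  # ~$0.02
print(f"large: ${tokens / 1000 * price_large_per_1k:.2f}")  # ~$0.13
In other words, embedding a book's worth of text costs only a few cents with either model.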
Chapter 2: Implementing Text-Embedding-3
The video titled "Intro to OpenAI GPT-4 Text-Embedding-3 (New-Update)" provides a comprehensive introduction to the new embedding features and their applications.
To get started with Text-Embedding-3, let’s delve into coding.
Section 2.1: Setting Up Your Environment
First, install the OpenAI Python package from your terminal:
pip install openai
On macOS or Linux, you may need to use pip3 instead:
pip3 install openai
Once the necessary libraries are installed, we will import the required modules:
import os
from openai import OpenAI
import numpy as np
Next, configure your API key and create the client. Throughout this tutorial we will use the text-embedding-3 models, starting with text-embedding-3-small for its speed and cost-efficiency:
os.environ["OPENAI_API_KEY"] = "Your OpenAI API Key"
client = OpenAI()
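If you prefer not to set the environment variable in code, the client also accepts the key directly (a minimal alternative; replace the placeholder with your own key):
client = OpenAI(api_key="Your OpenAI API Key")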
Now, let's create a function that accepts input text and the name of an embedding model. It replaces newline characters with spaces so the text is properly formatted for the embedding model, then generates and returns the embedding vector.
def convert_text_to_embedding(input_text, embedding_model="text-embedding-3-small"):
    # Replace newline characters with spaces so the input is a single line
    processed_text = input_text.replace("\n", " ")
    # Generate the embedding vector for the processed text using the specified model
    response = client.embeddings.create(input=[processed_text], model=embedding_model)
    return response.data[0].embedding
Let’s convert the text "Good Morning" into an embedding vector:
text = "Good Morning"
print(len(convert_text_to_embedding(text)))
This will yield an embedding of 1536 dimensions. For those less familiar with technical specifics, let’s break things down further.
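Incidentally, the v3 embedding models also accept a dimensions parameter that returns a shortened vector, trading a little accuracy for less storage. A minimal sketch, reusing the client from above:
shorter_vector = client.embeddings.create(
    input=["Good Morning"],
    model="text-embedding-3-small",
    dimensions=256,  # request 256 dimensions instead of the default 1536
).data[0].embedding
print(len(shorter_vector))  # 256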
Section 2.2: Calculating Cosine Similarity
We will now use these embeddings to assess how similar two sentences are by writing a calculate_cosine_similarity function.
What is Cosine Similarity?
Cosine similarity quantifies the similarity between two vectors within an inner product space.
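In other words, it is the cosine of the angle between the two vectors:
cosine_similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)
A value close to 1 means the vectors point in nearly the same direction (very similar texts), while a value near 0 means they are essentially unrelated.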
Here’s how to create a function that calculates this similarity:
def calculate_cosine_similarity(vector_a, vector_b):
    # Compute the dot product of the two vectors
    dot_prod = np.dot(vector_a, vector_b)
    # Calculate the magnitude (norm) of each vector
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    # Cosine similarity is the dot product divided by the product of the norms
    cos_similarity = dot_prod / (norm_a * norm_b)
    return cos_similarity
Using the convert_text_to_embedding function defined earlier, we can now evaluate the similarity between two given texts.
embed_vector1 = np.array(convert_text_to_embedding("Text1"))
embed_vector2 = np.array(convert_text_to_embedding("Text2"))
# Calculate similarity
similarity_score = calculate_cosine_similarity(embed_vector1, embed_vector2)
print(f"Cosine similarity: {similarity_score}")
Let’s test this function using two sentences: "HotDog is good" and "I love chocolate."
Given that both phrases pertain to food, we should expect some level of similarity.
Cosine similarity: 0.21515305543885868
The moderate score makes sense: beyond the shared food context, a hot dog and chocolate are still quite different things.
For another example, consider "Monkey" and "Banana":
Cosine similarity: 0.41164899669893257
Although the two words look unrelated on the surface, the relatively high score reflects their strong contextual association: monkeys are commonly linked with bananas.
Section 2.3: Comparing Different Embedding Models
Next, we will compare the performance of different embedding models to illustrate their capabilities in calculating sentence similarity.
def get_embedding_small(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def get_embedding_large(text, model="text-embedding-3-large"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def get_embedding_ada(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding
Let's calculate and compare the similarity of the outputs from each model using the words "banana" and "monkey."
text1 = "banana"
text2 = "monkey"
vector1 = np.array(get_embedding_small(text1))
vector2 = np.array(get_embedding_small(text2))
print("text-embedding-3-small")
similarity = calculate_cosine_similarity(vector1, vector2)
print(f"Cosine similarity: {similarity}")
# Repeat for other models...
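Rather than copying the same block three times, you can also loop over the three helper functions (a small convenience sketch built on the functions defined above):
models = {
    "text-embedding-3-small": get_embedding_small,
    "text-embedding-3-large": get_embedding_large,
    "text-embedding-ada-002": get_embedding_ada,
}
for model_name, embed in models.items():
    vec1 = np.array(embed(text1))
    vec2 = np.array(embed(text2))
    print(model_name)
    print(f"Cosine similarity: {calculate_cosine_similarity(vec1, vec2)}")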
Next, I prompted ChatGPT to generate two sentences that are semantically similar:
- "The sun set slowly in the west, painting the sky with shades of orange and pink."
- "As the day ended, the horizon was ablaze with a beautiful mixture of orange and pink hues."
The cosine similarity scores should reflect their similarity:
text-embedding-3-small
Cosine similarity: 0.8428169121751229
text-embedding-3-large
Cosine similarity: 0.860045753726878
text-embedding-ada-002
Cosine similarity: 0.9588945157630332
text-embedding-ada-002 reports the highest score here, but as the next example shows, it tends to inflate similarity across the board, so the text-embedding-3 scores are a better reflection of the sentences' actual semantic closeness.
Next, consider these two sentences that are entirely dissimilar:
- "A cat lounged lazily in the warm afternoon sun, purring contentedly."
- "Heavy rain clouds gathered ominously, signalling an approaching thunderstorm."
The results are as follows:
text-embedding-3-small
Cosine similarity: 0.014516211086674564
text-embedding-3-large
Cosine similarity: 0.035299185035971375
text-embedding-ada-002
Cosine similarity: 0.7360667138087893
The lower scores for the text-embedding-3 models indicate their ability to recognize dissimilarity, while the text-embedding-ada-002 model tends to produce higher values.
In conclusion, across all tests, text-embedding-3 demonstrates superior performance in accurately capturing semantic similarity and dissimilarity.
Chapter 3: Future Prospects
Looking ahead, the embedding models discussed in this article will underpin increasingly advanced natural language processing applications, and it seems likely that the accuracy of search, similarity, and the generation workflows built on top of them will continue to improve significantly.
More Ideas On My Page
🧙♂️ I am an AI application expert! If you're interested in collaborating on a project, please reach out or book a one-on-one consulting call with me.
📚 Feel free to explore my other articles:
The video titled "OpenAI's Text-Embedding-3 in 7 Minutes" offers a quick overview of the Text-Embedding-3 features and how they can be applied.