Introduction
Short (but long…) weekend project: let’s scratch the surface of the LangChain framework. Because yes, you have one in Ruby! The Python project is of course way more advanced, but you can still build RAG with the Ruby gem!
In this article:
- Write a custom knowledge chatbot from your terminal in Ruby!
- Chat with your stored PDFs
Requirements
Install first (for linux ubuntu/debian):
Before starting with the installation process, make sure you have the following pre-installation requirements:
- Update and upgrade your system:
$ sudo apt update
$ sudo apt upgrade
- Install PostgreSQL: PostgreSQL Installation Guide
- Install Ruby (version 3.2.2): Ruby Installation Guide with RVM
- Install Bundler: Bundler Installation Guide
- Install pgvector for Linux (Ubuntu/Debian):
$ psql --version
# psql (PostgreSQL) XX.X (...)
# your_pg_version_number = XX
$ sudo apt install postgresql-your_pg_version_number-pgvector
pgvector SUPERUSER issue
If you encounter any issues with superuser permissions, follow the steps mentioned in the Server Fault post to add a superuser.
What you should do is follow Chris James’s answer:
Open a new terminal:
$ sudo -u postgres psql postgres
# type "\password postgres"
Enter new password:
# then type "ALTER ROLE postgres SUPERUSER;"
Open AI API Key
Make sure you also have your API key for OpenAI ready before proceeding with the installation.
If not, visit https://platform.openai.com/api-keys, register, create a key, and keep your newly created key nearby (it’s coming!).
Let’s write some code!
First steps – your project files
Create a directory and a Gemfile inside:
# Change directory to the desired parent directory
cd go/to/my/project/directory
# Create a new directory named my_weekend_project
mkdir my_weekend_project
# Move into it
cd my_weekend_project
# Create a new empty file named Gemfile
touch Gemfile
# Create a new empty file named .gitignore
touch .gitignore
# Create a new empty file named .env
touch .env
Optional:
# Create a new empty directory named lib to include your Plain Old Ruby Objects!
$ mkdir lib
With Ruby, it’s objects everywhere! Object Oriented Programming (what is it? read here: https://www.freecodecamp.org/news/what-is-object-oriented-programming/) is advised but optional here.
Open the Gemfile and paste this:
source 'https://rubygems.org'
gem 'langchainrb', "~> 0.9.2"
gem "sequel", "~> 5.77.0"
gem "pg", "~> 1.5.5"
gem "pgvector", "~> 0.2.2"
gem 'ruby-openai', "~> 6.3.1"
gem 'dotenv', '~> 3.0.2'
gem "pdf-reader", "~> 2.12.0"
In your terminal (in the project directory):
$ bundle install
Optional – GitHub
Write in your « .gitignore » file (just skip this if you don’t want to push to GitHub):
.env
Not committing your .env file (which contains the precious API key) is just mandatory <3
Showing a private credential publicly on your GitHub repository is really sad :,O
But it’s OK now, with the .gitignore!
Not Optional
Write in your « .env »:
OPENAI_API_KEY=nowayimwritingmyreelapikeyinthisarticlebutyesinthefile
Using environment variables, like the ones in the .env file, is a common practice in software development for storing sensitive information such as API keys, database credentials, and other configuration values. They keep secrets out of your codebase and separate from your application logic, add an extra layer of security, and centralize configuration in one place so you can update values without modifying your code…. toooo long! Thanks GPT!
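In fact, what `Dotenv.load` does is conceptually simple: read KEY=VALUE lines and copy them into `ENV`. Here is a rough pure-Ruby sketch of the idea (simplified; the real gem also handles quoting, comments inside values, and variable interpolation):

```ruby
# Minimal sketch of what Dotenv.load does: parse KEY=VALUE lines into
# a hash that could be merged into ENV. The real gem does much more.
def parse_env(text)
  text.each_line.with_object({}) do |line, vars|
    line = line.strip
    next if line.empty? || line.start_with?("#") # skip blanks and comments
    key, value = line.split("=", 2)
    vars[key] = value if key && value
  end
end

vars = parse_env("OPENAI_API_KEY=sk-fake\n# a comment\nPOSTGRES_USER=postgres\n")
# vars => {"OPENAI_API_KEY"=>"sk-fake", "POSTGRES_USER"=>"postgres"}
```

The point: the file is plain text, which is exactly why it must never be committed.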
Nice! Fix any bugs (welcome to the coding world, sorry), then create a new file:
$ touch index.rb
Starting to code !
In your index.rb file:
# This line is used to load the Bundler gem and set up the project's load path.
require 'bundler/setup'
# This line tells Bundler to require all the gems specified in the Gemfile under the `:default` group, all gems actually...
Bundler.require(:default)
# This line loads the environment variables from the `.env` file into the project. BOOM!
Dotenv.load
Start configuring the Model you will use:
# rest of the code...
# Initialize a new instance of the Langchain OpenAI LLM
gpt_llm = Langchain::LLM::OpenAI.new(
# Set the API key using the value from the environment variable
api_key: ENV["OPENAI_API_KEY"],
llm_options: {
# Set the model temperature parameter, 0.0 = more deterministic, lower hallucination risk
temperature: 0.0,
# Specify the chat completion model, who talks, 16k context window and cheap
chat_completion_model_name: "gpt-3.5-turbo",
# Set the dimension parameter, so the database will better choose the right text sequence to retrieve
dimension: 1536,
# Specify the embeddings model, who changes text to vectors in your vector database, super cheap one (02/2024)
embeddings_model_name: "text-embedding-3-small"
}
)
puts "LLM ready!"
OK, in just a few lines your parameters, chat model, embeddings model and options are set!
Next, start the vector database (below gpt_llm). You need an existing PostgreSQL database first; if you haven’t created one yet, do it from the terminal:
$ createdb my_cute_pg_vector_db
index.rb
# rest of the code...
# Define the name of the PostgreSQL database
db_name = "my_cute_pg_vector_db"
# Retrieve the PostgreSQL credentials from the environment variable, like a pro
username = ENV["POSTGRES_USER"]
password = ENV["POSTGRES_PASSWORD"]
# Construct the connection URL for PostgreSQL
url = "postgres://#{username}:#{password}@localhost:5432/#{db_name}"
# Initialize a new instance of Langchain Pgvector
pgvector = Langchain::Vectorsearch::Pgvector.new(
url: url, # Set the PostgreSQL connection URL
index_name: "Documents", # Specify the index name, add whatever
llm: gpt_llm, # Pass the previously initialized Langchain OpenAI LLM instance
namespace: nil # Set the namespace to nil, not using it, such a shame... kidding!
)
puts "Database instance is on!"
# No table yet, creating one will be better to save your vectors ;)
pgvector.create_default_schema
puts "Schema created!"
Put credentials in your ‘.env’ file:
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
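Before connecting, it can help to fail fast when a credential is missing. Here is a small sketch; `missing_env_keys` is a hypothetical helper of mine, not part of langchainrb:

```ruby
# Return the required keys that are missing or blank in the given
# environment hash. A hypothetical helper, not part of langchainrb.
def missing_env_keys(required, env = ENV)
  required.reject { |key| env[key] && !env[key].strip.empty? }
end

missing = missing_env_keys(%w[OPENAI_API_KEY POSTGRES_USER POSTGRES_PASSWORD],
                           { "OPENAI_API_KEY" => "sk-fake", "POSTGRES_USER" => "postgres" })
# missing => ["POSTGRES_PASSWORD"]
# In index.rb you would call it with the real ENV and abort early:
# abort("Missing: #{missing.join(', ')}") unless missing.empty?
```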
Look at the « pgvector SUPERUSER issue » section above if you don’t know how to update your credentials… typing « how to change my postgres password » in your search engine also works nicely 😉
Add the knowledge as pdf documents
Now your db is ready… OK, just a bit empty. Better to embed a PDF file 😀 (skip this if you don’t have a PDF)
It’s pretty easy, if you use what’s included:
# rest of the code...
my_pdf = Langchain.root.join("/path/to/a/file.pdf")
pgvector.add_data(paths: [my_pdf])
puts "Loaded!"
But `paths:` takes an array… what about 2 more files? Below it you can write:
my_pdf_2 = Langchain.root.join("/path/to/a/file/in/pdf.pdf")
my_pdf_3 = Langchain.root.join("/path/to/a/third/file/in/pdf.pdf")
pgvector.add_data(paths: [my_pdf_2, my_pdf_3])
puts "More inside! Finally!"
Coooool! (sorry, I’m old, lol :3) 3 PDFs memorized!
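By the way, if you keep a whole folder of PDFs, you don’t have to list them one by one; plain Ruby can collect the paths for you. The folder name and the `pdf_paths` helper below are my own examples, not langchainrb API:

```ruby
require "tmpdir" # only needed for this self-contained demo

# Collect every .pdf path under a folder, sorted for a stable order.
def pdf_paths(folder)
  Dir.glob(File.join(folder, "*.pdf")).sort
end

# Demo with a temporary folder and two fake files:
Dir.mktmpdir do |folder|
  File.write(File.join(folder, "a.pdf"), "")
  File.write(File.join(folder, "b.pdf"), "")
  puts pdf_paths(folder).length # => 2
  # In index.rb you would then pass the real paths along:
  # pgvector.add_data(paths: pdf_paths("my/pdf/folder"))
end
```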
Add the knowledge as text
No PDF? You have text? Nice! The langchainrb gem has your back:
# Add plain text data to your vector search database:
pgvector.add_texts(
texts: [
"Chuck Norris does in fact use a stunt double, but only for crying scenes.",
"Chuck Norris went skydiving and his parachute didn't open. Chuck took it back for a refund."
]
)
puts "Text added!"
Search the database
Now is the time to ask a question to the database.
# rest of the code...
# Display a message prompting the user to enter a query
puts "Write the query: \n"
# Read and store the input as the query
query = gets.chomp
# Set the value of k to specify the number of results to be retrieved
# 1 = 1 chuck norris fact
k = 1
# Perform a similarity search using the Pgvector instance
search_result = pgvector.similarity_search(
query: query, # Pass the user-entered query
k: k # Specify the number of results to be retrieved
)
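Under the hood, `similarity_search` embeds your query and ranks the stored vectors by how close they are, typically using cosine similarity. A toy illustration in plain Ruby, with 3-dimensional vectors instead of the 1536 real embeddings use (the sample data is invented):

```ruby
# Cosine similarity between two equal-length vectors: 1.0 = same direction.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Toy "database": text mapped to tiny fake embedding vectors.
documents = {
  "fact about parachutes" => [0.9, 0.1, 0.0],
  "fact about stunt doubles" => [0.1, 0.9, 0.2]
}
query_vector = [0.8, 0.2, 0.1]

# k = 1: keep only the single closest document, like the search above.
best = documents.max_by { |_text, vec| cosine_similarity(query_vector, vec) }
# best.first => "fact about parachutes"
```

pgvector does exactly this ranking, just inside PostgreSQL and at 1536 dimensions.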
Share with the ai
Cool, now you have the question plus an entry from the database. You just need your favorite (or not?) GPT model to take both and answer with the help of its « long-term » memory.
# rest of the code...
# Retrieve the content from the first (and only, since k = 1) search result
content = search_result.first.content
puts "Retrieved: #{content}"
# Initialize the thread, keeps track of messages in a conversation
thread = Langchain::Thread.new
# Initialize a new instance of Langchain Assistant
assistant = Langchain::Assistant.new(
llm: gpt_llm, # Pass the Langchain OpenAI LLM instance
thread: thread # Specify the thread for the conversation
)
# Construct the prompt with context, query, and placeholder for the answer
prompt = "Context: \n
#{content}\n
Query: \n
#{query}\n
Answer: "
puts prompt
# Add the prompt as a user message to the conversation
assistant.add_message(content: prompt, role: "user")
# Run the assistant to generate a response based on the context and query
assistant.run
puts "Waiting for answer..."
puts "Assistant's Response: #{assistant.thread.messages.last.content}"
Wooohoo! Finally, it is aliiiive! You can only ask one question for now, though…
Put it in a loop, add more interface, colors, whatever features you like 😉
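For example, a loop could look like the sketch below. The `chat_loop` helper is my own name, not part of langchainrb, and the answering step is injected as a block so the loop stays testable; in `index.rb` you would yield the query to `assistant.add_message` + `assistant.run` and return the last thread message instead of the fake echo used here:

```ruby
require "stringio" # only needed for the demo run below

# A tiny REPL: read questions until the user types "exit", hand each one
# to the given block, and print its answer. Input/output are injectable
# so the loop can run without a terminal or an API key.
def chat_loop(input: $stdin, output: $stdout)
  loop do
    output.puts "Write the query (or 'exit'):"
    query = input.gets&.chomp
    break if query.nil? || query == "exit"
    answer = yield(query)
    output.puts "Assistant's Response: #{answer}"
  end
end

# Demo run with canned input instead of a live terminal:
out = StringIO.new
chat_loop(input: StringIO.new("hello\nexit\n"), output: out) { |q| "echo: #{q}" }
# out.string includes "Assistant's Response: echo: hello"
```

In the real script you would call `chat_loop { |q| ... assistant.run ... }` with `$stdin`/`$stdout` defaults.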
To run it:
$ ruby index.rb
Get the code (bit updated) on: https://github.com/alegarn/langchainrb_pgvector_test
First on blog: alegar.ch/blog