Postgres pgvector Extension - Vector Database with PostgreSQL / Langchain Integration
- Date added: 25 June 2024
- Blog Post: bugbytes.io/posts/vector-data...
In this video, we'll look at the pgvector extension for PostgreSQL, which allows you to turn your Postgres database into a vector data-store!
pgvector adds the vector data-type and distance computation operators (L2, inner product, and cosine distance) to allow you to query for "similar" items in the vector-space.
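These three measures map to SQL operators in pgvector (`<->` for L2 distance, `<#>` for negative inner product, `<=>` for cosine distance). As a rough plain-Python sketch of what those operators compute (the tiny 3-dimensional vectors are illustrative; real embeddings have far more dimensions):

```python
import math

def l2_distance(a, b):
    # Euclidean distance, what pgvector's <-> operator returns
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    # pgvector's <#> operator returns the *negative* of this value
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity, what pgvector's <=> operator returns
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - inner_product(a, b) / (norm_a * norm_b)

a, b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
print(l2_distance(a, b))      # sqrt(2) ~= 1.414
print(cosine_distance(a, b))  # 1.0 (orthogonal vectors)
```

Smaller distances mean "more similar" for L2 and cosine distance, which is why similarity queries in pgvector typically `ORDER BY embedding <=> query_vector` ascending.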
We'll see how to set pgvector up in a Docker container, and will see how to integrate it with Langchain via the PGVector object.
We'll look at how to take a piece of text, split it into chunks, create embeddings from those chunks using OpenAI, and then store the embeddings in the Postgres vector database. We'll also see how to query the database for vectors/documents that are similar to a text prompt/query.
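To give a feel for the chunking step, here is a heavily simplified, hypothetical sketch of a recursive character splitter. This is NOT Langchain's actual RecursiveCharacterTextSplitter (which also handles chunk overlap and custom length functions); it only shows the core idea: try the coarsest separator first, merge pieces back together greedily, and recurse with finer separators when a piece is still too big.

```python
def split_text(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    """Split text on the coarsest separator present, keeping chunks <= chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    # pick the first separator that occurs in the text ("" always matches)
    idx = next(i for i, s in enumerate(separators) if s == "" or s in text)
    sep = separators[idx]
    parts = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for part in parts:
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # greedily merge pieces back together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) > chunk_size:
            # piece is still too big: recurse with the finer separators
            chunks.extend(split_text(part, chunk_size, separators[idx + 1:] or ("",)))
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

chunks = split_text(
    "First paragraph.\n\nSecond paragraph, which is a bit longer.",
    chunk_size=30,
)
# -> ["First paragraph.", "Second paragraph, which is a", "bit longer."]
```

Splitting on paragraph and sentence boundaries first keeps each chunk semantically coherent, which tends to produce better embeddings than hard-splitting every N characters.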
☕️ 𝗕𝘂𝘆 𝗺𝗲 𝗮 𝗰𝗼𝗳𝗳𝗲𝗲:
To support the channel and encourage new videos, please consider buying me a coffee here:
ko-fi.com/bugbytes
📌 𝗖𝗵𝗮𝗽𝘁𝗲𝗿𝘀:
00:00 Intro
00:41 Introduction to pgvector for PostgreSQL
03:23 Splitting text file into chunks with Langchain RecursiveCharacterTextSplitter
06:10 Using OpenAI to get embeddings for each chunk with OpenAIEmbeddings object
10:54 Setting up pgvector and PostgreSQL in a Docker container
16:38 Using the Langchain PGVector object to connect to PostgreSQL
21:47 Finding similar vectors to a query in pgvector
25:29 Querying pgvector with SQL to get cosine distances
𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮:
📖 Blog: bugbytes.io/posts/vector-data...
👾 Github: github.com/bugbytes-io/
🐦 Twitter: / bugbytesio
📚 𝗙𝘂𝗿𝘁𝗵𝗲𝗿 𝗿𝗲𝗮𝗱𝗶𝗻𝗴 𝗮𝗻𝗱 𝗶𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻:
Blog Post: bugbytes.io/posts/vector-data...
pgvector: github.com/pgvector/pgvector
pgvector DockerHub image: hub.docker.com/r/ankane/pgvector
State of the Union text: github.com/hwchase17/chroma-l...
OpenAI Embeddings: platform.openai.com/docs/guid...
Langchain Vectorstores: python.langchain.com/docs/mod...
#python #langchain #datascience #postgresql
Very thorough walkthrough. Thanks!
Most excellent. I am now a monthly supporter. You deserve to be paid.
Brilliant content. Concise, no waffle. Thank you
Thanks a lot!
Thank you so much for sharing the details. Your informative YouTube videos have been incredibly helpful. Great job on putting together such valuable content! Keep up the outstanding work and continue enlightening us. We truly appreciate your contributions!
Thanks a lot, glad to hear that the videos have been helpful - thanks for watching and supporting the channel!
Just yesterday I thought "pgvector would be interesting to see a video about".
And then you publish this! 👏👏👏🥳
Thank you Lyle. 🙏
Thanks a lot Sil!
Extremely complex concepts presented in the simplest way! I typed out the whole notebook and ran it without errors! Thank you for the clarity!
Thanks a lot, really happy to hear that! Cheers!
Dude thanks for making this. I always learn something from your videos. Thank you!
Thanks a lot, glad to hear that! Thank you for the support!
Really appreciate your efforts you have put in for this tutorial
Thanks a lot!
Fantastic comprehensive walkthrough of how to use PGVector and Python to work with vectors for your AI stuff 😀
Thanks a lot Mattias!
@@bugbytes3923 thank YOU, now looking into the one where you use Django as the front end to all of this 😊
What a fantastic video.
Thank you, BugBytes !
Thanks a lot!
Thank you so much for this tutorial! Very, very high quality!
Thanks a lot, glad you liked!
clear and well structured. you have an amazing style of teaching.
Awesome to hear, thanks a lot!
Straightforward explanation. Thank you
Thanks a lot!
oh man.. it has been a while and it is still the best tutorial out there.. It will be great to see something with pgvector again with django-ninja...
Thanks a lot! I'd love to do some more on PGVector - if anyone has any project ideas, let me know here!
This was super helpful, thanks!
Glad to hear that - thanks for watching!
thank you so much for this content!
Thanks a lot for watching!
thanks for this, it was a great help!
Glad to hear that! Thank you for watching.
fantastic content, thank you! Would be great if you could do a more in-depth video on how to do indexing (HNSW) with the same Jupyter notebook example
Thanks man, Great content!
Thanks a lot!
powerful libs - yes, it's almost as if AI 'needs' a highly artistic oracle to 'shape' its 'stance' in order to focus on the goals/needs of the user/app
Great job! Extremely useful! Thanks.
Thanks a lot!
Very interesting!
Thanks!
Great Video Sir
Thanks a lot!
Thanks this is very helpful
Thanks a lot!
Good contents. Thanks.
Thanks a lot!
I am having this error, please help me solve it:
Could not open extension control file "/PostgreSQL/16/share/extension/vector.control": No such file or directory. Extension "vector" is not available.
Loving your videos man, thank you for the clear, concise explanations of these topics. Do you have any videos using RAG and agents in Django? I am using Django REST API and have been struggling with an agent controller that works fine in the notebook but then times out in my API request with the exact same code, using Chat ReAct Description?
thanks. really helpful
Thanks for watching!
@@bugbytes3923 Hey I have this error. do you know why?
connection_string = "postgresql+psycopg2://user:pass@localhost:5432/db"
collection_name = 'financial_qa'
db = PGVector.from_documents(
embedding=instructor_embeddings,
documents=texts,
collection_name=collection_name,
connection_string=connection_string
)
File ~\.conda\envs\financial_qa\lib\site-packages\langchain\vectorstores\pgvector.py:578, in PGVector.from_documents(cls, documents, embedding, collection_name, distance_strategy, ids, pre_delete_collection, **kwargs)
574 connection_string = cls.get_connection_string(kwargs)
576 kwargs["connection_string"] = connection_string
--> 578 return cls.from_texts(
579 texts=texts,
580 pre_delete_collection=pre_delete_collection,
581 embedding=embedding,
582 distance_strategy=distance_strategy,
583 metadatas=metadatas,
584 ids=ids,
585 collection_name=collection_name,
586 **kwargs,
587 )
File ~\.conda\envs\financial_qa\lib\site-packages\langchain\vectorstores\pgvector.py:453, in PGVector.from_texts(cls, texts, embedding, metadatas, collection_name, distance_strategy, ids, pre_delete_collection, **kwargs)
445 """
446 Return VectorStore initialized from texts and embeddings.
447 Postgres connection string is required
448 "Either pass it as a parameter
449 or set the PGVECTOR_CONNECTION_STRING environment variable.
450 """
451 embeddings = embedding.embed_documents(list(texts))
--> 453 return cls.__from(
454 texts,
455 embeddings,
456 embedding,
457 metadatas=metadatas,
458 ids=ids,
459 collection_name=collection_name,
460 distance_strategy=distance_strategy,
461 pre_delete_collection=pre_delete_collection,
462 **kwargs,
463 )
File ~\.conda\envs\financial_qa\lib\site-packages\langchain\vectorstores\pgvector.py:213, in PGVector.__from(cls, texts, embeddings, embedding, metadatas, ids, collection_name, distance_strategy, pre_delete_collection, **kwargs)
210 metadatas = [{} for _ in texts]
211 connection_string = cls.get_connection_string(kwargs)
--> 213 store = cls(
214 connection_string=connection_string,
215 collection_name=collection_name,
216 embedding_function=embedding,
217 distance_strategy=distance_strategy,
218 pre_delete_collection=pre_delete_collection,
219 **kwargs,
220 )
222 store.add_embeddings(
223 texts=texts, embeddings=embeddings, metadatas=metadatas, ids=ids, **kwargs
224 )
226 return store
TypeError: langchain.vectorstores.pgvector.PGVector() got multiple values for keyword argument 'connection_string'
@@bugbytes3923 Never mind - the cause was another connection_string set in the virtual environment.
Is there any way I can use the data from the Postgres database directly, instead of using documents data?
Thanks,
I am having this error when creating the "vector" extension:
ERROR: Could not open extension control file "C:/Program Files/PostgreSQL/16/share/extension/vector.control": No such file or directory
Have you solved this problem? Please help me with this.
Thank you so much for great video!, can please cover on Anthropic Claude with PGVECTOR. That would be a great help !
Is there any way to do hybrid search with this? Meaning, is it possible to do something like keyword search or some other filtering before doing semantic similarity? Or is this kind of feature only available in specific paid vector databases?
Fantastic video! Would be interesting to see a follow up on how this might work with Django?
Thanks a lot - I am planning a short video on Django and pgvector. There's a useful extension that integrates the two - coming soon!
@@bugbytes3923 Could I ask what the extension is so I could have a look while you're creating the video. Love your content!
@@helloh6 Thanks a lot! It's the same library I installed in this video to work with pgvector - this library has modules for working with Django - more details here:
github.com/pgvector/pgvector-python#django
@@bugbytes3923 Amazing, thanks!
thanks for the video!
do you know if there's a way to save the database locally after it's been initialised with `db = PGVector.from_documents(
embedding=embeddings, documents=chunks, connection_string=connection_string
)`?
e.g. Faiss has a save_local() function
Fantastic! Where is the Jupyter notebook?
Excellent video! Any chance of using S-BERT to generate the embeddings instead of OpenAI's ada embeddings? A possible code snippet would be appreciated. Thanks, and love your content.
Edit:
- Problem 1: My Postgres container is within WSL2, which I cannot connect to with pgAdmin from Windows.
- Solution: connect a pgAdmin container to the pgvector container.
- Problem 2: Object of type PosixPath is not JSON serializable.
- Solution: change my PosixPath to a string and pass it to TextLoader.
Supabase uses their vec client for Postgres/pgvector. This does not need Docker, but we are then limited to their free plan's 50MB of storage. What do you think?
What PostgreSQL permissions or operator functions are required or recommended for pgvector?
hey ! how do i get the uuid of records of langchain_pg_embeddings table to delete it later.
Is there any tutorial where I already have a table in Postgres? I uploaded all the documents and created the index without Langchain, and now I want to access that database - but all the tutorials start from raw data and create the vector store in the process.
What is the rationale behind calling embed_query vs embed_documents?
Is there any way to store in custom schema defined instead of public schema??
Is it possible to use Chroma DB to load SQL data into a vector DB? There are not a lot of resources and I need to learn that.
Hi, how to change default table names? like langchain_pg_collection to something else
great video. How does this compare to FTS for search? When would you want to use that over this? Would they get the same results in this case for example?
Thanks! The mechanism for FTS is different, so there's no guarantee that the same results would be reached. Maybe I could do a video quickly comparing these methods!
@@bugbytes3923 Would be a nice video I think. One advantage of FTS over this for searching products is that, on a public website, you can't be DDoS'd into running up your API costs.
Super interesting video. I’m wondering if you know about how to prompt properly to openai to generate the vectors. By this I mean if there are ways to improve the quality of the vectors to query so the answer can be more precise. Thanks
With embedding models there is no prompting; these are not chat models.
@@nedyalkokarabadzhakov5405 so basically the embedding needs to be created from the most accurate text that you can provide, right?
where can I get the notebook for this?
GPT fine-tuning and embeddings
Hi, I followed the steps you mentioned in the blog but am facing an issue while connecting and inserting vectors into the Postgres database.
Please find the error below:
texts = [d.page_content for d in documents]
^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'page_content'
hi, do you know what dimensions value should I use when creating vector column?
In this video, it should be 1536 dimensions. We used OpenAI's latest embedding model to create the embeddings, which has an output dimension of 1536.
platform.openai.com/docs/guides/embeddings/second-generation-models
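One practical consequence: a pgvector column is declared with a fixed dimension (e.g. `vector(1536)`), and inserting an embedding of a different length fails. A small, hypothetical sanity check before inserting (the helper name and the zero-filled stand-in embedding are illustrative, not from the video):

```python
# The column dimension must match the embedding model's output length.
EXPECTED_DIM = 1536  # output size of OpenAI's ada-002 embedding model

def check_embedding(embedding, expected_dim=EXPECTED_DIM):
    """Raise if the embedding's length doesn't match the declared column dimension."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding has {len(embedding)} dims, column expects {expected_dim}"
        )
    return embedding

fake_embedding = [0.0] * 1536  # stand-in for a real OpenAI embedding
check_embedding(fake_embedding)  # passes; a 768-dim vector would raise
```

If you switch embedding models (e.g. to an S-BERT model with 768-dim output), the vector column has to be recreated with the new dimension.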
@@bugbytes3923 thank you
Hey, Can you also try to experiment with Langfuse and how it can be leveraged ?
I'll need to look into Langfuse. But possibly! I'm planning more GPT/vector/langchain videos.
-p 5432:5432
PostgreSQL and pgvector have the same port mapping - is that right?
Yes, that's right!
Will there be any way to use the postgresql db tables directly instead of txt files?
With LLMs - I'll release a video this week on Retrieval Augmented Generation, where we use the DB table with Langchain and use the results of a DB query as context to an LLM prompt.
Waiting!!@@bugbytes3923
How can I do this with .docx, .csv and .pptx files?
hello. great video, helped me a lot with exactly what I was looking for!
Keep up the good work.
I have a question. I followed your video and downloaded the Docker image, and I have my pgAdmin4, but when I try creating the extension, it says: Could not open extension control file "C:/Program Files/PostgreSQL/15/share/extension/vector.control": No such file or directory. Extension "vector" is not available
Do you maybe know what is going on?
Thank you in advance
Thank you!
Regarding your problem: did you add the port mapping in the Docker run command, i.e. -p 5432:5432?
I suspect that pgAdmin is trying to connect to Postgres running locally on your machine, rather than in the Docker container. Do you have Postgres running locally? You may need to stop it if it's using the same port as the Docker container.
Not sure though, but let me know if you get it fixed or if you're still stuck!
@@bugbytes3923 oh, thank you so much!
Postgres did run locally on my machine on the same port as the Docker container, so I had to stop those processes, and now it works!
can't wait for the django video with pgvector! keep up the good work
I am also facing it - can you please add the steps so I can solve this too?
Thank you in advance
@@ajaypalsingh6329 if you have both Docker and local Postgres in your pgAdmin, you should stop those processes within Task Manager. Go to Processes and end all processes relating to your Postgres. That is what worked for me, honestly.
I don't know if you have the same issue.
@@ajaypalsingh6329 windows or Mac?
Hi, I find this video very informative and easy to understand.
However, I am getting the below error
when downloading the pgvector image: Error response from daemon: pull access denied for arcane/pgvector, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
try this:
docker pull pgvector/pgvector:pg16
How do I close the pgvector connection after everything is done?
There's no mention of installing PostgreSQL first.
The installation is done via the Docker commands.
@@bugbytes3923 ers\Administrator> docker run --name pgvector5-demo -e POSTGRES_PASSWORD=test -p 5432:5432 ankane/pgvector
popen failure: Cannot allocate memory
initdb: error: program "postgres" is needed by initdb but was not found in the same directory as "/usr/lib/postgresql/15/bin/initdb"
Despite following the post steps several times, the error still appears. Maybe it's because I'm using Win10.
Hi, I'm getting the below error while running the CREATE EXTENSION vector query in the database. Can you please help?
ERROR: Could not open extension control file "C:/Program Files/PostgreSQL/16/share/extension/vector.control": No such file or directory. Extension "vector" is not available
ERROR: extension "vector" is not available
SQL state: 0A000
Detail: Could not open extension control file "C:/Program Files/PostgreSQL/16/share/extension/vector.control": No such file or directory.
Hint: The extension must first be installed on the system where PostgreSQL is running.
could you find a solution to this issue?
Kindly help me with the below error..
When I try to execute CREATE EXTENSION vector I'm getting the below error
ERROR: Could not open extension control file "/usr/share/postgresql/16/extension/vector.control": No such file or directory. Extension "vector" is not available
ERROR: extension "vector" is not available
SQL state: 0A000
Detail: Could not open extension control file "/usr/share/postgresql/16/extension/vector.control": No such file or directory.
Hint: The extension must first be installed on the system where PostgreSQL is running.
Note - both Postgres and pgvector running in docker
This worked for me: CREATE EXTENSION vector;
And I used this Docker image: docker pull pgvector/pgvector:pg16
Typo in the blogpost:
`CREATE EXTENSION vector;` instead of `CREATE EXTENSION pgvector;`
Super awesome! It would be great to see this integrated with django-ninja to build a chat-with-PDF app (but without using ChatGPT) - something similar to this: czcams.com/video/rIV1EseKwU4/video.html, which is essentially from the primordial privateGPT...