How to Build Knowledge Graphs With LLMs (python tutorial)

  • Published 19 Jun 2024
  • My previous video, showcasing how to use Large Language Models (LLMs) together with graph databases like Neo4J, had a lot of traction and many people were asking for details on the implementation.
    So, here’s a technical walkthrough from scratch, going through how I built the demo step by step.
    Watch the previous video here: • Advanced RAG with Know...
    Connect with me on LinkedIn: / johannesjolkkonen
    Code repository for this video: github.com/JohannesJolkkonen/...
    ▬▬▬▬▬▬ T I M E S T A M P S ▬▬▬▬▬▬
    0:00 - Intro
    1:03 - Environment overview
    2:47 - Neo4J & OpenAI Configuration
    7:50 - Helper functions
    9:15 - Identifying entities and relationships in data
    21:49 - Generating Cypher Statements
    33:20 - Running the full pipeline
    38:30 - Monitoring token consumption

Comments • 74

  • @123unhooked
    @123unhooked 6 months ago +3

    Thank you so much for also including the price tag. Seeing that proof-of-concepts like this only add up to a few cents is really encouraging, and makes me want to go try it out. Everything else in this video was absolute gold too. Really complete, really A-to-Z. Thank you so much.

  • @lhxperimental
    @lhxperimental 6 months ago +6

    Thank you. Subscribed. There are so many AI channels that just talk about how you can build this and that with LLMs and other word-soup techniques, but don't actually show the process.

  • @Epistemophilos
    @Epistemophilos 6 months ago +3

    Great video without any annoying music, thanks! Would be great to see a from-scratch video about how you actually use this to answer user questions, combining the graph data and LLM capabilities.

  • @chrisogonas
    @chrisogonas 2 months ago +1

    That was a great share on knowledge graphs and LLMs. Thanks for putting it together.

  • @KCM25NJL
    @KCM25NJL 6 months ago +5

    First of all, first-class presentation! I've been considering building something quite similar, to utilise knowledge graphs as a method of storing long-term memory for ChatGPT by proxy of function calling. The vague idea I have floating in my head is that the relationships could be automated using the LLM at inference time with some well-formatted prompts. The last part of the video where you showcase Cypher generation is probably the missing piece of the puzzle for connecting the storage (Neo4J), and it's great for updating the knowledge graph. I just hope you get a chance to showcase a bi-directional example of this in your part 2, as right now I'm not strong on ingesting a knowledge graph in a way that makes for seamless LLM output when the graph is used to supplement it.

  • @vivalancsweert9913
    @vivalancsweert9913 6 months ago +1

    That was incredibly interesting and inspiring! Thank you!

  • @w_chadly
    @w_chadly 6 months ago +2

    Thank you for sharing this! This is going to help so many organizations that can't afford teams of data analysts. To have this much insight into their data... 🤯

  • @kewpietonkatsu
    @kewpietonkatsu 6 months ago +1

    Very good round-up. I just started following you. This is as useful as papers.

  • @masked00000
    @masked00000 4 months ago

    Excellent video, thank you for actually coding and showing the process. I was stuck on this for a long time.

  • @kamiln8398
    @kamiln8398 6 months ago +1

    On my to-do list. Was looking for this, many thanks!

  • @JelckedeBoer
    @JelckedeBoer 6 months ago +1

    Thanks for sharing, great content!

  • @Music4ever326
    @Music4ever326 6 months ago

    Great video! One thing I would like to add: I think that for larger datasets it is faster / more efficient to use the import tool that comes with Neo4j Aura, instead of executing a separate query for each node / relation.

  • @andydataguy
    @andydataguy 6 months ago +1

    Looking forward to part 2

  • @sgttomas
    @sgttomas 4 months ago

    Amazing! Thank you very much.

  • @andydataguy
    @andydataguy 6 months ago +1

    You absolute Chad! 🙏🏾

  • @jebisetitut
    @jebisetitut 6 months ago

    Great job!

  • @michalstun5187
    @michalstun5187 4 months ago

    Great video, thanks for that!
    I would also be interested in data quality here. I noticed a few inconsistencies in your input data. How did the LLM cope with that? How accurate is the output knowledge graph? Can you make a more detailed comparison or share the output file, please?

  • @Doggy_Styles_Coding
    @Doggy_Styles_Coding 7 months ago +1

    This is awesome, thanks for the work. Your setup reminds me of Windows XP :D

  • @hassanullah1997
    @hassanullah1997 6 months ago +3

    Great video Johannes, thanks!
    Just wondering whether you could do a retrieval example of this?
    Would be great to see how it compares to a vector store. When you read online, there's a lot saying that retrieval is slower and less efficient, but I'm not sure what to think.
    Would be great to get your insight with a video to explain.

    • @johannesjolkkonen
      @johannesjolkkonen  6 months ago +4

      Hey Hassan, thank you very much!
      I'm going to create a video of the retrieval component very soon, and that will be out within the next couple of days (:

    • @hassanullah1997
      @hassanullah1997 6 months ago

      @@johannesjolkkonen good stuff. Look forward to it brother :)
      Do you offer consulting on stuff like this? Working on a startup where I think KG will play a role. Do you have an email I can send information to?
      Keep up the good work :)

    • @andydataguy
      @andydataguy 6 months ago

      @@johannesjolkkonen let's gooooo

  • @satri101
    @satri101 5 months ago +1

    Great video! I really learned a lot and enjoyed it. Thanks!

  • @EmilioGagliardi
    @EmilioGagliardi 6 months ago +1

    Excellent presentation. Love the detail and depth. Have you had to perform this on email text? I'm processing a large number of emails, so the grammar, and hence the entities and relationships, are not so clearly delineated. Was wondering if you've seen anything on performing this kind of LLM extraction on email texts, or if you have any suggestions. I just started my journey into graphs and it's super cool, so I definitely enjoy this content. Cheers,

    • @johannesjolkkonen
      @johannesjolkkonen  6 months ago +1

      Thank you Emilio!
      I haven't tried this myself, but it's a great idea. Extracting entities from the contents themselves might be hard for the reasons you said, but you could definitely get the sender-recipient relationships on a graph, and then use LLMs to add things like sentiment scores and email-thread summaries to those relationships. Maybe also segment email-conversations under some categories, to get a more high-level understanding of what themes people are discussing over emails.
      This is not so different from how graphs are already being used in social networks and content recommenders, but the LLMs definitely add more possibilities to the picture. Keep me posted if you end up doing something like this!
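      The sender-recipient idea above can be sketched with Python's standard-library email parser. This is a minimal, hypothetical sketch (the SENT_TO relationship name and the helper function are not from the video); a real pipeline would hand these triplets on to the Cypher-generation step.

      ```python
      from email import message_from_string
      from email.utils import getaddresses

      def email_to_triplets(raw_email: str):
          """Extract (sender, SENT_TO, recipient) triplets from a raw email.

          Hypothetical sketch: a fuller pipeline could also attach LLM-generated
          sentiment scores or thread summaries as relationship properties.
          """
          msg = message_from_string(raw_email)
          senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
          recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
          return [(s, "SENT_TO", r) for s in senders for r in recipients]

      raw = (
          "From: alice@example.com\n"
          "To: bob@example.com, carol@example.com\n"
          "Subject: Quarterly report\n\n"
          "Hi both, see attached.\n"
      )
      print(email_to_triplets(raw))
      # → [('alice@example.com', 'SENT_TO', 'bob@example.com'),
      #    ('alice@example.com', 'SENT_TO', 'carol@example.com')]
      ```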

  • @NarenBalu-vx9hn
    @NarenBalu-vx9hn 2 months ago

    I would like to hear your approach when it comes to structured CSV files, like sales reports.

  • @hy3na-xyz
    @hy3na-xyz 6 months ago +2

    This is awesome bro, just subscribed. Is this the same for open-source models? I wanted to host them using LM Studio etc.

    • @johannesjolkkonen
      @johannesjolkkonen  6 months ago

      Sure, it's exactly the same in principle! Of course the quality of entity extraction might vary between models, with OpenAI's models being top level.
      But this works fine with GPT-3.5, so you could most likely get similar results with Llama 2 (:

  • @MrDonald911
    @MrDonald911 6 months ago

    For the second part with the chat, are you using code that basically converts natural language to Cypher, then runs the Cypher on the KG, returns the result, and uses that result to generate natural language? That would be cool to see. Also, I think I saw a similar video of yours; are you using Google's Vertex AI by any chance, with text-bison models? Thanks

    • @johannesjolkkonen
      @johannesjolkkonen  6 months ago

      Yeah, that's exactly the idea. That video will be coming this Thursday or Friday!
      I haven't been using Vertex AI actually, so that video was probably not mine 😅

  • @user-fe6ns8mu5p
    @user-fe6ns8mu5p 3 months ago

    Any ideas on how to scale this? If the document set is large, entity duplication can happen in the graph, right? How do we solve that?

  • @theuser810
    @theuser810 1 day ago +1

    Can we do this with LlamaIndex?

  • @joffreylemery6414
    @joffreylemery6414 7 months ago +1

    Great job, Johannes!
    I'm curious to discuss the benefits of going with Azure OpenAI instead of going directly to OpenAI.
    Thanks, and once again, great job!

    • @johannesjolkkonen
      @johannesjolkkonen  6 months ago +5

      Hey Joffrey, thank you!
      The main reason is that on Azure, you can run the model with a dedicated and isolated endpoint, and all data that's passed to that endpoint is covered by Azure's enterprise-grade data privacy guarantees. Another thing which is important for a lot of companies is that you can choose the region in which this endpoint is hosted 🌍🌎

    • @mikelewis1166
      @mikelewis1166 6 months ago

      @@johannesjolkkonen Can you point me to where Azure offers privacy guarantees to OpenAI users on their platform? We were considering it for our clients, but I cannot find documentation that seems to include Azure OpenAI under their data-privacy terms.

  • @nikhilshingadiya7798
    @nikhilshingadiya7798 7 months ago +2

    Awesome man ❤❤❤🎉🎉 I love the way you present. If I had a button to subscribe more, I would hit it a million times.

  • @u2b83
    @u2b83 6 months ago

    Cloud platforms are currently such an annoyance because of the "essential complexity" required for them to capture service charges and maintain security. It makes me think of the dial-up model of internet access back in the day or the clumsy process of installing printer drivers, instead of plugging the thing in and selecting print.

  • @moviecules1697
    @moviecules1697 1 month ago

    What if you have the Neo4j Desktop version? How do you access it from your code?

  • @krishnakandula6587
    @krishnakandula6587 1 month ago

    How are the prompt templates written? Are there any guidelines for writing those?

  • @johny1n
    @johny1n 2 months ago

    Do you think you can make it for a github repo or incoming code?

  • @pichirisu
    @pichirisu 6 months ago

    So what's the difference between drawing this on a whiteboard from statistical information and using a graph that does what you could draw on a whiteboard from statistical information?

  • @Milind-eu4fc
    @Milind-eu4fc 2 days ago

    Can I integrate the same solution with Memgraph?

  • @user-bm5tf5gp8u
    @user-bm5tf5gp8u 2 months ago

    Hi. Thanks a lot for the helpful content. I have a question. When I run the ingestion_pipeline() function, I only get two entities in Neo4j. Those that have a space in their name are not covered. Could you please guide me to solve the issue?

    • @johannesjolkkonen
      @johannesjolkkonen  2 months ago

      Hey!
      Yeah, the important thing to ensure is that the node & relationship types and property keys don't have spaces or special characters; those aren't allowed and will cause the statements to fail. I recall I already had some sanitization in the Cypher generation to ensure this, but in any case you should be able to fix it pretty easily by running some .replace(" ", "") in your code, getting rid of spaces in the names before generating & running the Cypher statements.
      Hope this helps!
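      As a minimal sketch of that sanitization step (a hypothetical helper, not the code from the repo), spaces can be turned into underscores and any remaining special characters stripped before the names go into Cypher:

      ```python
      import re

      def sanitize_label(name: str) -> str:
          """Make an LLM-extracted name safe to use as a Neo4j label or
          relationship type: turn spaces into underscores, then drop any
          character that is not alphanumeric or an underscore."""
          return re.sub(r"[^0-9A-Za-z_]", "", name.replace(" ", "_"))

      print(sanitize_label("Project Manager"))   # → Project_Manager
      print(sanitize_label("works-for (role)"))  # → worksfor_role
      ```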

  • @3ti65
    @3ti65 2 months ago +1

    Great video! Is there a reason why you didn't have the LLM generate the Cypher?

    • @johannesjolkkonen
      @johannesjolkkonen  2 months ago +1

      Thanks! Generating just the relationship triplets is a simpler, less error-prone task for the LLM than generating complete Cypher with correct syntax. And because converting those triplets to Cypher is just a matter of some string parsing, we might as well use Python for that.
      It's always a good idea to do as much as possible with just plain old code, using LLMs only where necessary. A bit more work maybe, but a lot more reliable (:
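      That triplet-to-Cypher string parsing might look something like this minimal sketch (the function name and single-label scheme are assumptions, not the repo's actual code; it also assumes the names are already sanitized, with no quotes or spaces in the relationship type):

      ```python
      def triplet_to_cypher(head: str, rel: str, tail: str, label: str = "Entity") -> str:
          """Turn one (head, relationship, tail) triplet into a Cypher MERGE
          statement via plain string formatting. MERGE (rather than CREATE)
          keeps repeated mentions of the same entity from duplicating nodes."""
          return (
              f'MERGE (a:{label} {{name: "{head}"}}) '
              f'MERGE (b:{label} {{name: "{tail}"}}) '
              f"MERGE (a)-[:{rel}]->(b)"
          )

      print(triplet_to_cypher("Alice", "WORKS_FOR", "AlphaCorp", label="Person"))
      ```

      The resulting statements can then be run against the database one by one, or batched for larger ingests.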

    • @3ti65
      @3ti65 2 months ago

      @@johannesjolkkonen I see, makes sense. Appreciate the answer! :)

  • @ryanslab302
    @ryanslab302 6 months ago +1

    Make your text as big as possible when sharing your screen. Thank you for your video.

  • @AshWickramasinghe
    @AshWickramasinghe 6 months ago +1

    Is it possible to use a free alternative for Neo4j?

    • @johannesjolkkonen
      @johannesjolkkonen  6 months ago

      Hey! As mentioned in the tutorial, Neo4j Aura offers your first instance for free, up to 1 million nodes.
      I'm not aware of a graph database that offers free and unlimited capacity.

  • @PijanitsaVode
    @PijanitsaVode 2 months ago

    40 minutes of cooking, no linguistics, but that's the trend.
    All in all, one can copy it all except
    - Azure subscription
    - entity typing and some Neo4J setup
    - data sources
    ?

  • @MegaNightdude
    @MegaNightdude 4 months ago

    😊

  • @Yogic-ignition
    @Yogic-ignition 7 months ago +2

    A few questions here:
    1. Can I retrieve the source documents from which the response was generated, using the graph knowledge?
    2. How can I avoid duplication of data if I am planning to ingest data from multiple sources (creating a data-ingest pipeline)?
    3. How do I update data in the database that was true last week but now isn't (e.g. until last week my device had 3 ports, but now it has 6 ports)?
    Thanks in advance!

    • @johannesjolkkonen
      @johannesjolkkonen  7 months ago +1

      Hey Mukesh, great questions!
      1. Sure. You can store some metadata about the source documents (document title, link, etc.) in each node and relationship, and then include that metadata in the query results when querying the database. You could also have the documents as nodes of their own, with relationships to all nodes that originate from the document.
      2. This is a key challenge in these applications. Sadly I haven't done work on this myself yet, but here are a few good resources about it: czcams.com/video/dNGV4sLkOcA/video.html and margin.re/2023/06/entity-resolution-in-reagent/
      A lot of people have asked about this same thing, and it'll definitely be a topic for a video soon (:
      3. Assuming you have a good way to ID the nodes, it's pretty easy to match the nodes by ID and update their attributes with SET statements. See here: neo4j.com/docs/getting-started/cypher-intro/updating/
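      Point 3 can be sketched as a small helper that builds a parameterized MATCH ... SET statement (hypothetical names; it assumes each node carries a unique `id` property, and uses query parameters, which the Neo4j driver accepts alongside the query and which also sidestep injection issues with untrusted values):

      ```python
      def build_update(node_id: str, props: dict) -> tuple:
          """Build a parameterized Cypher statement that matches a node by its
          `id` property and overwrites the given properties with SET.
          Returns (query, parameters) to pass to the driver's session.run()."""
          query = "MATCH (n {id: $id}) SET " + ", ".join(
              f"n.{key} = ${key}" for key in props
          )
          return query, {"id": node_id, **props}

      query, params = build_update("device-42", {"ports": 6})
      print(query)
      # → MATCH (n {id: $id}) SET n.ports = $ports
      ```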

    • @Yogic-ignition
      @Yogic-ignition 7 months ago

      @@johannesjolkkonen thank you for your time and help, highly appreciated 😊 Looking forward to the data-deduplication and ingest-pipeline video.
      *The notification bell icon is ON*

  • @karlarsch7068
    @karlarsch7068 6 months ago

    Well, thank you. That was a very interesting 40 minutes.
    Are you aware that your mic picks up the noise your arms make on the desk? And I don't know if "python" in a title is very smart, it scares at least me ;) Nah, it's probably fine, just kidding.

  • @dougclendening5896
    @dougclendening5896 5 months ago

    I'm a little confused. Why would you want to do it this way instead of using a hardcoded config of value types and their relationships?
    You're calling it unstructured data, but it's anything but unstructured. It has clear fields and values.
    So, I'm trying to understand the benefit here.

    • @johannesjolkkonen
      @johannesjolkkonen  5 months ago

      Hey Doug, that's a valid question. It's true that the markdown files here are structured very neatly, with headings that work almost like fields. In this case, you could get something similar with just standard text parsing, extracting values from the markdown based on the headers.
      However, the approach of using LLMs generalizes to more complex situations, working with longer and messier documents (like PDFs) where the entities and relationships are more implicit and text parsing won't get you there.
      Hope that makes sense!

    • @dougclendening5896
      @dougclendening5896 5 months ago

      @johannesjolkkonen it does make sense. I was weighing the cost-benefit ratio of using an LLM and token processing for such neatly structured data. It would be very expensive vs parsing them with a config.

  • @ramon1664
    @ramon1664 6 months ago +4

    You probably should not show your OpenAI key like that.

    • @johannesjolkkonen
      @johannesjolkkonen  6 months ago +12

      I re-generated all the credentials shown here before publishing, so these ones don't work anymore (:
      But it's a great point that I probably should've mentioned in the video, always rotate your creds!

  • @u2b83
    @u2b83 6 months ago +1

    So this is what happens with the data from the quarterly HR forms/questionnaires lol

  • @luisdanielmesa
    @luisdanielmesa 6 months ago +4

    ...horrible injection vulnerability

    • @dinoscheidt
      @dinoscheidt 6 months ago +11

      It’s a simple concept video… in a Jupyter notebook… without tests or anything. Injection risks are really the far end of hurdles for public production code here.

    • @123unhooked
      @123unhooked 6 months ago +6

      That is just rude. Not saying that it is wrong, but criticizing it here is just so out of place. Shame on you for not respecting the quality product this guy offered.

    • @jgcornell
      @jgcornell 4 months ago

      That’s why we need to take data cleansing seriously in AI

    • @MrCocobloco
      @MrCocobloco 3 months ago

      😂

    • @trinityblood5622
      @trinityblood5622 3 months ago

      This guy must be a hardcore relational-database guy, having no relationships outside his primary/foreign key. Don't put such constraints in your life bro... Remove duplicate elements from your life and traverse the nodes of real-world concepts. 😂

  • @aravindarjun4814
    @aravindarjun4814 6 months ago

    Basically, I'm getting this error while running the pipeline. Can you help me fix it?
    Running pipeline for 11 files in project_briefs folder
    Extracting entities and relationships for ./data/project_briefs\AlphaCorp AWS-Powered Sales Analytics Dashboard.md
    Error processing ./data/project_briefs\AlphaCorp AWS-Powered Sales Analytics Dashboard.md: Connection error.
    Extracting entities and relationships for ./data/project_briefs\AlphaCorp Customer Support Chatbot.md
    Error processing ./data/project_briefs\AlphaCorp Customer Support Chatbot.md: Connection error.

    • @johannesjolkkonen
      @johannesjolkkonen  6 months ago

      Hey Aravind!
      That seems like the entity extraction / LLM step is working, but there's an issue connecting to Neo4j. I would check that
      - Your neo4j-instance is running
      - Your connection url is in the correct format (neo4j+s://{your-database-id}.databases.neo4j.io:7687)
      - Your username and password are correct
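      A quick way to catch the second bullet is to sanity-check the URI shape before connecting. The regex below is only an assumption based on the format shown above (treating the Aura database ID as 8 hex characters); when in doubt, copy the URI straight from the Aura console rather than typing it by hand.

      ```python
      import re

      # Hypothetical pattern for an Aura connection URI of the form
      # neo4j+s://{database-id}.databases.neo4j.io:7687 (port optional).
      AURA_URI = re.compile(r"^neo4j\+s://[0-9a-f]{8}\.databases\.neo4j\.io(:7687)?$")

      def looks_like_aura_uri(uri: str) -> bool:
          """Return True if the string matches the expected Aura URI shape."""
          return AURA_URI.match(uri) is not None

      print(looks_like_aura_uri("neo4j+s://1a2b3c4d.databases.neo4j.io:7687"))  # → True
      print(looks_like_aura_uri("bolt://localhost:7687"))                       # → False
      ```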

  • @RedhookHellraisers
    @RedhookHellraisers 6 months ago

    Suomalainen sisu! (Finnish grit!)