Semantic-Text-Splitter - Create meaningful chunks from documents

  • Published 4 Jul 2024
  • In this video I want to show you a package which uses BERT to create chunks - semantic-text-splitter
    Repo: pypi.org/project/semantic-tex...
    Code: github.com/Coding-Crashkurse/...
    Timestamps
    0:00 Introduction
    0:57 Code walkthrough

Comments • 39

  • @codingcrashcourses8533
    @codingcrashcourses8533  4 months ago

    I made a mistake in this video: this splitter does NOT accept a full model, it only accepts the tokenizer. Sorry for that. So I am still looking for a good way to create LLM-based chunks :(
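    The distinction matters in practice: the package sizes chunks against a tokenizer's token counts instead of judging meaning with a full BERT model. Below is a minimal stdlib-only sketch of that token-budget idea, not the library's actual API — whitespace tokens stand in for a real subword tokenizer, and all names and the capacity value are illustrative:

    ```python
    def token_count(text: str) -> int:
        # Stand-in for a real tokenizer; a BERT tokenizer would count subwords.
        return len(text.split())

    def chunk_by_token_budget(text: str, capacity: int) -> list[str]:
        """Greedily pack sentences into chunks of at most `capacity` tokens,
        splitting on sentence boundaries so chunks stay intact."""
        sentences = [s.strip() for s in text.split(". ") if s.strip()]
        chunks, current = [], ""
        for sentence in sentences:
            candidate = f"{current} {sentence}".strip()
            if current and token_count(candidate) > capacity:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
        return chunks

    text = "Chunking matters for RAG. Bad chunks cut sentences in half. Good chunks keep one topic together."
    print(chunk_by_token_budget(text, capacity=8))
    ```

    The point of the sketch: capacity is enforced by counting tokens, so nothing here "understands" the text — which is exactly why an LLM-based splitter is still an open question in this thread.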

    • @nmstoker
      @nmstoker 4 months ago

      It's a shame but I think the underlying idea of what you were after makes sense. It amuses me that so often people try LLMs with RAG outputs that even a typical human would struggle with!

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +5

      @@nmstoker I will release a video on how to make an LLM-based splitter next :). If nobody else wants to do it, let's do it ourselves :)

    • @nathank5140
      @nathank5140 4 months ago +1

      Following for that. I have many meeting transcripts with discussions between two or more participants. The conversation is often non-linear, with topics revisited multiple times. I'm trying to find a good way to embed the content. I'm thinking of writing one or more articles on each meeting and then chunking those. Not sure; I would appreciate any ideas.

    • @vibhavinayak8527
      @vibhavinayak8527 1 month ago

      @@codingcrashcourses8533 Looks like some people have implemented 'Advanced Agentic Chunking' which actually uses an LLM to do so! Maybe you should make a video about it?
      Thank you for your content, love your videos!

    • @codingcrashcourses8533
      @codingcrashcourses8533  1 month ago

      @@vibhavinayak8527 Currently learning LangGraph, but I still struggle with that

  • @henkhbit5748
    @henkhbit5748 3 months ago +1

    Yes, a much better chunking approach. Thanks for showing it 👍

  • @micbab-vg2mu
    @micbab-vg2mu 4 months ago +3

    Thank you for the video :) I agree, random chunking every 500 or 1000 tokens gives random results.

  • @kenj4136
    @kenj4136 4 months ago +3

    Your tutorials are gold, thanks!

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago

      Thanks so much, honestly I am quite surprised that so many people watch and like that video

  • @MikewasG
    @MikewasG 4 months ago +1

    Thank you for your effort! The video is very helpful!

  • @Munk-tt6tz
    @Munk-tt6tz 2 months ago

    Exactly what i was looking for, thanks!

  • @ashleymavericks
    @ashleymavericks 4 months ago

    This is a brilliant idea!

  • @andreypetrunin5702
    @andreypetrunin5702 4 months ago

    Thank you! Very useful!

  • @pillaideepakb
    @pillaideepakb 4 months ago

    This is amazing

  • @znacibrateSANDU
    @znacibrateSANDU 4 months ago

    Thank you

  • @ahmadzaimhilmi
    @ahmadzaimhilmi 4 months ago

    Very intuitive approach to improving RAG performance. I wonder whether the bar chart at the end would be better replaced by a 2-dimensional representation and evaluated with KNN.

  • @user-lg6dl7gr9e
    @user-lg6dl7gr9e 4 months ago +1

    We need a LangChain-in-production course; I hope you consider it!

  • @bertobertoberto3
    @bertobertoberto3 3 months ago

    Wow

  • @maxlgemeinderat9202
    @maxlgemeinderat9202 4 months ago

    Interesting! I saw the LangChain implementation. Do you prefer this one, and could the tokenizer be any embedding model?

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +2

      There is a difference between an embedding model and a tokenizer, I hope you are aware of that. If yes, I didn't understand the question

  • @moonly3781
    @moonly3781 4 months ago

    Stumbled upon your amazing videos and want to thank you for the incredible tutorials. Truly amazing content!
    I'm developing a study-advisor chatbot that answers students' questions based on detailed university course descriptions, each roughly the length of a PDF page. The challenge is that descriptions vary across universities, both for similar course names and in length. Each document starts with the university name and the course description. I've tried adding the university name and course description before every significant point, which helped when chunking by regex to ensure all relevant information is contained within each chunk. Despite this, when asking university-specific questions, the correct course description for the queried university sometimes doesn't appear in the retrieved chunks. Considering a description is about a page of text, do you have a better approach to this problem or any tips? Really sorry for the long question :) I would be very grateful for the help.

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +1

      It depends on your use case. I think embeddings for one whole PDF are quite trashy, but if you need the whole document you can have a look at a parent-child retriever: you embed very small documents but pass the larger, related document to the LLM. Not sure what to do with the noise part; LLMs can handle SOME noise :)
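      The parent-child idea above can be sketched without any framework: match against small child chunks, but return the full parent document. Everything below is illustrative — word-overlap scoring is a hypothetical stand-in for real embedding similarity, and all names are made up for the example:

      ```python
      def overlap_score(query: str, text: str) -> int:
          # Hypothetical stand-in for embedding similarity: count shared words.
          return len(set(query.lower().split()) & set(text.lower().split()))

      def build_index(parents: list[str], child_size: int = 5):
          """Split each parent into small word-window children, remembering
          which parent each child came from."""
          index = []  # list of (child_text, parent_id)
          for pid, doc in enumerate(parents):
              words = doc.split()
              for i in range(0, len(words), child_size):
                  index.append((" ".join(words[i:i + child_size]), pid))
          return index

      def retrieve_parent(query: str, parents: list[str], index) -> str:
          # Match the query against the small children, but hand back the
          # whole parent document for the LLM.
          best_child, best_pid = max(index, key=lambda e: overlap_score(query, e[0]))
          return parents[best_pid]

      parents = [
          "The billing module retries failed payments three times before alerting support.",
          "The search module tokenizes queries and ranks documents by BM25 score.",
      ]
      index = build_index(parents)
      print(retrieve_parent("how are failed payments retried", parents, index))
      ```

      The design point: small children give precise matches, while the parent keeps the surrounding context the LLM needs.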

  • @raphauy
    @raphauy 4 months ago

    Thanks for the video. Is there a way to do this with TypeScript?

  • @thevadimb
    @thevadimb 1 month ago

    Why didn't you like the Langchain implementation of the semantic splitter? What was the problem with it?

  • @user-sw2se1xz6r
    @user-sw2se1xz6r 4 months ago

    Is it theoretically possible to have a normal LLM like Llama 2 or Mistral do the splitting?
    The idea would be to have a completely local alternative running on top of Ollama.
    I see that semantic-text-splitter uses a tokenizer for that, and I understand the difference.
    I am just curious whether it would be possible.
    Thanks for the vids btw, learned a lot from them. ✌✌

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +1

      Sure, that is possible. You can treat it as a normal task for the LLM. I would add that the output should contain delimiter characters an output parser can use to split it into multiple chunks.
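      The delimiter idea sketched in code: prompt the model to echo the text back with a marker (say `---`) between topics, then let a trivial output parser split on it. The prompt wording and the marker are assumptions, and a canned string stands in for the real LLM call:

      ```python
      # Hypothetical prompt; a real setup would send this to Ollama/Mistral etc.
      SPLIT_PROMPT = (
          "Copy the following text verbatim, but insert a line containing "
          "'---' wherever the topic changes:\n\n{text}"
      )

      def parse_chunks(llm_output: str) -> list[str]:
          """Output parser: split the model's response on the '---' marker
          and drop empty pieces."""
          return [part.strip() for part in llm_output.split("---") if part.strip()]

      # Canned response standing in for an actual LLM call.
      llm_output = """RAG quality depends heavily on chunking.
      ---
      Tokenizer-based splitters only count tokens; they do not judge meaning.
      ---
      An LLM can be prompted to mark topic boundaries itself."""

      for chunk in parse_chunks(llm_output):
          print(chunk)
      ```

      This is exactly the "characters an output parser can use" point: the LLM does the semantic work, and plain string splitting recovers the chunks.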

    • @nathank5140
      @nathank5140 4 months ago

      @@codingcrashcourses8533 What do you mean? Can you provide an example?

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago

      @@nathank5140 I will release a video on that topic on friday! :)

  • @mansidhingra4118
    @mansidhingra4118 28 days ago

    Hi, thanks for this brilliant video. Really thoughtful of you. Just one question: when I tried to import HuggingFaceTextSplitter, I received an ImportError -- "ImportError: cannot import name 'HuggingFaceTextSplitter' from 'semantic_text_splitter'". Any idea how to make it work?

    • @codingcrashcourses8533
      @codingcrashcourses8533  27 days ago

      Currently not. Maybe they changed the import path. What version do you use?

    • @mansidhingra4118
      @mansidhingra4118 27 days ago

      @@codingcrashcourses8533 Thank you for your response. The version of semantic_text_splitter I'm currently using is 0.13.3
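      A hedged way to cope with that ImportError: `HuggingFaceTextSplitter` existed in early releases of the package, while later releases (0.13.x included, as far as I can tell) appear to expose a single `TextSplitter` class instead, so probing for whichever name is present avoids hard-coding one. The sketch below also degrades gracefully when the package is not installed at all:

      ```python
      import importlib.util

      def resolve_splitter():
          """Return (name, class) for whichever splitter entry point this
          semantic-text-splitter release exposes, or (None, None) if the
          package is not installed."""
          if importlib.util.find_spec("semantic_text_splitter") is None:
              return None, None
          import semantic_text_splitter as sts
          # Try the newer name first, then the older one from the video.
          for name in ("TextSplitter", "HuggingFaceTextSplitter"):
              cls = getattr(sts, name, None)
              if cls is not None:
                  return name, cls
          return None, None

      name, splitter_cls = resolve_splitter()
      print(name or "semantic_text_splitter is not installed")
      ```

      Pinning the package version in requirements is the simpler fix once you know which API your code targets; the probe is only useful when you have to support both.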

  • @jasonsting
    @jasonsting 4 months ago

    Since this solution creates "meaningful" chunks, implying that there can be meaningless or less meaningful chunks, would that then imply that these chunks affect the semantic quality of embeddings/vector database? I was previously getting garbage out of a chromadb/faiss test and this would explain it.

    • @codingcrashcourses8533
      @codingcrashcourses8533  4 months ago +3

      I would argue that there are two different kinds of "trash" chunks: 1. docs that just get cut off and lose their meaning; 2. chunks that are too large and cover multiple topics -> the embeddings just don't mean anything.