This is what I was searching for. Keep it up. Very informative, no bullshit.
Great to hear! Thanks for watching
I just started a master's in data analytics (I'm actually a teacher, though). I'm so glad I found this channel. So effing interesting. Seems like a hell of a time to get into this space.
Found a gem of a channel; I'll learn so many new things now.
Thanks for sharing. Awesome!
Fantastic stuff! It can be applied to so many things. Thanks for enlightening us with such fantastic content; it's a lightning-fast-growing technology and there's not a lot of information on the subject. What I'd like to see is proper fine-tuning via conversation history that gets saved and referenced in a separate vector database from the document analysis. Reminds me of the early web! Everything was still to be done.
Very well explained. Very compact tutorial. Keep going!
Appreciate the support! Thanks for watching
Please upload more videos about LangChain, please ❤
I came across your channel and it is exactly what I have been searching for. Keep up the great work. Small request: can we get a similar video, but for PDFs?
Nice work.
However, for newbies like me, please explain how you got to the .env file section where you entered your API key.
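For anyone stuck at the same step, here's a minimal sketch of the usual python-dotenv setup (the file and variable names follow OpenAI's common convention, so treat them as assumptions rather than exactly what's in the video):

```python
# Contents of a file named ".env" placed next to the notebook:
# OPENAI_API_KEY=sk-...

import os
from dotenv import load_dotenv  # pip install python-dotenv

# Reads .env and exports its entries as environment variables.
# Returns True if a .env file was found and loaded.
load_dotenv()

# LangChain and the OpenAI client look this variable up by default.
print(os.getenv("OPENAI_API_KEY") is not None)
```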
That is crazy good, thanks for the video. New sub here!
Does LangChain send this entire CSV file to OpenAI?
Great video! Can you make one that uses an open-source LLM instead of GPT-4 for handling larger pandas datasets with hundreds of thousands of records, as in actual production scenarios for orders? Thanks!
I personally feel like ChatGPT is not the best AI tool for data analysis work. Writing documentation for the code and then having Copilot write the actual code goes like a million mph, and you don't pay per token.
I agree copilot is superior right now, but things are moving fast
Isn't Copilot powered by OpenAI Codex?
We need to consider the application of these tools by analysts who may not possess programming skills. This is where their usefulness truly shines.
ChatGPT ≠ GPT-4.
If you'd studied, you'd understand this.
@rabbitmetrics I'm a little disappointed you didn't point out here how ChatGPT is a demo implementation of GPT-4 and not the same as the OpenAI APIs for it, where you set your own temperatures.
It is an interesting concept, and I hope it improves with time. Currently it just doesn't work for so many examples: a lot of parsing errors, long chains of retries, plain wrong answers.
Great video, really informative! I have a question regarding the dataframe: does OpenAI have access to the data? I'm curious, if a company has data and wants to use this kind of process, does OpenAI have access to that data? And does this process adhere to GDPR regulations?
A random anecdote: to move yourself up the waitlist for access to Bing Chat (GPT-4), you should set Microsoft as your default for everything, starting with your browser, then Microsoft wallpapers, then the app on your phone, etc. What would a pesky GDPR regulation do once the AI has root access to all machines because it's gatekept otherwise?
Same question. But what if all the data is stored in the Azure cloud? In a way, Microsoft has access to all our data.
These are really excellent videos, thank you. It's just a shame you aren't sharing the workbooks; it really helps to learn when you can run and adjust the code as you go!
Great video. Will this also work with the GPT-3.5 API, or does it need 4? Thanks.
Great video.
I believe the output parser error is related to the format of the output it's attempting to parse. Unless you have set up the proper tools to handle specific formats (like graphs), it might fail.
Thing is, these are still very basic queries that any human can quickly write pandas code for. For complex queries it gets lost. Moreover, both GPT-3 and GPT-4 are prone to basic math mistakes.
But of course the overall direction is pretty awesome; I'd love an agent that reliably writes a bunch of pandas and SQL boilerplate code for me on a daily basis.
Agreed, but I expect LLMs to improve to the point where they write accurate queries consistently.
That was great!
If the DataFrame is too long for the ChatGPT UI prompt, does that mean you can bypass this limit by using LangChain?
Very interesting. Does giving it a specific file to analyze solve the hallucination problem?
Insane!
How does an organization share proprietary data with OpenAI and have the LLM do the work? We need middleware obfuscating the data with some kind of distribution-preserving normalization, so that OpenAI can't reverse-engineer the context over time or take the secret and top-secret data; otherwise none of this is scalable.
It's not. Your best bet would be a local implementation of an LLM like Alpaca or something, and using that.
Even then, I don't think this is the best approach. Maybe it could take the column names and data types (plus metadata) and spit out a formula, with the operation performed on the local machine rather than by OpenAI, for both data security and answer integrity. Also, what if the file is extremely large, like a Parquet file that even GPT-4 can't process? In that case something like Spark could do the transformation or calculation for us. It'd be a great product, tbh.
And yes, pricing for data operations on OpenAI's servers is definitely not sustainable.
A company called Palantir does this
@mattforsythe5037 Palantir created a data-security middleware to communicate with external LLMs using NLP APIs?? Whoa!
Hi, in this approach is the data being shared with OpenAI? My understanding is that we're using a pretrained model and creating an agent for the environment.
Thank you for the excellent video. Doing analytics on a dataframe of my own, with 3,000 columns, I came across the token limit for the model I used (GPT-3.5). Is there any way to overcome it?
What are the advantages of using this method over using OpenAI's advanced analytics plugin?
Currently not much. Today I would look into using AutoGen for automating data analysis with OpenAI
There was one catch while using GPT-4: if we pass multiple dataframes, it only considers the headers in the prompt and treats those as the rows in all the dataframes. Could you please do a video on how to pass multiple dataframes to the GPT-4 pandas dataframe agent?
I'm exploring different ways to work with pandas efficiently at the moment; I'll make a video about this at some point.
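In the meantime, a minimal sketch of one approach: newer LangChain versions accept a list of dataframes, which the agent then refers to as df1, df2, and so on (version-dependent behavior, and the file names here are hypothetical):

```python
import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_pandas_dataframe_agent

# Hypothetical files standing in for your own data.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# Passing a list exposes both frames to the agent as df1 and df2.
agent = create_pandas_dataframe_agent(llm, [orders, customers], verbose=True)
agent.run("How many orders does each customer have?")
```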
So does LangChain use GPT to write a SQL query, query the database, and then output the result? That's pretty impressive.
Could you please tell me how much the GPT-4 API cost for this task? I have only used 3.5 before and have heard that GPT-4 is much more expensive.
It's like 30x more expensive than the 3.5 Turbo model... curious how many tokens these requests soak up!
$20/month for ChatGPT Plus
Thanks so much! Do you have a GitHub or Colab link for the file?
You're welcome! I don't have a repo yet but will post a link.
Hello! Would love that!
Looks interesting. One question I have is whether there will be substantial costs for using OpenAI's models on large data sets.
I'd err on the side of caution when using a service with this pricing model. It wasn't a problem here, but using OpenAI embeddings can get pricey if you're processing large amounts of textual data.
Amazing! Now that OpenAI has just included this in their "code interpreter", is there any way to use a pandas DataFrame with a local model, like StableVicuna, RedPajama, or MPT-7B? Thank you. Liked and subscribed.
Will your dataset be uploaded to OpenAI if you do this? If so, how do I keep my dataset private?
I have a table with hundreds of rows and 20 columns. I even created a smaller table with only the first 5 rows for testing, and I still get this annoying error:
InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 12432 tokens. Please reduce the length of the messages.
It's impossible for me to work with any CSV file like this. What can I do?
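A sketch of two common mitigations, since the agent embeds a preview of the dataframe in every prompt, and wide tables with long text cells can blow past the context window even with few rows (the file and column names here are made up):

```python
import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_pandas_dataframe_agent

df = pd.read_csv("data.csv")  # hypothetical file

# 1) Only hand the agent the columns the question actually needs,
#    so the preview included in the prompt stays small.
slim = df[["order_id", "amount", "date"]]  # hypothetical column names

# 2) Use a model variant with a larger context window.
llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)

agent = create_pandas_dataframe_agent(llm, slim, verbose=True)
agent.run("What is the total amount per month?")
```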
Hey man, thanks for your video! I'm getting an error saying
AuthenticationError:
Output is truncated
Do you know how to fix it?
Can we use this to get answers to a set of questions if we have customer reviews instead of sales data?
Like, could we ask any question related to a product, or get a summary of thousands of review comments?
Indeed, have a look at this video czcams.com/video/UO699Szp82M/video.html
Does this order data get sent to ChatGPT? Is there any way to keep it local? Vicuna?
Hey, your video is very informative and a great tutorial. I have a question: if I use Visual Studio, will the code work inside VS as it does in Jupyter? Or should I write the commands in the CLI, since I have Python on PATH? I'm very new to coding and Python; I hope my question makes sense. Anyway, thank you for the great videos!
VS Code supports Jupyter, so you can run the notebook directly in VS Code. I do it all the time.
@devinwalker9202 Thanks so much, dude. I literally found out about that an hour ago, and then I saw your comment. I wish you the best. Thank you.
Does Pinecone or any other service store and have access to your data? This would be important to know for enterprise applications.
Yes, they have access to the embedding vectors and the metadata about each embedding
Curious, does it also give graphs if you ask it?
Do we run into problems with the token limit?
How can we save the df to Pinecone and query it?
Can we give multiple dataframes as input?
Thanks for the video. Based on my understanding, the OpenAI GPT is able to do the task solely based on the file name and informative column names, because, as you might know, these models are constrained by the context length, so they aren't able to parse the whole file and really analyse the data. In my opinion, we aren't doing anything magical here yet. We can get most of these results using some basic pandas functions like df.describe() or df["Column"].value_counts(). What do you think?
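For comparison, the two plain-pandas calls mentioned above, which do cover a surprising share of these queries (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file

# Summary statistics for every numeric column:
# count, mean, std, min, quartiles, max.
print(df.describe())

# Frequency table for one column, most common value first.
print(df["Product"].value_counts())  # hypothetical column name
```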
I think you can combine his video with this one, czcams.com/video/6WE09Ihdn9M/video.html, to get around the plugin waiting-list problem.
Great stuff! Have you had any success using sklearn with this methodology?
Just wondering, why can you use gpt-4 as the model name?
Has anyone tried getting the agent to create graphs in, say, matplotlib? I'm getting an 'OutputParserException: Could not parse LLM output' error. I can do it using exec on Python code generated via normal chat completion, but not this way. Good vid, though.
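One workaround that's often suggested for this, assuming a LangChain version whose create_pandas_dataframe_agent accepts agent_executor_kwargs, is to let the executor feed parse failures back to the model instead of raising (file name is hypothetical):

```python
import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_pandas_dataframe_agent

df = pd.read_csv("sales.csv")  # hypothetical file
llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# handle_parsing_errors returns the malformed output to the model so it
# can retry, instead of raising OutputParserException immediately.
agent = create_pandas_dataframe_agent(
    llm,
    df,
    verbose=True,
    agent_executor_kwargs={"handle_parsing_errors": True},
)
agent.run("Plot monthly revenue with matplotlib")  # chart renders as a side effect
```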
Have you found anything similar but using SQL yet?
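Not covered on the channel as far as I know, but LangChain ships a SQL counterpart; a minimal sketch, with a placeholder SQLite URI (import paths vary by version):

```python
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

# Any SQLAlchemy-compatible connection string works; this one is a placeholder.
db = SQLDatabase.from_uri("sqlite:///orders.db")
llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# The chain writes a SQL query, runs it, and phrases the result in English.
chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
chain.run("Which product had the highest revenue last month?")
```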
Great video. But no one is going to work with this workflow.
I see an error in the last step: Must provide an 'engine' or 'deployment_id' parameter to create a
How do I do this with nested JSON instead of CSV?
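One common route is to flatten the nesting with pandas first and then hand the resulting dataframe to the agent exactly as in the video; a sketch with a made-up JSON shape:

```python
import json
import pandas as pd

# Hypothetical nested structure: one record per order, items nested inside.
raw = json.loads("""
[{"order_id": 1, "customer": {"name": "Ann"},
  "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}]
""")

# json_normalize explodes the record path into one row per item and keeps
# the listed metadata, turning nested dicts into dotted column names.
df = pd.json_normalize(raw, record_path="items",
                       meta=["order_id", ["customer", "name"]])
print(df)  # columns: sku, qty, order_id, customer.name
```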
There is a new package called PandasAI which does effectively the same thing in fewer lines of code. Under the hood, it is probably taking the same approach.
Nice, thanks. Will check it out.
@rabbitmetrics This dude is right. In my opinion this seems to be better than what LangChain currently offers through pandas_dataframe_agent. The behavior of the pandas dataframe agent is very inconsistent, especially when Action: print(python_repl_ast(...)) is called (I often get "is not a valid tool"). I imagine both are doing the same thing, with recursive calls to refine the dataframe operations being run and passed to the Python REPL. I am going to investigate the PandasAI documentation, as it seems much more straightforward and tractable for a non-contributor.
Where do I get the file with the code?
Does it work with any model other than OpenAI's models?
Yes. LangChain provides wrappers around various models; see python.langchain.com/en/latest/modules/models/llms/integrations.html
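A sketch of what that swap can look like with the Hugging Face Hub wrapper (the repo ID is just an example, not a recommendation, and it requires HUGGINGFACEHUB_API_TOKEN in the environment):

```python
import pandas as pd
from langchain.llms import HuggingFaceHub
from langchain.agents import create_pandas_dataframe_agent

df = pd.read_csv("sales.csv")  # hypothetical file

# Example model; any text-generation repo on the Hub can stand in.
llm = HuggingFaceHub(repo_id="google/flan-t5-xxl",
                     model_kwargs={"temperature": 0.1, "max_length": 512})

agent = create_pandas_dataframe_agent(llm, df, verbose=True)
agent.run("How many rows are in the dataframe?")
```

Fair warning: smaller open models often struggle to follow the agent's ReAct-style prompt, so expect more parsing errors than with GPT-4.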
Can you display the results in HTML tags?
Link to the notebook? Thanks
Can you please post the dataset?
LangChain charged me $7 in API calls in 30 minutes of testing because I forgot to specify a stop string :(
I got ‘False’ at the very beginning
Check if the keys are loaded using os.getenv('API_KEY')
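A False from load_dotenv() usually means no .env file was found; a quick way to debug the lookup (the key name mirrors the earlier sketch and is an assumption):

```python
import os
from dotenv import load_dotenv, find_dotenv

# find_dotenv returns the path it would load, or '' if nothing was found.
print(find_dotenv())

load_dotenv(find_dotenv())
print(os.getenv("OPENAI_API_KEY"))  # None means the variable still isn't set
```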
ImportError: cannot import name 'create_pandas_dataframe_agent' from 'langchain.agents'
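That error typically means a newer LangChain version: the agent was split out into the langchain-experimental package, so the import changes while the rest of the code stays the same (assuming a recent release):

```python
# pip install langchain-experimental
from langchain_experimental.agents import create_pandas_dataframe_agent
```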