Data Analysis with Python for Excel Users
Vložit
- čas přidán 25. 07. 2024
- A common task for scientists and engineers is to analyze data from an external source. By importing the data into Python, data analysis such as statistics, trending, or calculations can be made to synthesize the information into relevant and actionable information. See apmonitor.com/che263/index.php... Newer versions of Pandas dropped "ix". Replace line 12 "sensors = data_file.ix[:,'s1':'s4']" with the loc function with "sensors = data_file.loc[:,'s1':'s4']". The source code on the web-site is updated as well.
- Věda a technologie
>>Import Korean_Language as kl
>>comment = kl.convert(Thank you So much ^^.)
>>print( comment)
>> "정말 감사함니다 ^^. "
I actually tried to run this in python.
smh.
EXCELLENT! Of course, I have many many questions but it's up to me to dive into panda's documentation to find the answers.
Very clearly explained and I hope it will be helpful for many of us.Thanks for doing the video.
PYTHON BEGINNER: Excellent! It was explained so well that now at least i can do all the basic import, save, defining etc. Already tried with my data and feel so happy that after like 1 year now I can do some stuff :-)
if data_file.ix[:, 's1':'s4'] doesn't work. Try replacing it by data_file.loc[:, 's1', 's4']
Is there a better way of renaming the average column? I did it this way, which works:
>> list(result)
>>[time', 's1', 's2', 's3', 's4', 0]
>>result.columns=['time', 's1', 's2', 's3', 's4', 'average']
Nice job. Thank you for posting this.
Awesome! Thank you for the upload
That was really a useful video......Basics of Pandas for file handling........cool stuff
very clearly explained example.Thanks!
Thank you. This is excellent.
If there were 2 subscription buttons I'd hit them both.
Newfound love!
Thanks, Anno!
Great video, thank you so much.
You are a good teacher
Great video, thanks!
this has been of great help. thanks
Not sure if anyone ran into this issue, but Pandas need the openpyxl module in order to work with excel files.
You don't have to import it, just make sure it's installed in your environment:
pip install openpyxl
Excellent tip! Thanks for including this.
Best Jupyter tut
thanks for this video. quick question, this use case seems absurdly easy to do in excel. why might one use numpy/pandas for this in the real world?
+Cali Flower, Excel is probably the best tool if you are doing it one time and the data set is small. Excel is going to be difficult to use for large or complex data sets where a few lines of code in Python will do the same thing and much faster. If you need to repeat the same analysis on multiple data sets then Python is also a clear winner.
Excellent answer.
1. Automation and scalability.
2. Python and its packages are easily usable across various OS Platforms and most importantly linux.
3. You save MS office license costs.
I use DB of 900k lines with formulas, best is to not keep formula in each cell were are needed but insted only in the first row. Don't get me wrong, I will try to implement Py for future but for now this is what I'm doing.
Thanks for the video!
great explanation, really appreciated (y)
Thank you so much; prose and informative.
Darn it, where is that second like button !
Thanks for the positive feedback.
A tip : watch series on Flixzone. I've been using them for watching loads of movies during the lockdown.
@Jacob Roman yup, been using Flixzone} for since november myself :D
@Jacob Roman Yea, been watching on Flixzone} for months myself =)
@Jacob Roman Yup, have been using Flixzone} for months myself :)
Thank you so much for this
Thanks a lot for the information. Please suggest a way to assign a name to the 'avg_name' column in the saved excel/csv sheet.
result.columns.values[-1] = 'avg_name'
the ix function replacements are iloc() and loc() functions
Thanks for the tip!
No thanks for you it was very helpful me
Is there any vedio like this but for data visualization?
Here are basic tutorials on generating plots in Python: apmonitor.com/che263/index.php/Main/PythonPlots
Thanks-great help. At 9.30-9.40, my subtracted time data keeps coming out as 0. I presume this is due to rounding issues? (I'm using python repl). Is there any way to correct for this?
The numbers are different but I need to display more digits to see the difference. Here is one way to see the difference: print('{0:.15f}'.format(val)) where val is the floating point number.
I don't understand anything that you are talking about, but the script worked... so... cool :)
+Stuart Briscar, the same example in Numpy may be a little easier to understand. apmonitor.com/che263/index.php/Main/PythonDataAnalysis
Very good! Thanks
can we use for eg: content.head(3) to print 3 rows
Yes, that is correct.
you use np.mean for average, but how to do + - */ for specific data
You can do operations on data with single values x[2,4]*y[3] or on matrices with X@Y or X+Y.
Thank you very much!
Hey there, I was just trying to follow along here and the code entered in cell 5 throws an error for me 'DataFrame' object has no attribute 'ix'. Any idea why and how to fix and proceed. Sooooper new to this.
Try "sensors = data_file.loc[:, 's1':'s4']" instead. Pandas updated their code. Here is the full source code: apmonitor.com/che263/index.php/Main/PythonDataAnalysis?action=sourceblock&num=2 that you can find with more examples at apmonitor.com/che263/index.php/Main/PythonDataAnalysis
Does this work with Python 3? Just tried it w/ Pandas 1.0.1 and getting error on this line.
sensors = data_file.ix[:,'s1':'s4']
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in
----> 1 sensors = data_file.ix[:,'s1':'s4']
2 print(sensors[0:6])
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'ix'
Try "sensors = data_file.loc[:, 's1':'s4']" instead. Here is the complete code that works with later versions of Pandas: apmonitor.com/che263/index.php/Main/PythonDataAnalysis?action=sourceblock&num=2
sorry i am a begginer, id like to learn more and this routine, tryed to follow and had this message: "FileNotFoundError: File b'data_with_headers.csv' does not exist"
i think i should extract the files in a certain diretory, could you guys help me out?
You can download the data_with_headers.csv file from the zipped archive at apmonitor.com/che263/uploads/Main/python_with_pandas.zip - don't forget to extract the folder (right click...extract to...).
How do you know to use num py when doing the average? Do all mathematical computations of your data require a reference of num py?
Numpy is a very common package for data analysis in Python. It is considered a base package that is also used by many other packages for mathematical calculations, such as calculating averages. There is also Scipy, Pandas and others: apmonitor.com/che263/index.php/Main/PythonDataAnalysis
Thank you!
Thanks a lot!
Kudos To The Master!!!!!!
You have to press SHIFT+ENTER to activate each line of code (for example at 5:00)
+John2000, thanks for the tip on the shortcut key combination. CTRL+ENTER is another one. One activates and gives a new cell below while the other just activates.
APMonitor.com,thank you. I didn't know about CTRL+ENTER. Very useful!
@APMonitor.com what is the software or device you are using for handwriting.. looks pretty sleek
Here is one of the devices that I use: czcams.com/video/YLRVZXedSlc/video.html I created that video a while back and have newer computers but the idea is the same.
Thank you so much , you are awesome
thank you sir...i am going into python but still thinking if web or data science which will help me...am a computer science student ..thank you
Most of my background is in data science - you can start with my course at apmonitor.com/che263 or check out some courses at Coursera or Udemy. Web programming is also valuable but does require a different skill set.
great video! could you do one for tensorflow? subbed!
Here is content on TensorFlow: apmonitor.com/do/index.php/Main/DeepLearning
Hi. Would this work for Python 3.x also? Or are there any changes
Yes, this should work. The only minor change between Python 2.7 and Python 3+ is to use parenthesis for the print statements.
thanks
So how do I go about creating a header for a column of data that was computed with python? For instance when you had avg_row, the header by default is 0, but how can you change that to something like "Average"
+Kevin Le, you can either write the header and then append to it or else use the header argument. There is some help here: docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html
Hi Kevin, I did it like this:
result = result.rename(columns={'time': 'Time', 's1':'Sensor 1', 's2':'Sensor 2','s3':'Sensor 3', 's4':'Sensor 4', 0: 'Average'})
You can name just one of the columns or all of the columns with this method
Thanks for this response - I just realized that my previous response was for the NumPy version of this video at czcams.com/video/Tq6rCWPdXoQ/video.html Thanks for providing the appropriate response for the Pandas version.
How do i import and work the excel file in Python not in Jupiter Notebook?
It is the same code as in the Python notebook but you just need to save a text file as myScript.py and run it with "python myScript.py".
Hi, I cannot import csv file from my desktop. Do I need to save the csv file in a particular location? I am using enthought canopy code editor.
Save the csv file in the same directory location as your script file. You may also try installing Python 3.7 if you want to try another distribution. czcams.com/video/bXWlyOMYpRE/video.html
Thanks Sir @@apm it's work :)
Excellent
master!!
Could someone why time - time[0] reset the time back to 0? I'm confused as why this resets the time instead of the column just being 0
+A Patel, the value of time[0] is a single value from the very first row. It subtracts this value from all of the other time values so that the time sequence starts at zero and not the other large value at the beginning of the original data file. Python subtracts the time[0] from time[0], time[1], time[2], .... to the end and returns a new vector that starts at zero. Let me know if this isn't clear.
This is simply called standardization technique. You take 1st value as a standard value and other values are djusted according.
Here 1st values i.e. time[0] is taken as standard value and subtracted from each value of 'time' column for adjustment.
Hence the line is
time = time - time[0]
Hope you understood! :)
How can I insert instead of a column instead of a row at the bottom of the last column thanks
You can insert a new column with:
df.insert(col_num, "new_col_name", data)
You could use pandas.concat() or DataFrame.append() to add a new row.
I'm a beginner, I try to rewrite your codes in pycharm, but after 07:03 all of your codes cannot be runned by pycharm!
I cannot get the same results, is it possible to rewrite your codes in pycharm?
Could you download the source files from apmonitor.com/che263/uploads/Main/python_with_pandas.zip - make sure you unzip the files first before running the script. You may be missing the data file. Additional methods and code are available at apmonitor.com/che263/index.php/Main/PythonDataAnalysis
Great video!
+John V, thanks!
赞一个
Yesterday I installed jupyter. The nymph and panda packages must be installed separately, right?
Yes, packages are installed separately. However, if you pip install a package that has a dependency then the dependency will be automatically installed as well.
Thanks
Can I replace line [8] and [9] by
my_data = pd.DataFrame(time, sensors, avg_row)
Yes, that syntax also works. You would want to assign it to the variable result, however:
result = pd.DataFrame(time,sensors,avg_row)
because that is used later. You can download the source code from apmonitor.com/che263/uploads/Main/python_with_pandas.zip
APMonitor.com ya correct sir :)
Thank you
how do we remove particular emails which are containing with in the same column, I have tried a lot, but nothing comes up, could you suggest me?
I have to remove the singe domain "python@example.com" here @example.com I need to remove from my mails column
Will all of the lines of code work in Python 3?
Yes, this should also work in Python 3.
Amazing!!
Great
Nice
NameError Traceback (most recent call last)
in
1 # load data file
----> 2 data_file = pd.read_cav('data_with_headers.cav')
NameError: name 'pd' is not defined
You need to import pandas first as:
import pandas as pd
Also make sure that the file extension is correct. Should it be .csv instead of .cav?
thx
Hi sir,
How to plot pdf and cdf in ipython notebook
Here is some help on plotting a CDF: stackoverflow.com/questions/9378420/how-to-plot-cdf-in-matplotlib-in-python
cant find the source code plz upload it agian in your site so we can code along with you thanks for video
here it is: apmonitor.com/che263/index.php/Main/PythonDataAnalysis
6:47 My sub-woofer went crazy here
My headphones don't do the same but I definitely hear some sort of low frequency impact in the background.
How can we upload .gz file?
You can use the gzip package to extract the compressed files in Python: docs.python.org/3/library/gzip.html
best
If i want to read .txt file can I use the same pd.read_csv.Also i have a criteria like below:
REGION SITE_NAME FT_SHIPPED_LAST_HOUR READY_TO_SHIP TOTAL_FT_BACKLOG FT_BKLG_NEXT_CPT FT_BKLG_NEXT_TO_NEXT_CPT SNAPSHOT_DATETIME
North QNAA 77 24 33 8 25 2016-08-04 10:00:00
North QNAB 13 0 3 3 0 2016-08-04 10:00:00
North QNAC 0 0 0 0 0 2016-08-04 11:00:00
North QNAD 0 15 0 0 0 2016-08-04 12:00:00
If i want to select only the data which is of 10 am data how can we do that.
yes, you can read strings, headers, numbers, and dates with Pandas. Pandas is like Microsoft Excel for Python but with scripting features to process your data. Once you've imported the data, you can slice it to get a subset such as 10am data.
yes i have done by this
import datetime
import numpy as np
import pandas as pd
data_file = pd.read_csv('D_HRLY_FT_DATA.txt',sep='\t',index_col=False)
date_time = datetime.datetime.strptime('2016-08-04 11:00:00', "%Y-%m-%d %H:%M:%S")
date_time1 = datetime.datetime.strptime('2016-08-04 12:00:00', "%Y-%m-%d %H:%M:%S")
dff = data_file[data_file.SNAPSHOT_DATETIME==str(date_time)]
dff1 = data_file[data_file.SNAPSHOT_DATETIME==str(date_time1)]
now dff is having data for 11 and dff1 is having data for 12 but what if i want to compare the data from 11 am to 12pm and do some calculation
You can index and select data in Pandas. Here are some examples: pandas.pydata.org/pandas-docs/stable/indexing.html I also like to use Numpy for array slicing. Here is a brief tutorial: czcams.com/video/mOZ0UCeuRX4/video.html
Thaks so much!
I keep getting the following error:
AttributeError: module 'pandas' has no attribute 'read. is this because my csv is not in the same place as the model that i am running and how do i fix it?
Make sure it is 'read_csv' with an underscore and not 'read csv' with a space. Here is some additional help if that doesn't work: stackoverflow.com/questions/40554657/module-pandas-has-no-attribute-read-csv
Perfect. thank you and great work!
Hi nice video, really amazing the way you teach. Anyway I keep getting this error:
Traceback (most recent call last):
File "C:\Users\JOSE\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\__init__.py", line 26, in
from pandas._libs import (hashtable as _hashtable,
File "C:\Users\JOSE\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\_libs\__init__.py", line 4, in
from .tslib import iNaT, NaT, Timestamp, Timedelta, OutOfBoundsDatetime
File "pandas\_libs\tslib.pyx", line 67, in init pandas._libs.tslib
ImportError: DLL load failed: No se encontró el proceso especificado.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "import_with_pandas.py", line 3, in
import pandas as pd
File "C:\Users\JOSE\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\__init__.py", line 35, in
"the C extensions first.".format(module))
ImportError: C extension: DLL load failed: No se encontró el proceso especificado. not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.
And I do not see how to solve this, I hope you can give me some ideas, thanks
It looks like there is a problem with your Pandas installation. You may have downloaded the source of Pandas but have not compiled it. I recommend that you use "pip install pandas" instead of trying to compile it yourself. There is additional information on how to pip install a package at apmonitor.com/pdc/index.php/Main/InstallPython (see 3rd video). You may consider uninstalling Python and start over with either an Anaconda distribution or else the Python.org distribution.
Thanks it worked, although I had to do the installation all over again but this time I did it with anaconda after I uninstalled all python packages. Cheers
Hacked i guess
?