Refactoring A Data Science Project Part 1 - Abstraction and Composition

  • Added on 27. 07. 2024
  • This is the first part of a 3-part miniseries in which I refactor a hand-written digit recognition data science project based on the MNIST dataset to improve the software design so it's easier to reuse and adapt. In this first part I cover using abstract classes and protocols to better separate the various aspects of the application, and I talk about function composition as a generic solution to dealing with data pipelines.
    Thanks to Mark Todisco for helping out with preparing the example. The code I worked on in this video is available here: github.com/ArjanCodes/2021-da....
    Links to PyTorch and scikit-learn functional composition tools:
    - pytorch.org/docs/stable/gener...
    - scikit-learn.org/stable/modul...
    Part 1: • Refactoring A Data Sci...
    Part 2: • Refactoring A Data Sci...
    💡 Here's my FREE 7-step guide to help you consistently design great software: arjancodes.com/designguide.
    🚀 If you want to take a quantum leap in your software development career, check out my course The Software Designer Mindset: www.arjancodes.com/mindset.
    🎓 Courses:
    The Software Designer Mindset: www.arjancodes.com/mindset
    The Software Designer Mindset Team Packages: www.arjancodes.com/sas
    The Software Architect Mindset: Pre-register now! www.arjancodes.com/architect
    Next Level Python: Become a Python Expert: www.arjancodes.com/next-level...
    The 30-Day Design Challenge: www.arjancodes.com/30ddc
    🛒 GEAR & RECOMMENDED BOOKS: kit.co/arjancodes.
    💬 Join my Discord server here: discord.arjan.codes
    🐦Twitter: / arjancodes
    🌍LinkedIn: / arjancodes
    🕵Facebook: / arjancodes
    👀 Channel code reviewer board:
    - Yoriz
    - Ryan Laursen
    - Sybren A. Stüvel
    🔖 Chapters:
    0:00 Intro
    1:29 Explaining the code
    6:41 About data science
    7:35 Separating experiment tracking from the rest of the code
    16:52 Improving data type consistency
    19:44 Improving the way variables are handled
    22:26 About function composition
    29:03 Final thoughts
    #arjancodes #softwaredesign #python
    DISCLAIMER - The links in this description might be affiliate links. If you purchase a product or service through one of those links, I may receive a small commission. There is no additional charge to you. Thanks for supporting my channel so I can continue to provide you with free content each week!

Comments • 217

  • @deez_gainz
    @deez_gainz 2 years ago +270

    I think it's the data science, natural science and non-IT engineering people who would actually benefit the most from your software-design-centric videos. I'm one of them, and we literally code spaghetti on a daily basis without ever being taught the SOLID principles =). Thanks, you're making those who listen better!

    • @althayrL
      @althayrL 2 years ago +21

      I'm a professional data scientist and I've been following the channel since the beginning. It was essential to me in learning to be a better software engineer, even though software engineering is not my main job requirement but my everyday tool...

    • @selimrbd
      @selimrbd 2 years ago +9

      Same here, data scientist greatly benefitting from this channel

    • @TheMightyOprah
      @TheMightyOprah 2 years ago +6

      Agreed - working as a data scientist who is proficient in data wrangling, ML, etc., but definitely lacking in solid software development principles, more videos like these would help me a ton!

    • @ArjanCodes
      @ArjanCodes 2 years ago +35

      Thanks! It’s definitely an area I’d like to do more videos on in the future.

    • @Jordan-bi4tn
      @Jordan-bi4tn 2 years ago +4

      Same, very happy to see Arjan covering this topic, as it's what I was looking for a few months ago when I first discovered his channel

  • @ZaneSelvans
    @ZaneSelvans 2 years ago +21

    Yes PLEASE do more videos like this at the intersection of data science / ETL pipelines and software engineering. It's extremely helpful for those of us who have come into building software from another adjacent field and are now struggling with big messes of our own making :)

  • @tunapedia
    @tunapedia 2 years ago +13

    I am a senior data scientist, and I benefit from all your videos. Building architecture, productionizing and scaling up ML models is challenging. It requires good software engineering practices and a good understanding of the full software development stack. Good work as usual Arjan.

    • @ArjanCodes
      @ArjanCodes 2 years ago

      Thank you, glad you liked it!

    • @DanielTobi00
      @DanielTobi00 7 months ago

      Hello Tunapedia,
      I came across your insightful comments on this video. I'm currently deepening my skills in data science and recently secured second place in an NLP competition on Zindi. I admire your expertise and would appreciate any guidance or insights you can provide on potential job opportunities in the field.
      Thank you.

  • @mhmdjouni3669
    @mhmdjouni3669 2 years ago +7

    I'm a data scientist and machine learning researcher, and looking into code design and refactoring from your perspective is very helpful for me in terms of coding! Thanks a lot

  • @loumote
    @loumote 2 years ago +3

    The "Unsatisfying cliffhanger" is me realizing I now have to go through a lot of refactoring because I've done these lazy single-variable function chains waaay too much... Great job as always, thank you Arjan!

  • @shopsmartin5851
    @shopsmartin5851 2 years ago +16

    All the data science programming I've ever seen is usually written for a one-off experiment with very few principles applied, whether SOLID or reproducibility. The code is often not object-oriented and is more functional, written as declarative linear steps in one script. Even the code you are starting with is in better shape. I'll be watching for sure to see these software development principles applied to that sort of programming style.

  • @VikasGuptacherie
    @VikasGuptacherie 2 years ago +2

    I really liked this novel method of "Code Refactoring" & "Code Roast" to look at things from the perspective of software best practices and see how to correct these common mistakes. I would like to see more such videos.

  • @visualapproach7155
    @visualapproach7155 2 years ago

    I love this refactoring series. So informative. Thanks, not only to Arjan, but to the people who submit their code to literally be picked apart and rebuilt.

  • @michaelt6922
    @michaelt6922 2 years ago +4

    Thank you for your content Arjan. I have intermediate Python skills but have been learning a lot from your refactoring videos. Moving to OOP for my projects has been a steep but rewarding curve. Thanks again!

  • @sai1921
    @sai1921 2 years ago

    I'm a simple man. I see Arjan post, I hit like button. As a DS student, this actually helps a bunch. Thanks brother!

  • @Michallote
    @Michallote 1 year ago

    Arjan, I'm in awe of the ease with which you rework things just by looking at them. And it works every time! I just recently followed all your advice in a program I'm developing and it took me a day just to get the thing running again in the new format. We are incredibly lucky to have you teaching us this stuff. Most courses will repeat the design principles over and over, but getting to see them applied so naturally really makes them stick. Thank you so much!

  • @anzei331
    @anzei331 2 years ago +2

    Great vid, was looking forward to this for a while since you mentioned on Reddit that you had plans to get into ML/DS from software engineering perspective. Much better to refactor a project which is a real world scenario, rather than simple hypothetical examples which are abundant.

  • @joaopedrorocha5693
    @joaopedrorocha5693 1 year ago

    This compose helper function is a gold nugget. I think it should go into the functools module so we could simply import it. The idea is so intuitive that it wouldn't be a problem if it wasn't explicitly defined in the codebase.

  • @gregorybutcher2647
    @gregorybutcher2647 2 years ago

    How on earth does this man not have more subscribers? Most people would benefit, so it's their problem if they don't watch these, lmao. I'm just glad I'm one of the first to hear his wisdom.

  • @amir3515
    @amir3515 2 years ago

    Very stimulating and educational video. Love the pace. Thank you.

  • @anelm.5127
    @anelm.5127 2 years ago +6

    I learned the most from your refactoring videos. I really enjoy them. Seeing the SOLID principles in practice especially made them super easy to understand.

  • @1oglop1
    @1oglop1 2 years ago +1

    I love this. This video and the comments save me a lot of time returning code reviews to data people over and over! Now I can just send them here to explain what is not spaghetti!

  • @programmertheory
    @programmertheory 2 years ago +1

    I remember dealing with MNIST data sets in college when I was learning Machine Learning. I was taking an OOP course at the same time, and my first ML (Machine Learning) assignment was a single-layered neural network with 10 perceptrons. Even though I went object-oriented with the assignment, it took forever to go through the training data and testing data, 12+ hours in total runtime. It wasn't that accurate either, around 75-80%. However, I redid the assignment, abandoning most, if not all, OOP principles and going towards something more procedural and mathematical (linear algebra, to be precise). There was a huge difference in my experience. The code was easier to read, easier to understand, and a lot faster: it went through the training and testing data in less than 1 second and reached 92-96% accuracy.

  • @aliwelchoo
    @aliwelchoo 2 years ago +1

    As a data scientist that was already watching your content, definitely looking forward to this series!

  • @coert
    @coert 2 years ago

    Once again, excellent stuff Arjan. Definitely going to work with the function composition!

  • @DrPizza92
    @DrPizza92 2 years ago

    I’m a JS guy but have learned so much from watching your videos. Thanks!

  • @xxshogunflames
    @xxshogunflames 2 years ago +1

    Looking forward to part two! Learned a lot and will be rewatching

    • @ArjanCodes
      @ArjanCodes 2 years ago +1

      Thanks Jonathan, glad you liked it!

  • @ShaderKite
    @ShaderKite 2 years ago +1

    I'm loving it! Please continue doing videos like this one :D
    I'm learning a lot from it. Your videos are among the most valuable/useful ones I've seen for Python or software design in general

  • @sdar1988
    @sdar1988 2 years ago +3

    I always used coding as a tool to test my hypotheses. Your videos put into perspective why and how writing code is much more than that. I am not a trained software engineer but, professionally, a data scientist. I feel your videos are really helping me fill glaring gaps in the software design process while conceiving my data projects, and this is important for the data science community, as most of us do not come from a software engineering background. Please make more videos in this series.
    Godspeed.

    • @ArjanCodes
      @ArjanCodes 2 years ago +1

      Hi Arjun, thank you, I'll definitely continue in this direction. I think there are a lot of things to cover, so stay tuned!

  • @leif_p
    @leif_p 2 years ago +4

    Worth pointing out that both sklearn's Pipeline and torch's Sequential compose _classes_ satisfying certain interfaces and return _classes_ (with possibly different capabilities). Which is a bit more complicated than function composition, but usually necessary in real-world situations where the aggregate process needs more capabilities than just being Callable.
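
A minimal sketch of the distinction made in this comment, written in plain Python rather than with scikit-learn or PyTorch. The names `Transformer`, `Center`, `Scale` and `SimplePipeline` are hypothetical and are not either library's API. The point it illustrates is that the composite is itself an object with the same interface as its parts, so it can carry learned state that a plain composed callable could not.

```python
from typing import List, Protocol, Sequence


class Transformer(Protocol):
    """Hypothetical interface: anything with fit() and transform()."""

    def fit(self, data: Sequence[float]) -> None: ...

    def transform(self, data: Sequence[float]) -> List[float]: ...


class Center:
    """Learns the mean during fit and subtracts it during transform."""

    def __init__(self) -> None:
        self.mean = 0.0

    def fit(self, data: Sequence[float]) -> None:
        self.mean = sum(data) / len(data)

    def transform(self, data: Sequence[float]) -> List[float]:
        return [x - self.mean for x in data]


class Scale:
    """Stateless step that multiplies every value by a fixed factor."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def fit(self, data: Sequence[float]) -> None:
        pass  # nothing to learn

    def transform(self, data: Sequence[float]) -> List[float]:
        return [x * self.factor for x in data]


class SimplePipeline:
    """Composes transformers and is itself a transformer."""

    def __init__(self, steps: Sequence[Transformer]) -> None:
        self.steps = list(steps)

    def fit(self, data: Sequence[float]) -> None:
        # Each step is fitted on the output of the previous steps.
        for step in self.steps:
            step.fit(data)
            data = step.transform(data)

    def transform(self, data: Sequence[float]) -> List[float]:
        result = list(data)
        for step in self.steps:
            result = step.transform(result)
        return result


pipeline = SimplePipeline([Center(), Scale(2.0)])
pipeline.fit([1.0, 2.0, 3.0])      # Center learns a mean of 2.0
out = pipeline.transform([4.0])    # (4.0 - 2.0) * 2.0
```

Because `SimplePipeline` satisfies the same interface as its steps, pipelines can be nested inside other pipelines, which is the extra capability beyond "just being Callable" that the comment refers to.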

  • @alchemication
    @alchemication 2 years ago +3

    This is actually what I do at work - working in a Data Science team as a Software Engineer with some prior ML knowledge. I have to tell you that the code you received for refactoring here is actually what I would consider a state of the art design ;- ) No offence to Data Scientists, I totally understand how complex their world is!! Hopefully as the discipline matures a bit more, and sadly more projects fail due to quick & dirty solutions - we will be all in a better place. Thank you for your work.

    • @ArjanCodes
      @ArjanCodes 2 years ago

      You're most welcome and I absolutely agree with you - data science is a very complex field and it makes total sense that data science education programs have to spend all their time on data science concepts, leaving little room for software engineering practices!

  • @MCRuCr
    @MCRuCr 2 years ago +93

    You shouldn't make pure data science/machine learning content, because there is already plenty of that.
    A sort of "Software design for data scientists [Dummies]" could be a great contribution!

    • @TheMightyOprah
      @TheMightyOprah 2 years ago +14

      100% agree with a series on Software Design for Data Scientists!

    • @ArjanCodes
      @ArjanCodes 2 years ago +29

      I agree - I also wouldn't feel very comfortable doing pure data science / ML stuff since that's not my main area of expertise. But I'll definitely think more about how design principles and patterns can be used in this setting!

    • @sergeiparshin9488
      @sergeiparshin9488 2 years ago

      @peterdowdy174 Kedro could probably be useful for combining the notebook and the code itself.
      P.S. Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code

    • @alchemication
      @alchemication 2 years ago +2

      @peterdowdy174 Hey Peter, I have been struggling with this topic for a few years and ended up here: notebooks are great for local/quick/dirty experiments, but not for proper, production-grade code. For many, many reasons... Once I accepted this, my life became a happier place ;) Greetings and all the best!

    • @alonyariv8999
      @alonyariv8999 2 years ago

      Yes please, that is such important content to have

  • @niklase5901
    @niklase5901 2 years ago +1

    I am really interested in design for data science applications. I used to be a programmer but did other things for many years; the reason I am back in programming is data science. But I find that practices I am used to from programming applications are lacking in the world of data science. So this is a great one!

  • @Tobbzn
    @Tobbzn 2 years ago +46

    Some feedback: While seeing your face is always a bright point of any day, I still felt that you would often cut to a fullscreen camera view of yourself while talking about the code you just cut away from, which made it a bit hard to follow the structure of the code.
    Like, at 3:10 you said "You can see this happening here" during a cut where we literally can't see it happening, which caused a weird disconnect in my brain where I felt like I had to switch gears with each cut, trying to take in as much information as possible before the next cut would interrupt the reading.
    It's an interesting video, but these cuts made it hard to follow.

    • @cristopherfreitas762
      @cristopherfreitas762 2 years ago +1

      I totally agree with this.

    • @ArjanCodes
      @ArjanCodes 2 years ago +21

      Yes, I also noticed this a bit too late. Will make sure this is better in the next videos.

    • @BBB-zy6er
      @BBB-zy6er 2 years ago +3

      @ArjanCodes Your other videos, editing-wise, have excellent pace and I don't notice the cuts at all, making it easy to follow along. This one felt like the cat was standing on the "cut" key.

    • @ArjanCodes
      @ArjanCodes 2 years ago +3

      Haha, I did start working with a cat (read: video editor ;) ) a few weeks ago. It's clear we still need to fix a few things in the process, but I'm on it.

    • @leestoddart7014
      @leestoddart7014 2 years ago

      Absolutely, this was really stopping me from understanding the process. Stay in the small box if you are talking about the specific code

  • @Bakobiibizo
    @Bakobiibizo 1 year ago

    maybe not when this came out, but now is a helluva time to start doing data science material

  • @kevon217
    @kevon217 1 year ago

    Really cool compose function. Going to use that.

  • @iliqnew
    @iliqnew 2 years ago

    Once more. A very useful and nice video! Thank you!

  • @MichaelTVickers
    @MichaelTVickers 2 years ago +1

    I've been hunting for a nice way to do function composition in standard-library Python for a while and this version with type hints is 👍

  • @jessehalliday2948
    @jessehalliday2948 2 years ago

    I just love watching you delete lines of code, keep up the great and informative videos

  • @jeancerrien3016
    @jeancerrien3016 2 years ago

    Wonderful video! 🙏
    Among many other things, you've shown me three nice ways to compose a sequence of functions:
    1) with a torch network
    2) with a scikit-learn pipeline
    3) with functools.reduce
    I agree the third is very attractive. Some may find it a bit strange that the order of the functions switches, but that's not a defect in my eyes.
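
A minimal sketch of the third option, assuming a `compose` helper built on `functools.reduce`. The helper is reconstructed here from the description in the comments and may differ in detail from the version in the video. `add_three` and `mul_two` are illustrative functions borrowed from another comment in this thread.

```python
from functools import reduce
from typing import Callable

ComposableFunction = Callable[[float], float]


def compose(*functions: ComposableFunction) -> ComposableFunction:
    """Chain single-argument functions, applying them left to right.

    Note: reduce() without an initial value raises TypeError when no
    functions are passed, so at least one function is required.
    """
    return reduce(lambda f, g: lambda x: g(f(x)), functions)


def add_three(x: float) -> float:
    return x + 3


def mul_two(x: float) -> float:
    return x * 2


# Nested calls read inside-out: add_three runs first, mul_two runs last.
nested = mul_two(mul_two(add_three(add_three(12))))

# The composed version lists the same steps in execution order.
pipeline = compose(add_three, add_three, mul_two, mul_two)
composed = pipeline(12)
```

This is the order switch mentioned above: the nested form names the last step first, while `compose` takes its arguments in the order they run. Both expressions evaluate to 72.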

  • @benjaminthorand9569
    @benjaminthorand9569 1 year ago

    PLEASE give us more from just this very content! Awesome videos, going to spread the word! : ]

  • @pawelkubik
    @pawelkubik 2 years ago +10

    It's worth pointing out that those single-variable function calls are often preferred, because network composition is rarely purely sequential. In general, it is a DAG. For experimenting, it's important to be able to quickly access intermediate results of the network, and a chain of calls makes that much easier. In practice it's more important to detect repeatable and meaningful patterns in the network and split them into separate classes, e.g. a network may consist of a sequence of 12 layers, but it could be conceptually easier to view it as a sequence of 4 blocks of 3 layers each.
    tl;dr - don't refactor out all single-variable function calls right away

    • @ArjanCodes
      @ArjanCodes 2 years ago +1

      Good to know, thanks!

    • @pawelkubik
      @pawelkubik 2 years ago +1

      In my experience, almost every new ML engineer starts the journey by solving a very simple problem like classification and implements a kind of "Trainer" object. There is a lot of inversion of control to adjust certain parts of the experiments. It seems like a stable framework, but it collapses pretty quickly when they try to do something more complicated.

    • @pawelkubik
      @pawelkubik 2 years ago +1

      There are a few popular frameworks that approach this a bit more maturely. I think it would be interesting to see an analysis and comparison of libraries like Keras, Ignite and PyTorch Lightning from the perspective of an experienced programmer. They all invent some kind of callback or hook mechanism to control data loading and model training.

  • @iliqnew
    @iliqnew 2 years ago

    Yes please! More of these

  • @astronemir
    @astronemir 2 years ago

    Hi Arjan, I’m an astronomer learning to code more properly, and I work exactly with code like this often. This was so unbelievably helpful. Thank you for starting this series and I’m looking forward to more like it.
    It’s difficult to prototype things in a Jupyter notebook, get it running, then refactor to something shareable and useable and understandable by others that may need to work with it. You’re teaching me a lot, keep it up!

    • @joaopedrorocha5693
      @joaopedrorocha5693 1 year ago

      I'm a proto-astronomer, going through the same process as you :D

  • @AbhirupMishra
    @AbhirupMishra 2 years ago +4

    I really loved this video. I work in Quantitative Finance, where we have to write a lot of code (usually in a scientific programming language, a.k.a. Python), and I've benefited a lot from these videos. A lot of the code that I've encountered is spaghetti code, and just starting to think about solving problems with good design principles has really helped increase the flexibility, maintainability and readability of my code. I always look forward to watching these videos! Hopefully, you'll cover more advanced Python topics and system design in the future.

    • @ArjanCodes
      @ArjanCodes 2 years ago +1

      Thanks, I'll definitely do more videos like this in the future!

  • @drhilm
    @drhilm 2 years ago

    I wish I had seen this video two years ago. I write this kind of project all the time. I learned the hard way to do it like that.

  • @MateuszModrzejewski
    @MateuszModrzejewski 2 years ago

    Fantastic video, I'm eager to watch the two next parts. From my PhD studies in AI I can tell the majority of research code in ML and AI is terribly written and barely readable, even with published works. The guidelines for clean ML code are just starting to emerge and at times I feel there's even more confusing ML config / scheduling / architecture tools released every day than confusing JS frontend tools (and there's a JS framework released almost every day lol). Good to see plain old good design being used in this context. Content like this is VERY valuable, hope to see more ML refactoring videos! All the best!

    • @ArjanCodes
      @ArjanCodes 2 years ago +1

      Thanks and glad to hear you enjoyed the video! Let me know what you think of the other two. I'll certainly revisit more data science oriented content focused on design. Doing this miniseries was a lot of fun.

    • @MateuszModrzejewski
      @MateuszModrzejewski 2 years ago

      @ArjanCodes So I've already watched the other two and really enjoyed them as well. Very clean, understandable and applicable approach, and I think your channel really nicely fills a gap in intermediate to advanced programming topics. I really appreciate the references to Dijkstra, Hoare, SOLID, GRASP etc.; it's super rare to see that on YT. I've also watched your Hydra video and I really like how it complements this miniseries. Hydra is getting lots of interest in the community these days. Another tool that's growing in popularity and could also be interesting for a future video is PyTorch Lightning: it introduces an opinionated design into PyTorch and also aims to clean up some of the clutter that can be found in 90% of AI code.

  • @ingovb6155
    @ingovb6155 1 year ago

    Thanks for making this (and similar) videos. They are very helpful and insightful

    • @ArjanCodes
      @ArjanCodes 1 year ago

      Thank you Ingo, glad you liked the video!

  • @AdeelEjaz
    @AdeelEjaz 2 years ago

    Really good video, very well explained, and I can see in comments below you have noted the jump cuts away from code. Really will make the video perfect! Thank you

  • @DistortedV12
    @DistortedV12 2 years ago

    Okay this video is gonna blow up imo

  • @red_cape.
    @red_cape. 2 years ago +7

    I'm a newb in Python, and being experienced in other languages, it is hard to flip the switch to a new one. Arjan's videos have been crucial to my understanding of the "Pythonic" way. Thanks man! Keep em coming ... I don't know if it is your focus here, but I would love to see you talk about a project using PyQt5 ;)

    • @ArjanCodes
      @ArjanCodes 2 years ago +3

      Thank you, glad you like the videos and good topic suggestion!

  • @brunosompreee
    @brunosompreee 1 year ago

    Thanks! I'm a Data Engineer and this helps a lot!

    • @ArjanCodes
      @ArjanCodes 1 year ago

      Thanks so much Bruno, glad it was helpful!

  • @ilyaster42
    @ilyaster42 2 years ago

    That's a great video! Thanks a lot!

  • @SupernovaGiacomo
    @SupernovaGiacomo 2 years ago

    Wow, thanks Senpai! Will definitely share on my LinkedIn and with my data engineering team

  • @tonyli7014
    @tonyli7014 2 years ago

    Great topic!

  • @supratikchowdhury2107
    @supratikchowdhury2107 2 years ago

    Yes to more Data Science!

  • @garrywreck4291
    @garrywreck4291 2 years ago

    Great video!
    IMHO, a simple loop over a list of functions is much easier and more readable:
    x = 12
    for func in (add_three, add_three, mul_two, mul_two):
        x = func(x)

  • @sergioquijanorey7426
    @sergioquijanorey7426 2 years ago +2

    Really nice video. When working with ML / DS problems, I always end up using ugly designs / hacks that get the job done. And then refactoring is such a pain. Thank you for this advice :D

    • @ArjanCodes
      @ArjanCodes 2 years ago

      Thank you Sergio, glad you liked it!

  • @Glitchiz57
    @Glitchiz57 2 years ago

    Great video, thanks!
    See you next week

  • @_shikh4r_
    @_shikh4r_ 2 years ago

    I'm taking notes 📝

  • @marwensallem1397
    @marwensallem1397 2 years ago +4

    Nice video 😊 Hope it reaches all my data scientist colleagues.
    There are many similarities between machine learning projects, which makes me wonder why there are no custom design patterns for ML projects?

    • @ArjanCodes
      @ArjanCodes 2 years ago +3

      Thanks! I'll try to come up with a few ideas for this and cover that in future videos.

  • @zeki7540
    @zeki7540 2 years ago

    Thanks Arjan!!

    • @ArjanCodes
      @ArjanCodes 2 years ago

      You're welcome Zeki, glad you liked the video!

  • @doublegdog
    @doublegdog 2 years ago

    Great video. What do you think of folder refactoring? In some repos, I have seen people putting files/classes in a separate folder called "commons" for utility files that are used agnostically across the project. I think this would be a great idea to touch on in a future video. Nonetheless, the best Python videos on YouTube, hands down! Keep up the great content!

  • @vladimirtchuiev2218
    @vladimirtchuiev2218 11 months ago

    This looks more like a deep-learning project than a data-science one (using Torch and Tensorboard to follow the network training, instead of something like Pandas), which is actually exactly what I need right now. I work a lot with PyTorch and PyTorch Lightning and I'm looking to improve my code.
    The issue that I have with torch.nn.Sequential is that it's annoying to debug when you have an error in your network-building Lego, but if you are sure that the Lego is correct, it is cleaner to use Sequential.

  • @kobebyrant9483
    @kobebyrant9483 1 year ago +1

    Function composition is really cool and makes the code very concise and clean. However, I feel like we achieve it at the cost of readability, and it also makes it hard to debug intermediate calculations/steps if you suspect something is wrong (in reality this happens very often when there is a lot of math involved in the code). Some (picky) managers might not like it during code review/pull requests for the reasons stated.

    • @greatfate
      @greatfate 1 year ago

      Exactly what I was thinking
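
One possible way to soften the debugging concern raised in this thread, sketched under assumptions: `traced` is a hypothetical wrapper (not from the video) that records each step's output, and `compose` is the reduce-based helper discussed elsewhere in the comments.

```python
from functools import reduce
from typing import Any, Callable, List, Tuple

Step = Callable[[Any], Any]


def compose(*functions: Step) -> Step:
    """Chain single-argument functions, applying them left to right."""
    return reduce(lambda f, g: lambda x: g(f(x)), functions)


def traced(func: Step, log: List[Tuple[str, Any]]) -> Step:
    """Hypothetical helper: record a step's output before passing it on."""

    def wrapper(x: Any) -> Any:
        result = func(x)
        log.append((func.__name__, result))
        return result

    return wrapper


def add_three(x: float) -> float:
    return x + 3


def mul_two(x: float) -> float:
    return x * 2


log: List[Tuple[str, Any]] = []
pipeline = compose(*(traced(f, log) for f in (add_three, mul_two)))
result = pipeline(12)
# log now holds one (step name, intermediate value) pair per step,
# so a suspicious intermediate result can be inspected after the run.
```

The trade-off is that tracing has to be opted into up front, whereas a chain of explicit assignments exposes every intermediate value to a debugger by default.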

  • @vlplbl85
    @vlplbl85 2 years ago

    Great stuff

  • @Astana1337
    @Astana1337 2 years ago

    I like to use multiple inheritance for string Enum classes. For example:
    from enum import Enum

    class MyEnum(str, Enum):
        RED = 'RED'
        BLUE = 'BLUE'
        GREEN = 'GREEN'
    *Make sure the str comes first.
    Then you can use the class like normal, MyEnum.RED, and you can also use a string literal. It avoids the need to use the 'name' attribute. Lastly, you also get equality if you are comparing the enum to a string literal.
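
A short sketch of the equality behaviour described in this comment. `PlainEnum` is a hypothetical class added only for contrast with the `str` mixin version.

```python
from enum import Enum


class MyEnum(str, Enum):
    """Members are real str instances because str comes first in the bases."""
    RED = "RED"
    BLUE = "BLUE"


class PlainEnum(Enum):
    """Hypothetical comparison class without the str mixin."""
    RED = "RED"


mixed_equal = MyEnum.RED == "RED"        # compares as a string
plain_equal = PlainEnum.RED == "RED"     # a plain member never equals a str
is_string = isinstance(MyEnum.RED, str)  # usable wherever a str is expected
looked_up = MyEnum("BLUE")               # lookup by value returns the member
```

Here `mixed_equal` is True and `plain_equal` is False. On Python 3.11 and later, the standard library's `enum.StrEnum` offers a similar mixin out of the box.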

  • @nicolabombace2004
    @nicolabombace2004 2 years ago

    As always a great video! The only suggestion I would add is maybe to turn off Intellisense for the video, because all the red squiggly lines are a bit overwhelming and actually useless because the code works!

    • @ArjanCodes
      @ArjanCodes 2 years ago

      Thanks for the tip! I might do that for future refactorings (at least in the beginning :) ).

  • @canvasbagfight
    @canvasbagfight 2 years ago

    I’ve written a lot of spaghetti code to process scientific data. It’s usually so bad that it just stays as a notebook that’s copied over and laboriously edited for each new time I repurpose it. Really think this is useful content. More please.

  • @igordemetriusalencar5861
    @igordemetriusalencar5861 2 years ago +1

    The most important thing I've learned (I'm still learning) about writing good, clean, and reproducible data science code is the functional programming paradigm. R (with the tidyverse and tidymodels approach) and the Julia programming language made me code almost as if I were using Bertalanffy's "General System Theory" (ins -> transformations -> outs). With this approach, I can change the ins without breaking all the code, or I can change the functions (transformations, each one with its own rule) without breaking the whole code logic. Since I use Python only for NLP tasks, I do not use a functional programming paradigm with it, but I know it is possible, maybe even easier, in Python (function composition was good to know about). The OO paradigm that some data scientists use for data science does not make any sense to me; of course, I am not a professional programmer, so maybe I think that way because I have no grounding in computer science. By the way, I'm learning a lot from you! Thank you very much!!!

    • @ArjanCodes
      @ArjanCodes 2 years ago +1

      Thanks Igor, glad you like the content! Using pure functions is certainly a great starting point. What OO programming brings to the table is that it provides a nice mechanism for structuring data representations via (data)classes and collection objects such as lists, dicts, and so on. Ideally, you'd have a marriage of both that provides a clear structure of the data, and has data manipulation pipelines with very limited coupling and side effects.

    • @igordemetriusalencar5861
      @igordemetriusalencar5861 2 years ago

      @ArjanCodes Thank you! I will try to apply this approach to my NLP study code. I know I have a lot to learn before I understand OO stuff, classes and dataclasses, but your videos are helping me a lot.

  • @TheGagman2000
    @TheGagman2000 2 years ago +13

    Reiterating the others, very useful video for data scientists!
    I liked the idea of replacing the nested call with the compose function, but what about an "apply" function instead?
    def apply_composition(x, *functions):
        for func in functions:
            x = func(x)
        return x
    For me, this seems easier to read than the functools solution... and it's similar to the idea of a torch.nn.ModuleList container in PyTorch

    • @jessicameneguel4954
      @jessicameneguel4954 2 years ago +3

      This way you are replacing x as f(x) in the same fashion as the original implementation.

  • @DS-tj2tu
    @DS-tj2tu 2 years ago

    Thank you

  • @gustavojuantorena
    @gustavojuantorena 2 years ago

    Awesome! I think there are few tutorials about software design topics for data science.

  • @tehdusto
    @tehdusto 1 year ago

    27:07
    yo dog I heard you like lambda functions, so I put a lambda function in your lambda function so you can function while you function.
    ...but really this function composition business is actually breaking my mind. I'll need to practice this one.

  • @BjarneThorsted
    @BjarneThorsted 2 years ago +1

    Next time, you should definitely do a TensorFlow/Keras project. Would love to see how you would go about cleaning up the code in a project like that. Full disclosure: I've written a very convoluted DL project with tf.keras and I'm 100% positive it can be written better

    • @ArjanCodes
      @ArjanCodes 2 years ago

      Great suggestion! Feel free to submit your code as a Code Roast, and I'd be happy to take a look if it's something I can cover on the channel.

    • @BjarneThorsted
      @BjarneThorsted 2 years ago

      @ArjanCodes I will try and see if I can package it up in a meaningful way. Right now it is split across two private GitHub repos and trains on a rather large and proprietary image dataset

  • @esteenbrink
    @esteenbrink 2 years ago

    Sponsored by 'basically'.
    Just kidding, great content. Keep it up.

  •  2 years ago

    I loved this video. It was the best moment to apply the SOLID design principles to data science, because I work with it on a daily basis. Could you apply the SOLID principles to the pandas library, since it is the most used library for data processing? Again, thank you very much!!

  • @jimogren6306
    @jimogren6306 2 years ago +2

    Great video! One thing that I did not quite understand: when you changed the ExperimentTracker from an abstract base class into a protocol then the TensorboardExperiment no longer inherits from ExperimentTracker. I do not see the connection between the two classes anymore. After the refactor, to me ExperimentTracker seems like an unused class. Or am I missing something?

    • @ArjanCodes
      @ArjanCodes  2 years ago +3

      After changing the ExperimentTracker to a Protocol class, the inheritance relationship between it and TensorboardExperiment is indeed gone. However, ExperimentTracker is used in the Runner class where it defines the interface that is expected for connecting the Runner with the experiment tracker. The result is that you can now create other experiment tracking classes that integrate seamlessly with the Runner class, as long as they implement the methods defined in ExperimentTracker.
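The relationship described above can be sketched as follows. This is a minimal illustration, not the code from the video: the method name `add_batch_metric`, its signature, and the `logged` list are assumptions chosen to keep the example self-contained.

```python
from typing import Protocol


class ExperimentTracker(Protocol):
    """Interface the runner expects (method name is a hypothetical stand-in)."""

    def add_batch_metric(self, name: str, value: float, step: int) -> None: ...


class TensorboardExperiment:
    """Does not inherit from ExperimentTracker, but offers the same method."""

    def __init__(self) -> None:
        self.logged = []  # stands in for real TensorBoard writes

    def add_batch_metric(self, name: str, value: float, step: int) -> None:
        self.logged.append((name, value, step))


class Runner:
    """Depends only on the protocol, not on any concrete tracker."""

    def __init__(self, tracker: ExperimentTracker) -> None:
        self.tracker = tracker

    def run(self) -> None:
        self.tracker.add_batch_metric("accuracy", 0.9, step=0)


# A static type checker accepts this because the method signatures match.
runner = Runner(TensorboardExperiment())
runner.run()
```

The protocol is "used" only in the type annotation of `Runner.__init__`, which is why it can look unused when you search for subclasses.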

  • @hudabdulwahab2499
    @hudabdulwahab2499 1 year ago

    this video is amazing - can we please get another data science / ml pipeline refactor?

  • @esteenbrink
    @esteenbrink 2 years ago

    At 14:25 you decide to remove the protocol inheritance, making it implicit. There is no difference to the working of the code, though it does make life harder for anyone needing to change and understand this class, for it is not clear anymore that it should adhere to the protocol.
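Python allows a class to name a protocol explicitly as a base class, which keeps the intent visible while consumers still rely on structural typing. A minimal sketch of both options, with hypothetical class names and a hypothetical `flush` method:

```python
from typing import Protocol


class ExperimentTracker(Protocol):
    def flush(self) -> None: ...


class ExplicitTracker(ExperimentTracker):
    """Names the protocol as a base class, documenting the intent."""

    def flush(self) -> None:
        pass


class ImplicitTracker:
    """No base class; matches the protocol purely through its method."""

    def flush(self) -> None:
        pass


def finish(tracker: ExperimentTracker) -> str:
    """Accepts any object with a matching flush() method."""
    tracker.flush()
    return type(tracker).__name__


# Both calls type-check and run identically.
names = [finish(ExplicitTracker()), finish(ImplicitTracker())]
```

Which form to prefer is a readability trade-off: the explicit base helps readers of the implementing class, while the implicit form avoids a dependency on the module that defines the protocol.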

  • @sombrero7935
    @sombrero7935 2 years ago +2

    The one issue I have with this design is that it is based solely on PyTorch, so if you want to move to another framework such as TensorFlow, this will require quite a bit of refactoring (without taking into account the new framework coding stuff), most likely introducing breaking changes for consumers that use the project.

    • @ArjanCodes
      @ArjanCodes  2 years ago +5

      In general, this is a really hard problem to solve. Especially since most frameworks like Pytorch, TensorFlow, etc. ask you to "marry" the framework and use their data types all over the place, which then makes it hard to replace the framework with something else. I'll look into this and try to come up with some ideas to do a video about this.

  • @justfoundit
    @justfoundit 2 years ago +3

    Using Sequential is one way, and it works nicely when the model has a linear flow. However, if you want to build a model with - for example - 2 outputs that sit on different levels of the model, you need to use the non-sequential way, and then the X for all intermediate stages starts to make sense :)

    • @ArjanCodes
      @ArjanCodes  2 years ago +1

      In this case I would prefer to have a class for defining a directed acyclic graph. Perhaps PyTorch also has this... I didn't check.

  • @mhFFFFFF
    @mhFFFFFF 2 years ago

    Maybe already answered, but does Pandas have function composition (aka network or sequential)? IMO this is a huge benefit of using the R tidyverse, the %>% command is called a “pipe” but it seems to work exactly like function composition and is extremely well-supported and flexible.

  • @felipealvarez1982
    @felipealvarez1982 2 years ago

    I would love to know about the vscode keyboard shortcuts you love the most

  • @TimGrob
    @TimGrob 2 years ago

    Overriding the 'forward' function in the Torch model and updating the state (tensor) of the neural network at each step is actually the way PyTorch recommends doing it.

  • @ravenecho2410
    @ravenecho2410 2 years ago

    okay catching up on vids 😋

  • @kazmkazm9676
    @kazmkazm9676 1 year ago

    Thanks for your great content.
    However, I didn't find your custom composition function useful. PyTorch's Sequential or scikit-learn's Pipeline seem more appropriate.

  • @matthewtaruno
    @matthewtaruno 2 years ago +2

    One point to consider from a data scientist: a lot of the time we like quick and dirty iterations on our exploratory and predictive insights. Many times (especially under time constraints) quick and dirty is better than slow and beautiful. That's why I personally love notebooks. As long as it is idempotent (notebook runs from start to end without issues) and the environment is containerized, it is reproducible. But I see the merit of both. There is a lot of power in writing scalable and reusable code in this space to organize complex pipelines that supercharge society's solutions. This is why, over time, I have learned to use a hybrid of both - but maybe not in the most optimal or well-principled way.
    Which leads to my suggestion! Would you be able to make a video on how you would use Jupyter Notebooks/Kaggle Kernel Notebooks/Google Colab Notebooks in tandem with an internal packaged-up repository like the one in the video for DS projects? Maybe this means just maintaining your current directory structure as shown in this video but adding a "notebooks" folder to the root folder where all that type of analysis is done, since we can call your modules from that notebooks folder (not sure how this would be manifested, you probably have a better idea). You use .py scripts for most things, and you can install these scripts as modules for use in other scripts or even notebooks; that is what I have been doing to keep my notebooks cleaner. But I am sure your perspective on how to have fast iteration times to high-value insights, maintain a scalable pipeline, yet keep everything reusable in doing this kind of work - even maybe some sort of generalized approach shown through a video example - would be invaluable. I think this would be a game changer for myself and a lot of people in DS and ML.
    As for this video, your other content has been useful, but seeing it directly applied to the type of work I do on a regular basis brings your concepts to life for me. Please keep these software design principles applied to DS crossover content coming! Thank you for what you do :)

    • @ArjanCodes
      @ArjanCodes  2 years ago +1

      Thanks and great suggestion regarding the combination of notebooks with running python scripts in a repository. I'll look into it!

  • @DuyTran-ss4lu
    @DuyTran-ss4lu 2 years ago

    Great

  • @gercius
    @gercius 1 year ago

    You are the Bob Ross of coding

    • @ArjanCodes
      @ArjanCodes  1 year ago

      Thanks Gercius, happy you’re enjoying the content!

  • @jakobullmann7586
    @jakobullmann7586 11 months ago

    It’s an interesting video, but I think it’s actually misguided advice for Data Science/ML projects. Data Science projects have different dynamics from software engineering projects, hence the need for MLOps platforms. Tracking is needed in the experimentation stage, when things change quickly, and writing abstractions to become independent of a particular experiment tracking platform is not creating value for anyone.
    What’s actually important is that the experimentation code is decoupled from the model code (which is why TensorFlow and LightGBM use callbacks… PyTorch doesn’t, but PyTorch Lightning does, which is why I would always use PyTorch Lightning and not raw PyTorch). Moreover, where I feel abstractions are really powerful is for the model itself, because in order to do model selection I may have to apply a fair evaluation to models that utilize different frameworks (e.g. PyTorch vs LightGBM) or even different problem framings. The first point is what MLflow Models tries to accomplish.

  • @EW-mb1ih
    @EW-mb1ih 2 years ago

    Except for using Protocol instead of ABC, your video is nice :)
    Protocol makes things less clear.
    Silly question: why do we need to avoid storing intermediate results in the same variable?

  • @songokussj4cz
    @songokussj4cz 2 years ago

    Hi Arjan. Love your stuff. Would you be able to create a comprehensive video about "How to structure a bigger project"? I've got a task to create a PySide2 application with at least 3 windows (Main, Settings, Results) and I'm not sure how to structure it so it's not all inside one file, because that's just too much chaos. How do I connect signals to functions and where do I write them, should each window (code) be an individual file, how do I connect everything, and how do I pass a variable from one window to another?

  • @davidoh6342
    @davidoh6342 2 years ago

    How do you handle errors if one of the composed functions raises an error?

  • @rshelansky
    @rshelansky 2 years ago +1

    Thanks for these videos, they have been fun to watch. I see the benefit of function composition; however, in practice (data science), when composing functions I have always had a whole slew of unique parameters and contexts to pass to each function along the chain. Is there an equally elegant solution to this problem?

    • @ArjanCodes
      @ArjanCodes  2 years ago +4

      Hi Robert, good question. I like using either closures for this or partial functions (from functools). For example with closures, you can define a function (with parameters, contexts, etc) that returns another function and then that's the function that's passed to the composition. In terms of the example in this video at the end, you could do the following, where n is an extra parameter, add_n is a closure that returns a function:
      def add_n(n: int):
          def add(x: int):
              return x + n
          return add
      ...
      compose(add_n(5), add_n(12), multiplyByTwo, ...)
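The other option mentioned in this reply, partial functions from `functools`, could look like the sketch below. The `compose` helper is an assumption about the version used in the video, and `add` and `multiply_by_two` are illustrative names.

```python
from functools import partial, reduce
from typing import Callable

Step = Callable[[int], int]


def compose(*functions: Step) -> Step:
    """Chain one-argument functions, applied left to right."""
    return reduce(lambda f, g: lambda x: g(f(x)), functions, lambda x: x)


def add(x: int, n: int) -> int:
    return x + n


def multiply_by_two(x: int) -> int:
    return x * 2


# partial fixes n up front, leaving a one-argument function for the pipeline.
pipeline = compose(partial(add, n=5), partial(add, n=12), multiply_by_two)
result = pipeline(1)  # (1 + 5 + 12) * 2
```

Compared with a closure, `partial` avoids writing a nested function, at the cost of requiring the extra parameters to be passed by keyword (or to come first positionally).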

    • @mathmo
      @mathmo 2 years ago

      @@ArjanCodes Robert, not sure whether @ArjanCodes would approve of this, but you could define a Callable ABC base class for your functions that implements an __rmul__ (or something like that) method that implements function composition for the __call__ methods, and initialize the instances with whatever parameters you want that are not part of the functional input data. And if you make the __call__ method accept and return a dict, you can also compose functions of different arities.
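A rough sketch of that idea follows. It deviates from the comment in two labeled ways: it uses `__rshift__` (the `>>` operator) instead of `__rmul__` so the steps read left to right, and it uses a plain class rather than an ABC. The names `Step`, `scale`, and `shift` are hypothetical.

```python
from typing import Any, Callable


class Step:
    """Wraps a function plus fixed parameters; `>>` chains two steps."""

    def __init__(self, func: Callable[..., Any], **params: Any) -> None:
        self.func = func
        self.params = params

    def __call__(self, data: Any) -> Any:
        # Fixed parameters are supplied here, so callers pass only the data.
        return self.func(data, **self.params)

    def __rshift__(self, other: "Step") -> "Step":
        # The left operand runs first, then the right operand.
        return Step(lambda data: other(self(data)))


def scale(data: dict, factor: float) -> dict:
    return {**data, "value": data["value"] * factor}


def shift(data: dict, offset: float) -> dict:
    return {**data, "value": data["value"] + offset}


pipeline = Step(scale, factor=2) >> Step(shift, offset=1)
result = pipeline({"value": 10})  # 10 * 2 + 1
```

Passing a dict through every step is what lets functions with different numbers of inputs share one pipeline, as the comment suggests.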

  • @smalltimer666
    @smalltimer666 2 years ago +1

    Hi Arjan, I write a lot of models and I wanted to ask if you have tips regarding what I imagine is a very simple issue. Version hell. I write code on multiple machines, using multiple styles: jupyter notebooks, org buffers, and of course scripts. Everything is almost always contained in a pipenv environment. But when I try to pipenv install on different machines I keep getting all sorts of version-related errors. I think I am missing some key insight here. There is no way python has such a sloppy design :D Any tips will be really appreciated!

    • @cajmrn1
      @cajmrn1 2 years ago

      DVC, MLflow, and/or Kedro will change your life. They changed mine :).

  • @mnsosa
    @mnsosa 2 years ago

    Where can I learn how to design professional Machine Learning projects? All I've found is Jupyter Notebooks, but I want to do it more professionally.

  • @christiencodes3086
    @christiencodes3086 2 years ago

    Do you have Kite installed for autocomplete?

  • @RichardVodden1
    @RichardVodden1 2 years ago

    Would you ever consider overriding `__str__` on an Enum to return `self.name`? That would avoid having to add `stage.name` in all those f-strings. It feels neat to me from a code repetition perspective, but it does violate the "Explicit is better than implicit" guidance of the Zen of Python. I'd be really interested in your opinion.

    • @ArjanCodes
      @ArjanCodes  2 years ago +1

      Great suggestion, and I think it works really well in this particular case.
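A minimal sketch of that suggestion, assuming a `Stage` enum with members named `TRAIN`, `TEST`, and `VAL` (the member names and the metric label are illustrative):

```python
from enum import Enum, auto


class Stage(Enum):
    TRAIN = auto()
    TEST = auto()
    VAL = auto()

    def __str__(self) -> str:
        # f-strings and str() now yield "TRAIN" instead of "Stage.TRAIN".
        return self.name


label = f"{Stage.TRAIN}/accuracy"
```

With this override, `f"{stage}"` and `f"{stage.name}"` produce the same text, so the `.name` suffix can be dropped from the f-strings.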

  • @Julien-hg8jh
    @Julien-hg8jh 2 years ago

    23:50 Is yield also a good solution?

  • @atillakoseoglu4089
    @atillakoseoglu4089 2 years ago

    Dear Arjan,
    I have been learning Python for 3 months (classes, function basics, etc.)
    and I'm interested in data work, not development 🙀
    Do you think that is a problem? I mean for finding a job, and career-wise.
    Thanks for your kind answers and advice
    🙏

  • @_veikkomies
    @_veikkomies 2 years ago

    How can Tensorboard do anything using the experiment tracker class since you removed the inheritance and I can't see how the two classes are linked any more. What's the point of the experiment tracker class now?

    • @ArjanCodes
      @ArjanCodes  2 years ago

      That’s the whole idea of protocols. The relationship no longer exists between superclasses and subclasses, but you use protocols to define the interface at the place where it’s needed and Python’s structural typing system then does the type checks. So in this example, the goal of the experiment tracker protocol class is not to act as a superclass, but to act as an interface of the part of the code that uses it, here that’s the main file and the Runner class.

    • @_veikkomies
      @_veikkomies 2 years ago

      @@ArjanCodes Ahh thank you

  • @some84884
    @some84884 2 years ago +1

    Debugging function composition is painful. It's much better to have variables with unique names between calls.

  • @Booyah
    @Booyah 1 year ago

    Why do you switch from showing the code you're discussing, to showing yourself full screen and removing the code from view?