WE MUST ADD STRUCTURE TO DEEP LEARNING BECAUSE...

  • Added 12 May 2024
  • Dr. Paul Lessard and his collaborators have written a paper on "Categorical Deep Learning and Algebraic Theory of Architectures". They aim to make neural networks more interpretable, composable and amenable to formal reasoning. The key is mathematical abstraction, as exemplified by category theory - using monads to develop a more principled, algebraic approach to structuring neural networks.
    We also discussed the limitations of current neural network architectures in terms of their ability to generalise and reason in a human-like way. In particular, the inability of neural networks to do unbounded computation equivalent to a Turing machine. Paul expressed optimism that this is not a fundamental limitation, but an artefact of current architectures and training procedures.
    The power of abstraction allows us to focus on the essential structure while ignoring extraneous details, which can make certain problems more tractable to reason about. Paul sees category theory as providing a powerful "Lego set" for productively thinking about many practical problems.
    Towards the end, Paul gave an accessible introduction to some core concepts in category theory like categories, morphisms, functors, monads etc. We explained how these abstract constructs can capture essential patterns that arise across different domains of mathematics.
    Paul is optimistic about the potential of category theory and related mathematical abstractions to put AI and neural networks on a more robust conceptual foundation to enable interpretability and reasoning. However, significant theoretical and engineering challenges remain in realising this vision.
    Please support us on Patreon. We are entirely funded from Patreon donations right now.
    / mlst
    If you would like to sponsor us, so we can tell your story - reach out on mlstreettalk at gmail
    Links:
    Categorical Deep Learning: An Algebraic Theory of Architectures
    Bruno Gavranović, Paul Lessard, Andrew Dudzik,
    Tamara von Glehn, João G. M. Araújo, Petar Veličković
    Paper: categoricaldeeplearning.com/
    Symbolica:
    / symbolica
    www.symbolica.ai/
    Dr. Paul Lessard (Principal Scientist - Symbolica)
    / paul-roy-lessard
    Neural Networks and the Chomsky Hierarchy (Grégoire Delétang et al)
    arxiv.org/abs/2207.02098
    Interviewer: Dr. Tim Scarfe
    Pod: podcasters.spotify.com/pod/sh...
    Transcript:
    docs.google.com/document/d/1N...
    More info about NNs not being recursive/TMs:
    • Can ChatGPT Handle Inf...
    Geometric Deep Learning blueprint:
    • GEOMETRIC DEEP LEARNIN...
    TOC:
    00:00:00 - Intro
    00:05:07 - What is the category paper all about
    00:07:19 - Composition
    00:10:42 - Abstract Algebra
    00:23:01 - DSLs for machine learning
    00:24:10 - Inscrutability
    00:29:04 - Limitations with current NNs
    00:30:41 - Generative code / NNs don't recurse
    00:34:34 - NNs are not Turing machines (special edition)
    00:53:09 - Abstraction
    00:55:11 - Category theory objects
    00:58:06 - Cat theory vs number theory
    00:59:43 - Data and Code are one and the same
    01:08:05 - Syntax and semantics
    01:14:32 - Category DL elevator pitch
    01:17:05 - Abstraction again
    01:20:25 - Lego set for the universe
    01:23:04 - Reasoning
    01:28:05 - Category theory 101
    01:37:42 - Monads
    01:45:59 - Where to learn more cat theory
  • Science & Technology

Comments • 252

  • @deadeaded
    @deadeaded a month ago +191

    I'm slightly embarrassed at how excited I got when I saw the natural transformation square in the thumbnail...

    • @MDNQ-ud1ty
      @MDNQ-ud1ty a month ago +6

      You shouldn't, it is not a natural transformation square.

    • @deadeaded
      @deadeaded a month ago +5

      @@MDNQ-ud1ty What do you mean? It sure looks like a naturality square to me.

    • @tomhardyofmaths2594
      @tomhardyofmaths2594 a month ago +1

      Right?? I was like 'Oooh what's this?'

    • @PlayerMathinson
      @PlayerMathinson a month ago +9

      @@MDNQ-ud1ty Yes, that is a natural transformation square: alpha is a natural transformation and F is a functor.

    • @MDNQ-ud1ty
      @MDNQ-ud1ty a month ago

      @@PlayerMathinson If you interpret it that way, ok then. But that is what mathematicians call a commuting square, not a natural transformation square.
      A natural transformation, while being what you said (a map between two functors), is written differently (a double arrow between two functor arrows, a globular 2-morphism).
      The way I see it, it is just the components of a natural transformation... at least potentially, since of course we have to guess exactly what the other symbols mean.
      Basically:
      ncatlab.org/nlab/show/natural+transformation
      The first diagram is the natural transformation. The second is the commuting *square* (which looks like where he copied it from, so to speak), which is talking about the components.
      The reason the square, in my mind, is not technically a natural transformation is that a natural transformation requires it to be true for all morphisms, hence the different notation. Basically the square is a commuting square (assuming things commute) in the functor category. That may or may not be a component of some natural transformation (there may be no natural transformation between F and G).
      So to call it a natural transformation seems to me to be a bit loose with terminology.
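
For readers following the thread, the condition being debated can be written out. A natural transformation α : F ⇒ G between functors F, G : C → D assigns to every object X a component α_X, and naturality asks one square per morphism to commute:

```latex
% Natural transformation between functors F, G : C -> D
\alpha : F \Rightarrow G, \qquad \alpha_X : F(X) \to G(X) \ \text{for each object } X

% Naturality: for every morphism f : X -> Y the component square commutes
\begin{CD}
F(X) @>{F(f)}>> F(Y) \\
@V{\alpha_X}VV        @VV{\alpha_Y}V \\
G(X) @>{G(f)}>> G(Y)
\end{CD}
\qquad\qquad G(f) \circ \alpha_X \;=\; \alpha_Y \circ F(f)
```

A single such square is one component of the naturality condition, which is the distinction being drawn in the comment above.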

  • @johntanchongmin
    @johntanchongmin a month ago +44

    I like Dr. Paul's thinking - clear, concise and very analytical. LLMs don't reason, but they can do some form of heuristic search. When used with some structure, this can lead to very powerful search over the structure provided and increase reliability.

    • @andersbodin1551
      @andersbodin1551 a month ago +5

      More like some kind of compression of training data

  • @AliMoeeny
    @AliMoeeny a month ago +14

    Yet another exceptionally invaluable episode. Thank you Tim

  • @aitheignis
    @aitheignis a month ago +14

    This is an amazing video. I really love this tape. The idea of building a formal language based on category theory to reason about systems isn't limited to applications in neural networks, for sure. I can definitely see this being used in gene regulatory pathways. Thank you for the video, and I will definitely check out the paper.

    • @erikpost1381
      @erikpost1381 a month ago +1

      For sure. I don't know anything about the domain you mentioned other than that it sounds interesting, but you may be interested to have a look at the AlgebraicJulia space.

  • @jabowery
    @jabowery a month ago +21

    Removing the distinction between a function and a data type is at the heart of Algorithmic Information. AND gee, guess what? That is at the heart of Ockham's Razor!

    • @stretch8390
      @stretch8390 a month ago +3

      I haven't encountered this before, so a basic question: in what way is removing the distinction between function and data type different from having first-class functions?

    • @walidoutaleb7121
      @walidoutaleb7121 a month ago +3

      @@stretch8390 No difference, it's the same thing. In the original SICP lecture they are talked about interchangeably.

    • @stretch8390
      @stretch8390 a month ago

      @@walidoutaleb7121 Thanks for that clarification.

    • @jabowery
      @jabowery a month ago

      @@stretch8390 Think about 0-argument functions (containing no loops and calling no other functions) as program literals. The error terms in Kolmogorov Complexity programs (the representation of Algorithmic Information) are such functions.

    • @luisantonioguzmanbucio245
      @luisantonioguzmanbucio245 a month ago

      Yes! In fact, in typed lambda calculus and other type systems, e.g. the Calculus of Inductive Constructions and so on, functions have a type. Some of these type systems also serve as a foundation of mathematics, including Homotopy Type Theory, discussed in the video.

  • @oncedidactic
    @oncedidactic a month ago +1

    Great stuff! I enjoyed Paul’s way of talking about math - first the precise definition and then why do we care, part by part. Good work dragging it out until the pump primed itself 😅

  • @thecyberofficial
    @thecyberofficial a month ago +9

    As an abstract handle theorist, everything is my nail, my screw, my bolt, ... :)
    Often, the details thrown away by categorisation are exactly what matters, otherwise you just end up working with the object theory in the roundabout Cat (or Topoi) meta-language.

    • @radscorpion8
      @radscorpion8 29 days ago

      YOU THINK YOU'RE SOOOO SMART....and you probably are

    • @MDNQ-ud1ty
      @MDNQ-ud1ty 14 days ago

      Details matter. Without details there isn't anything. No one is throwing out details in abstraction; they are abstracting details. That is, generalizing and finding the common representation for what generates the details, or how to factor them into common objects that are general.
      Category theory isn't really anything special in the sense that humans have been doing "category theory" for thousands of years. What makes formal category theory great is that it gives precise tools/definitions to deal with complexity.
      I'm really only talking about your use of the words "throw away", as it has connotations that details don't matter when, in fact, details matter. One of the biggest problems in complexity is being able to operate at the right level of detail at the right time while not losing other levels of detail. When you lose "detail" you can't go back (non-invertible).
      Because mathematics relies so heavily on functions, and functions are usually non-injective, this creates loss of detail (two things being merged into one thing without a way to "get back"). This can be beneficial, given finite time and resources, if one can precisely "throw away" the detail one doesn't need; but usually if one has to "get back" it becomes an intractable problem or much more complicated.
      I think the main benefit of modern category theory is that it makes precise how to think about things, rather than leaving the vague idea that there is a "better way" without really understanding how to go about doing it.
      In fact, much of formal category theory is simply dealing with representations. So many things exist in our world (so many details) that are really just the same thing. Being able to determine such things by a formal process makes life much easier, especially when the "objects" are extremely complex. Category theory effectively lets one treat every layer of complexity the same (the same tools work at every layer).

  • @jumpstar9000
    @jumpstar9000 a month ago +7

    With regard to inscrutability around the 26-minute mark: my personal feeling is that the issue we face is overloading of models. As an example, let's take an LLM. Current language models take a kitchen-sink approach where we press them to generate coherent output and also apply reasoning. This doesn't really scale well when we introduce different modalities like vision, hearing or the central nervous system. We don't really want to be converting everything to text all the time and running it through a black box - not simply because it is inefficient, but more because it isn't the right abstraction. It seems to me we should be training multiple models as an ensemble that compose from the outset, where we have something akin to the pre-frontal cortex doing the planning in response to stimuli from other systems running in parallel. I have done quite a bit of thinking on this and I'm reasonably confident it can work. As for category theory and how it applies: if I squint I can kind of see it, but mostly in an abstract sense. I have built some prototypes for this that I guess you could say were type-safe and informed by category theory. I can see it might help to have the formalism at this level to help with interpretability (because that's why I built them). Probabilistic category theory is more along the lines of what I have been thinking.

    • @tomaszjezak0
      @tomaszjezak0 a month ago

      Would love to hear more about the brain approach

    • @chrism3440
      @chrism3440 18 days ago

      The concept of orchestrating multiple specialized models is intriguing and aligns with distributed systems' principles, where modularity and specialization reign. Hierarchical orchestration could indeed create an efficient top-down control mechanism, akin to a central nervous system, facilitating swift decision-making and prioritization. However, this might introduce a single point of failure and bottleneck issues.
      On the other hand, a distributed orchestration approach, inspired by decentralized neural networks, could offer resilience and parallel processing advantages. It encourages localized decision-making, akin to edge computing, allowing for real-time and context-aware responses. This could also align with principles of category theory, where morphisms between different model outputs ensure type safety and functional composition.
      Yet, I wonder if a hybrid model might not be the most robust path forward. This would dynamically shift between hierarchical and distributed paradigms based on the task complexity and computational constraints, possibly guided by meta-learning algorithms. Such fluidity might mirror the brain's ability to seamlessly integrate focused and diffused modes of thinking, leading to a more adaptable and potentially self-optimizing system.
      The implications for AI ethics and interpretability are profound. A hybrid orchestration could balance efficiency with the robustness of diverse inputs, potentially leading to AI systems whose decision-making processes are both comprehensible and auditable. Probabilistic category theory might play a vital role in this, offering a mathematically grounded framework to manage the complexity inherent in such systems.

  • @Daniel-Six
    @Daniel-Six 29 days ago

    This was an incredibly good discussion. Tim and company are definitely on to something elusive to articulate but crucial to appreciate regarding the real limitations of current machine "intelligence," and I can at least vaguely fathom how this will be made clear in the coming years.

  • @derricdubois1866
    @derricdubois1866 29 days ago +2

    The point of abstraction is to enable one to see a particular forest without being blinded by the sight of the trees.

  • @jonfe
    @jonfe a month ago +1

    The guy talking about external read/write memory for improving AI is right, in my view. I was thinking exactly the same and have been developing a model that has a kind of memory for a time-series problem, getting a lot of improvement in predictions.

  • @mapleandsteel
    @mapleandsteel a month ago +2

    Claude Lévi-Strauss finally getting the respect he deserves

  • @u2b83
    @u2b83 a month ago +1

    40:03 This is why I suspect NNs operated iteratively produce better results (e.g. stable diffusion, step-by-step reasoning, etc.). However, finite recursion appears to be good enough in practice. In SAT problems you can pose recursive problems by unrolling the recursion loop, which makes it possible to prove properties of programs up to a certain size.

  • @consumeentertainment9310
    @consumeentertainment9310 a month ago +1

    Amazing Talk!!

  • @erikowsiak
    @erikowsiak a month ago +3

    I love your podcasts; it seems you get all the right people to talk to :) just when I needed it :)

  • @adokoka
    @adokoka a month ago +19

    I believe Category Theory is the route to uncovering how DNNs and LLMs work. For now, I think of a category as a higher-level object that represents a semantics or a topology. Imagine how lovely it would be if LLMs could be trained on categories, possibly flattened into bytes.

    • @MikePaixao
      @MikePaixao a month ago

      Nah, number theory and fractal logic is where it's at :)

    • @adokoka
      @adokoka a month ago

      @@MikePaixao It depends on the application.

    • @blackmail1807
      @blackmail1807 a month ago +2

      Category theory isn’t a route to anything, it’s just the language of modern math. You can do whatever you want with it.

    • @grivza
      @grivza a month ago +1

      @@blackmail1807 You are ignoring the role of language in leading your prospective formulations. For a naive example, try doing some calculations using Roman numerals.

    • @MikePaixao
      @MikePaixao a month ago +1

      @@adokoka The problem with always relying on other people's theories is that you basically dead-end your own creativity. My solutions to AI have ended up looking like bits and pieces of a multitude of theories, but you honestly don't need any math or knowledge of existing models. By recreating or reverse engineering reality as a ground truth, you skip all the existing biases and limitations of existing solutions 🙂
      I like to solve problems to truly understand why they behave the way they do. I ask myself "why is q* efficient?", "do you know why converting to -101 can recreate 16-bit float models' precision?" I discovered all those systems last year when I reverse engineered how NeRFs and GPT think and see the world -> then did my own interpretation afterwards 🙃

  • @user-wv9pw9tq1g
    @user-wv9pw9tq1g a month ago +2

    Great discussion that gets into the weeds. Love the software engineer’s point of view. Only thing missing from Dr. Lessard is an ascot and a Glencairn of bourbon - because he wouldn’t dare sip Old Rip Van Winkle from a Snifter.😂

  • @2bsirius
    @2bsirius a month ago +1

    All they need is a membership card for admission to Jorge Borges' infinite library. I'm sure the resolution to this riddle is in one of the books in there somewhere.

  • @hi-literyellow4483
    @hi-literyellow4483 18 days ago

    The British engineer is spot on. Respect, sir, for your clear vision and clarification of the BS sold by Google marketeers.

  • @srivatsasrinivas6277
    @srivatsasrinivas6277 a month ago +2

    I'm skeptical about composability explaining neural networks, because small neural networks do not show the same properties as many chained together. Composability seems like a useful tool once the nets you're composing are already quite large.
    I think that the main contribution of category theory will be providing a dependent type theory for neural net specification.
    The next hype in explainable AI seems to come from the "energy based methods".

  • @wanfuse
    @wanfuse 13 days ago

    What an education one gets watching you guys! Thanks! On the stopping condition: why not stop within some proximity of the stop condition instead of requiring an exact match? Trying iteratively can tell you what the limit of proximity is.

  • @pierredeloince9073
    @pierredeloince9073 a month ago

    Thank you, how interesting 🤝

  • @davidrichards1302
    @davidrichards1302 a month ago

    Should we be thinking about "type adaptors"? Or is that too object-oriented?

  • @lincolnhannah2985
    @lincolnhannah2985 a month ago

    LLMs store information in giant matrices of weights. Is there any model that can process a large amount of text and create a relational database structure, where the tables and fields are generated by the model as well as the data in them?

  • @jonfe
    @jonfe a month ago +1

    Reasoning, for me, is like having a giant graph of "things" or "concepts" in your brain and learning the relationships between them through experience. For example, you can relate parts of an event to a different one just by finding correlations in the relationships between their internal parts, and by doing that you can transfer the learning from one event to the other.

    • @sirkiz1181
      @sirkiz1181 29 days ago

      Yeah, which makes sense considering the structure of your brain. This sort of structuring is clearly the way forward, but as a newcomer to AI it’s unclear to me how easy it is for AI and computers to understand concepts in the way that is so intuitive for us, and what kind of program would make that sort of understanding and subsequent reasoning possible.

  • @mrpocock
    @mrpocock 29 days ago +1

    I kind of feel machine learning has a few foundational issues that you can only brute-force past for so long. 1) As they say, there's no proper stack mechanism, so there are whole classes of problems that it can't actually model correctly but can only approximate special cases of. 2) The layers of a network build up to fit curves, but there's no proper way to extract equations for those curves and then replace that subnet with the equation, including flow control. So we are left with billions of parameters that are piecewise-fitting products and sine waves and exponentials and goodness knows what as complex sums of sums.

  • @asdf8asdf8asdf8asdf
    @asdf8asdf8asdf8asdf a month ago +5

    Dizzying abstract complexity surfing on a sea of reasonable issues and goals.

  • @jumpstar9000
    @jumpstar9000 a month ago

    On the recursion topic, there was a little confusion in the discussion. On the one hand there was something about language models not understanding recursion, but, more key, they have trouble using recursion while producing output. Clearly LMs can write recursive code and even emulate it to some degree. In any case, it is possible to train an LM with action tokens that manipulate a stack, in a way resembling FORTH, and get full recursion. It may be possible to add this capability as a bolt-on to an existing LM via fine-tuning. Having this would expand capabilities no end, providing not just algorithmic execution but also features like context consolidation and general improvements to memory, especially if you also give them a tuple store where they can save and load state... yes, exactly, you said it.

    • @deadeaded
      @deadeaded a month ago +2

      Being able to write recursive code is totally irrelevant to what's going on under the hood. That's true in general. GPT can write the rules of chess, for example, but it cannot follow them. Don't be fooled into thinking that LLMs understand their output.

    • @jumpstar9000
      @jumpstar9000 a month ago +1

      @@deadeaded Yes, I was pointing out that there was initially some confusion in the discussion with regard to this.
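
The FORTH-style "action token" idea discussed above can be made concrete with a toy sketch. This is a hypothetical minimal token machine (not from the episode or any real LM toolkit): each token pushes, pops, or combines values on an explicit stack, which is the kind of primitive the comment proposes bolting onto a language model.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical "action token" vocabulary: each token is one stack operation.
enum class Op { Push, Add, Mul, Dup, Swap };
struct Tok { Op op; long arg = 0; };  // arg is only used by Push

// Execute a token sequence against an explicit stack; return the top value.
long run(const std::vector<Tok>& prog) {
    std::vector<long> stack;
    auto pop = [&stack] {
        if (stack.empty()) throw std::runtime_error("stack underflow");
        long v = stack.back();
        stack.pop_back();
        return v;
    };
    for (const Tok& t : prog) {
        switch (t.op) {
            case Op::Push: stack.push_back(t.arg); break;
            case Op::Add:  { long b = pop(), a = pop(); stack.push_back(a + b); } break;
            case Op::Mul:  { long b = pop(), a = pop(); stack.push_back(a * b); } break;
            case Op::Dup:  { long a = pop(); stack.push_back(a); stack.push_back(a); } break;
            case Op::Swap: { long b = pop(), a = pop(); stack.push_back(b); stack.push_back(a); } break;
        }
    }
    return pop();
}
```

Emitting such tokens alongside ordinary text tokens would give a model a real pushdown store, which is exactly what the plain transformer decoding loop lacks.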

  • @colbynwadman7045
    @colbynwadman7045 a month ago +3

    They should stop interrupting the speaker with random questions since it’s super annoying.

  • @colbynwadman7045
    @colbynwadman7045 a month ago +2

    Both branches in an if expression in Haskell have to be of the same type. There are no union types like in other languages.

  • @tonysu8860
    @tonysu8860 a month ago

    In the segment "NNs are not Turing machines", a lot of discussion seemed to be about how to limit recursive search, and possibly that Turing machines are not capable of recursive functionality.
    I'm not a data scientist, but I have read the published AlphaZero paper and am somewhat familiar with how that technology is implemented in Lc0.
    I've never looked at how that app terminates search, but it's reasonable to assume it's determined by the parameters of gameplay.
    But I would also assume that limitation can be determined by other means. The observation that a "bright bit" might never light up is true, but only if you think in an absolute sense, which is generally how engineers think: in terms of precise and accurate results. I'd argue that problems like this require a change of thinking, more akin to quantum physics or economics, where accuracy might be desirable if achievable, but the answer is more often determined by predominance - good enough when all the accumulated data and metadata suggest some very high but not yet exact accuracy. Someone, if not the algorithm itself, has to set the threshold that illuminates the bright bit, signals the end of the search and produces a result.

  • @MrGeometres
    @MrGeometres 16 days ago

    10:04 "Code is Data" is especially clear in Linear Algebra. A vector |v⟩ is data. A function is code. But a vector also canonically defines a linear function: x ↦ ⟨v∣x⟩.
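
The point in this comment, that the same vector is simultaneously data and (via the inner product) code, can be sketched in a few lines of C++ (illustrative only; the function name is an assumption):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// |v> as data: just a list of coefficients.
// <v|.> as code: the same numbers define the linear functional x |-> <v|x>.
double functional(const std::vector<double>& v, const std::vector<double>& x) {
    assert(v.size() == x.size());
    return std::inner_product(v.begin(), v.end(), x.begin(), 0.0);
}
```

Nothing distinguishes the "program" from the "data" here except which argument slot the vector occupies.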

  • @darylallen2485
    @darylallen2485 a month ago +1

    1:57 - It's been several years since I took calculus, but I remember being exposed to some functions that calculated the area of a shape where the domain of the function was negative infinity to positive infinity, but the area was a finite number. Mathematically, it seems it should be possible to achieve finite solutions with infinite inputs.

    • @lobovutare
      @lobovutare a month ago

      Gabriel's horn?

    • @darylallen2485
      @darylallen2485 a month ago

      @@lobovutare That's certainly one example.

    • @dhruvdatta1055
      @dhruvdatta1055 a month ago

      In my opinion, the curve-shape function that we integrate can be considered a single input.
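
Both examples raised in this thread can be made explicit. The Gaussian has infinite domain but finite area, and Gabriel's horn (the surface of revolution of 1/x for x ≥ 1, mentioned above) has finite volume but infinite surface area:

```latex
\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}

V \;=\; \pi \int_{1}^{\infty} \frac{dx}{x^2} \;=\; \pi,
\qquad
A \;=\; 2\pi \int_{1}^{\infty} \frac{1}{x}\sqrt{1 + \frac{1}{x^4}}\,dx
\;\ge\; 2\pi \int_{1}^{\infty} \frac{dx}{x} \;=\; \infty
```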

  • @AutomatedLiving09
    @AutomatedLiving09 a month ago +9

    I feel that my IQ increases just by watching this video.

  • @chadx8269
    @chadx8269 a month ago

    Professor Van Nostram do you allow questions?

  • @alvincepongos
    @alvincepongos 27 days ago

    Say you apply category theory to NNs and you do find a geometric algebra that operationally formalizes the syntax and semantics of the system. Is it possible that the resulting algebra is exactly what's built in, compositions of activated linear equations? If that is the case, no insights are gained. To prevent this problem, how are CT/ML scientists posing the approach so that category theory's insights are deeper than that?

  • @bwhit7919
    @bwhit7919 a month ago +11

    Damn this is the first podcast I couldn’t just leave on 2x speed
    Edit nvm it was just the first 5 min

  • @davidallen5146
    @davidallen5146 a month ago

    I think the future of these AI systems should be structured data in and out. This would support the concept of geometric deep learning, as well as AI systems that are more understandable and composable, with each other and with traditionally programmed systems. It would also support the generation and use of domain-specific interfaces/languages. What they also need is the ability to operate recurrently on these structures. This recurrence can occur internally to the AI systems, or as part of the composition of AIs.

  • @stacksmasherninja7266
    @stacksmasherninja7266 a month ago +4

    What was that template metaprogramming hack to pick the correct sorting algorithm? Any references for that, please? Sounds super interesting.

    • @nomenec
      @nomenec a month ago

      Any chance you can join our MLST Discord (link at the bottom of the description), and send me (duggar) a mention from the software-engineering channel? We can better share and discuss there.

    • @nomenec
      @nomenec a month ago +2

      Not sorting, but here is an example from my recent code of providing two different downsample algorithms based on iterator traits:
      // random access iterators
      template < typename Iiter, typename Oiter >
      auto downsample (
          Iiter & inext, Iiter idone,
          Oiter & onext, Oiter odone
      ) ->
          typename std::enable_if< std::is_same<
              typename std::iterator_traits<Iiter>::iterator_category,
              std::random_access_iterator_tag
          >::value, void >::type
      {
          // ...
      }
      // not random access iterators
      template < typename Iiter, typename Oiter >
      auto downsample (
          Iiter & inext, Iiter idone,
          Oiter & onext, Oiter odone
      ) ->
          typename std::enable_if< !std::is_same<
              typename std::iterator_traits<Iiter>::iterator_category,
              std::random_access_iterator_tag
          >::value, void >::type
      {
          // ...
      }
      For very cool algebraic group examples check out Chapter 16 of "Scientific and Engineering C++: An Introduction With Advanced Techniques and Examples" by Barton & Nackman.

    • @andreismirnov69
      @andreismirnov69 a month ago

      Original paper by Stepanov and Lee: citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=658343dd4b5153eb59f834a2ec8d82106db522a8
      Later it became known as the STL and ended up as part of the C++ standard library.
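
A possible C++17 restatement of the SFINAE dispatch shown in the thread above: `if constexpr` collapses the pair of `enable_if` overloads into a single body. The signature and the keep-every-`step`-th semantics here are illustrative assumptions, not the original code.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// Copy every step-th element of [first, last) to out. The branch is chosen at
// compile time from the iterator category, like the enable_if pair above.
template <typename Iiter, typename Oiter>
void downsample(Iiter first, Iiter last, Oiter out, std::size_t step) {
    using Cat = typename std::iterator_traits<Iiter>::iterator_category;
    if constexpr (std::is_same_v<Cat, std::random_access_iterator_tag>) {
        // Random access: index arithmetic, never moving the iterator past last.
        for (std::size_t i = 0, n = static_cast<std::size_t>(last - first); i < n; i += step)
            *out++ = first[i];
    } else {
        // Weaker iterators (e.g. std::list): advance one element at a time.
        std::size_t i = 0;
        for (; first != last; ++first, ++i)
            if (i % step == 0) *out++ = *first;
    }
}
```

Both branches must still type-check only in the taken case, which is what `if constexpr` guarantees for the discarded branch of a template.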

  • @your_utube
    @your_utube a month ago

    In my view, with my limited knowledge, the work of quantifying and classifying the primitives of ANNs should have been done by now; at a minimum, recording what has been learned over the last two decades in a format that allows you to merge it with existing systems is a given. I ask myself whether one can explain existing ways of doing computation in terms of the primitives of the ANN systems that are popular now. In other words, can we transform one process into another and back, to at least prove what the limits and capabilities of the new ways are in terms of the well-known ones?

  • @transquantrademarkquantumf8894

    Nice edits, great high-speed symmetry

  • @srivatsasrinivas6277
    @srivatsasrinivas6277 a month ago

    I think that specificity is as important as abstraction
    Domain specific languages and programs mutually justify each other's existence

  • @andrewwalker8985
    @andrewwalker8985 a month ago +33

    How many people started watching this and feel like their passion for AI somehow tricked them into getting a maths degree?

    • @Walczyk
      @Walczyk 28 days ago

      I got my degree before AI, so no, but I'm now more interested in algebraic geometry.

    • @captainobvious9188
      @captainobvious9188 27 days ago +1

      I almost finished my degree in math back in the 2000s for this reason, but I got medically derailed and never made it back. I hope to get back someday!

    • @KunjaBihariKrishna
      @KunjaBihariKrishna 25 days ago +4

      "passion for AI" I just vomited

    • @andrewwalker8985
      @andrewwalker8985 25 days ago

      @@KunjaBihariKrishna lol fair enough

  • @drdca8263
    @drdca8263 28 days ago +1

    39:28: small caveat to the "quantum computers can't do anything a Turing machine can't do" statement: while it is true that any individual computation that can be done by a quantum computer can be done with a Turing machine (as a TM can simulate a QC), a quantum computer could have its memory be entangled with something else outside of it, while a Turing machine simulating a quantum computer can't have the simulated quantum computer's data be entangled with something which exists outside of the Turing machine. This might seem super irrelevant, but surprisingly, if you have two computationally powerful provers who can't communicate with each other, but do have a bunch of entanglement between them, and there is a judge communicating with both of them, then the entanglement between them can allow them to demonstrate to the judge that many computational problems have the answers they do, which the judge wouldn't be able to compute for himself; and the number of such problems whose answers they could prove* to the judge is greatly expanded by their having a bunch of entanglement between them.
    MIP* = RE
    is iirc the result
    But, yeah, this is mostly just an obscure edge case, doesn’t really detract from the point being made,
    But I think it is a cool fact
    53:18 :
    mostly just a bookmark for myself,
    But hm. How might we have a NN implement a FSM in a way that makes TMs that do something useful, be more viable?
    Like, one idea could be to have the state transitions be probabilistic, but to me that feels, questionable?
    But like, if you want to learn the FSM controlling the TM by gradient descent, you need to have some kind of differentiable parameters?
    Oh, here’s an idea: what if instead of the TM being probabilistic, you consider a probability distribution over FSMs, but use the same realization from the FSM throughout?
    Hm.
    That doesn’t seem like it would really like, be particularly amenable to things like, “learning the easy case first, and then learning how to modify it to fix the other cases”?
    Like, it seems like it would get stuck in a local minimum...
    Hmm...
    I guess if one did have a uniform distribution over TMs with at most N states, and had the distribution as the parameter, and like, looked at the expected score of the machines sampled from the distribution (where the score would be, “over the training set, what fraction of inputs resulted in the desired output, within T steps”), taking the gradient of that with respect to the parameters (i.e. the distribution) would, in principle, learn the program, provided that there was a TM with at most N states which solved the task within time T... but that’s totally impractical. You (practically speaking) can’t just simulate all the N state TMs for T steps on a bunch of inputs. There are too many N state TMs.
    Maybe if some other way of ordering the possible FSMs was such that plausible programs occurred first?
    Like, maybe some structure beyond just “this state goes to that state”?
    Asdf.Qwertyuiop.
    Idk.
    Hm, when I think about what I would do to try to find the pattern in some data, I think one thing I might try, is to apply some transformation on either the input or the output, where the transformation is either invertible or almost invertible, and see if this makes it simpler?
    .. hm,
    If a random TM which always halts is selected (from some distribution), and one is given a random set of inputs and whether the TM accepts or rejects that input, and one’s task is to find a TM which agrees with the randomly selected TM on *all* inputs (not just the ones you were told its output for),
    how much help is it to also be told how long the secret chosen TM took to run, for each of the inputs on which you are told its output?
    I feel like, it would probably help quite a bit?
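A hedged sketch of the differentiable-FSM idea floated above (pure NumPy, all names mine, not from the video): relax the FSM's transition table into per-symbol stochastic matrices parameterized by logits, so "expected acceptance" becomes differentiable and plain gradient descent can search FSM-space, here for the regular language "odd number of 1s".

```python
import numpy as np

N_STATES, N_SYMBOLS = 2, 2
rng = np.random.default_rng(0)
logits = rng.normal(size=(N_SYMBOLS, N_STATES, N_STATES))  # learnable parameters

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def accept_prob(lg, string):
    """Probability mass on the accepting state after reading `string`."""
    T = softmax(lg)                   # one row-stochastic matrix per symbol
    state = np.array([1.0, 0.0])      # start fully in state 0
    for sym in string:
        state = state @ T[sym]        # soft state transition
    return state[1]                   # state 1 is the accepting state

# Target: strings with an odd number of 1s (regular, so 2 states suffice).
data = [([0, 1, 1, 0], 0.0), ([1, 0, 0], 1.0), ([1, 1, 1], 1.0), ([0], 0.0)]

def loss(lg):
    return sum((accept_prob(lg, s) - y) ** 2 for s, y in data)

loss0 = loss(logits)
# Finite-difference gradient descent keeps the sketch dependency-free.
for _ in range(200):
    grad = np.zeros_like(logits)
    for idx in np.ndindex(logits.shape):
        bumped = logits.copy()
        bumped[idx] += 1e-4
        grad[idx] = (loss(bumped) - loss(logits)) / 1e-4
    logits = logits - 0.5 * grad
```

This is one concrete reading of "a distribution over FSMs with differentiable parameters"; it does not address the local-minimum worry raised above, it only shows the relaxation is mechanically possible.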

  • @CharlesVanNoland
    @CharlesVanNoland a month ago +6

    There was a paper about a cognitive architecture that combined an LSTM with an external memory to create a Neural Turing Machine called MERLIN a decade ago. There was a talk given about it on the Simons Institute's YouTube channel called "An Integrated Cognitive Architecture".

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  a month ago +1

      There are a bunch of cool architectures out there that make NNs simulate some TM-like behaviours, but none are TMs. It's a cool area of research! It's also possible to make an NN which is like a TM but which is not possible to train with SGD. I hope we make some progress here. Researchers - take up arms!

    • @charllsquarra1677
      @charllsquarra1677 a month ago +1

      @@MachineLearningStreetTalk why wouldn't it be possible to train with SGD? After all, commands in a TM are finite actions, which can be modelled with a GFlowNet; the only missing piece is an action that behaves as a terminal state and passes the output to a reward model that feeds back into the GFlowNet

    • @nomenec
      @nomenec a month ago

      @@charllsquarra1677 it's more of an empirical finding that as you increase the computational power of NNs, for example the various MANNs (memory-augmented NNs), training starts running into extreme instability problems. I.e. we haven't yet figured out how to train MANNs for general purposes, that is, to search the entire space of Turing-complete algorithms rather than small subspaces like the FSA space. We might at some point, and the solution might even involve SGD. Just, nobody knows yet.
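For readers wondering what the "memory-augmented" part looks like, here is a hedged, generic sketch of soft content-addressed read/write, the common ingredient of NTM-style MANNs (not any specific paper's exact equations; all names are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read(memory, key):
    """Content-based addressing: similarity -> attention weights -> blended read."""
    w = softmax(memory @ key)      # attention over memory slots
    return w @ memory

def write(memory, key, value, erase=1.0):
    """Soft overwrite: each slot moves toward `value` in proportion to its weight."""
    w = softmax(memory @ key)
    return memory * (1.0 - erase * w[:, None]) + w[:, None] * value

M = np.zeros((4, 3))               # 4 slots of width 3, initially empty
v = np.array([1.0, 2.0, 3.0])
k = np.array([1.0, 0.0, 0.0])
M = write(M, k, v)                 # attention is uniform here, so the write smears
r = read(M, k)                     # ... and the read returns the smeared average
```

Because every operation is a smooth blend, gradients flow through memory; that same smearing of reads and writes is one intuition for why training these models is so delicate.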

  • @jumpstar9000
    @jumpstar9000 a month ago +1

    I'm only 8 minutes in, but it is making me nervous. The assertion regarding systems that are non-composable breaks down. Lists and Trees aren't composable because of their representation paired with the choice of algorithm you are using. We already know that we can flatten trees to lists, or make list-like trees... or more abstractly introduce an additional dimension to both trees and lists that normalizes their representation so we can apply a uniform algorithm. If you want to look at it from a different angle, we know that atoms form the basis of machines and therefore atoms have no problem dealing with both trees or lists. 2D images can also represent both data-types. The thing is, we don't walk around the world and change brains when we see an actual tree vs a train. Anyway, like I said I just got going. It is very interesting so far... I'm sure all will be revealed. Onward...

  • @FranAbenza
    @FranAbenza a month ago +1

    Is human biological machinery better understood as a functional-driven system or OO? Why? from cell to cognition?

    • @glasperlinspiel
      @glasperlinspiel a month ago

      Read Amaranthine: How to create a regenerative civilization using artificial intelligence

  • @R0L913
    @R0L913 26 days ago +1

    Not that they are making mistakes and need fresh input. I am noting all the terms so I can learn. One of my kids is a linguist.
    Another is a recruiter and must recruit/find people who can create programming languages that fit. It’s all one exciting thing.
    Remember Java, remember object-oriented programming: you’re important, keep at it, you may create a breakthrough ❤

  • @hnanacc
    @hnanacc a month ago

    Why is nature infinite? What if it's just the same things repeating, but with some variance? So a plausible assumption is that there is a large amount of information to be memorized, which needs further scaling, but the model can emulate the variance.

  • @u2b83
    @u2b83 a month ago +1

    34:45 This diagram is really cool. The same simple finite state controller is iterating over different data structures. The complexity of the data structures enables the recognition or generation of different formal language classes. The surprise to me is that we can use [essentially] the same state machine to drive it.
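The "same controller, different data structures" point can be shown in a few lines. A hedged toy sketch (the FSM below just checks that symbols alternate a, b, a, ...; all names are mine): the transition table never changes, only the traversal feeding it symbols does.

```python
# A plain finite-state controller as a transition table; anything not
# listed is a dead "reject" state.
TRANS = {("start", "a"): "saw_a", ("saw_a", "b"): "saw_b",
         ("saw_b", "a"): "saw_a"}

def run_fsm(symbols):
    state = "start"
    for sym in symbols:
        state = TRANS.get((state, sym), "reject")
    return state != "reject"

# Driving the controller over a flat list:
assert run_fsm(["a", "b", "a"])

# Driving the *same* controller over an in-order tree traversal,
# where a tree is (left, value, right) or None:
def inorder(tree):
    if tree is None:
        return
    left, value, right = tree
    yield from inorder(left)
    yield value
    yield from inorder(right)

tree = ((None, "a", None), "b", (None, "a", None))
assert run_fsm(inorder(tree))
```

The data structure (list vs tree) supplies the iteration order; the recognizer itself stays a fixed finite-state machine, which is the surprise noted above.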

  • @max0x7ba
    @max0x7ba a month ago

    You don't run an RNN until bit 26 lights up. Rather, you run it until it produces an end-of-input token.

  • @SLAM2977
    @SLAM2977 a month ago +10

    This looks like very early-stage academic research with very low prospects of returns in the near/mid term; surprised that somebody was willing to put their money into it. Very interesting but too academic for a company. All the best to the guys.

    • @alelondon23
      @alelondon23 a month ago +1

      what makes you think the returns are so far? Let me remind you "Attention Is All You Need" was a single paper that triggered all these APPARENT (and probably not scalable) AI capabilities producing real returns.

    • @SLAM2977
      @SLAM2977 a month ago +7

      @@alelondon23 there is no tangible evidence of it being applicable in a way that leads to competitive advantage at the moment, "just" a highly theoretical paper. "Attention Is All You Need" had tangible results that supported the architecture ("On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."), then you can always remind me of a paper nobody cared about and years later was the solution to everything; somebody has to win the lottery...

    • @SLAM2977
      @SLAM2977 a month ago

      Also, frankly, Google can afford to throw money at anything they want, hoping that among the many results some of their research will hit the jackpot.

    • @JumpDiffusion
      @JumpDiffusion a month ago +4

      @@alelondon23 The Attention paper had empirical results/evidence, not just architecture…

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 a month ago +6

      'Early-stage academic research' is a bit kind, imo. This 'let's just slather it in category theory jargon so we can sound smart' thing isn't exactly a new idea.

  • @andreismirnov69
    @andreismirnov69 a month ago +1

    would anyone recommend textbook level publications on category theory and homotopy type theory?

    • @pounchoutz
      @pounchoutz a month ago +2

      Elements of ∞-Category Theory by Emily Riehl and Dominic Verity

  • @MikePaixao
    @MikePaixao a month ago +1

    Too funny. I've been saying transformer models all put infinity in the wrong place 😂
    You can get around finite limit, but not with transformers
    I would describe it like a data compression singularity :)
    28:06 not that hard once you think about it for a bit, you end up with circular quadratic algebra 🙂
    34:54 you can create a Turing machine and get around the von Neumann bottleneck, then you end up somewhere near my non-transformer model 😊

  • @Walczyk
    @Walczyk 28 days ago

    22:48 this is exactly how SQL formed!! The earlier structure of rigid trees stopped being practical once databases grew, and industry moved on fast. This will happen here too, for continued progress

  • @dr.mikeybee
    @dr.mikeybee a month ago

    Semantic space is a model of human experience. Human experience is a real thing. Therefore the semantic space that is learned by masked learning is a model of a model. What intrigues me is that semantic space has a definite shape. This makes learned semantic spaces -- even in different languages -- similar.

  • @cryoshakespeare4465
    @cryoshakespeare4465 a month ago

    Well, I had a great time watching this video, and considering I can abstract my own experiences into the category of human experiences in general, I'd say most people who watched it would enjoy it too. Thankfully, I'm also aware that my abductive capacities exist in the category of error-prone generalisations, and hence I can conclude that it's unlikely all human experiences of this show can be inferred from my own. While my ability to reason about why I typed this comment is limited, I can, at present, formalise it within the framework of human joke-making behaviours, itself in the category of appreciative homage-paying gestures.

  • @shahzodadavlatova7203
    @shahzodadavlatova7203 a month ago

    Can you share the Andrej Karpathy talk?

  • @carlosdumbratzen6332
    @carlosdumbratzen6332 27 days ago

    As someone who only has a passing interest in these issues (because so far LLMs have not proven to be very useful in my field, except for padding papers), this was a very confusing watch.

  • @ariaden
    @ariaden a month ago

    Yeah. Big props to the thumbnail. Maybe I will even watch the video, some time in my future.

  • @dr.mikeybee
    @dr.mikeybee a month ago +1

    We also explicitly create abstractions in transformers. The attention heads are creating new embeddings.

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  a month ago +1

      Some abstractions are better than others. So far, we humans are rather good at creating some which machines can't learn. There are things like in-context learning or "algorithmic prompting" (arxiv.org/pdf/2211.09066.pdf) which explicitly code in certain (limited) types of reasoning in an LLM, like for example, adding numbers together out of distribution. If we could get NNs to learn this type of thing from data, that would be an advancement.

    • @charllsquarra1677
      @charllsquarra1677 a month ago

      @@MachineLearningStreetTalk I'm sure you saw Andrej Karpathy's video about tokenization. TL;DR: tokenization is currently a mess that is swept under the rug; it is very hard for an LLM to properly do math when some multi-digit numbers are single tokens in their corpus

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  a month ago

      I agree, tokenisation could be improved, but I don’t think it’s that big of a thing wrt learning to reason

    • @dr.mikeybee
      @dr.mikeybee a month ago

      @MachineLearningStreetTalk, Yes, we'll keep working on optimizing what we can, including for prompt engineering and injection engineering, I suppose we can view the attention heads as a case of in-context learning as we calculate similarity weights and produce a newly formed calculated context. Of course the projection matrices are also acting as a kind of database retrieval. So here something is learned in the projection matrices that results in many changes to vector values in the context signature. The built (dare I say structured) new embeddings coming out of the attention heads are "decoded" in the MLP blocks for the tasks the MLPs were trained on. Nevertheless, higher level abstractions are being learned in all the differentiable MLP blocks. I don't think that can be denied. All in all, we need to discuss the semantic learning that happens for the embedding model via masked learning. This creates a geometric high-dimensional representation of a semantic space, positional encoding for syntactical agreement, attention heads for calculating similarity scores, projection matrices for information retrieval and filtering, MLPs for hierarchical learning of abstract features and categories, and residual connections for logic filtering. Of course there are many other possible component parts within this symbolic/connectionist hybrid system, since the FFNN is potentially functionally complete, but I think these are the main parts.

    • @lemurpotatoes7988
      @lemurpotatoes7988 a month ago

      More intelligent, structured masking strategies would be extremely helpful IMO. I like thinking about generative music and poetry models in this context. Masking a random one out of n notes or measures doesn't necessarily let you learn all the structure that's there.

  • @mobiusinversion
    @mobiusinversion a month ago +4

    Apologies for the pragmatism, but is this applicable in any realistic engineering driven effort?

    • @oncedidactic
      @oncedidactic a month ago

      Well I think the jumping off point is expressly to envision what else is needed besides further engineering today’s systems, so the overlap might not be satisfactory. But I’d be interested to hear other takes.

    • @mobiusinversion
      @mobiusinversion a month ago

      @@oncedidactic thank you and I understand. I think my question is about assessing the ground truth of the word “needed”. I’m curious where this touches ground with comprehensible needs. What do you mean by needs and what is this addressing?

    • @patrickjdarrow
      @patrickjdarrow 15 days ago +1

      @@mobiusinversion I took it that the work is motivated at least in part by the issues outlined early in the talk: explainability/interpretability, intractable architecture search, instability. These are the issues and the solutions potentially yielded by refounding ML in category theory are the “needs”

    • @mobiusinversion
      @mobiusinversion 13 days ago

      @@patrickjdarrow this sounds like publish-or-perish obligatory elegance. ML models of any reasonable power are non-testable, that’s a fact; there are no QA procedures, only KPIs. Similarly, interpretability should be done at the input and output levels along with human-in-the-loop subjective feedback. Personally, I don’t see AI and category theory going anywhere outside of Oxford.

  • @CristianGarcia
    @CristianGarcia a month ago

    Thanks!
    Vibes from the first 5 mins is that FSD Beta 12 seems to be working extremely well so the bet against this will have a hard time. Eager to watch the rest.

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  a month ago

      I've not looked into it recently. I'm sure it's an incredible feat of engineering and may well work in many well-known situations (much like ChatGPT does). Would you trust it with your life though?

  • @alexforget
    @alexforget a month ago

    For sure you are right. Humans don't need to drive for thousands of hours in each city.
    But if Tesla has the compute and data, they can always add a new algo and also win.

    • @tomaszjezak0
      @tomaszjezak0 a month ago +1

      Regardless, the method is hitting a wall. The next problem will need a better approach, like they talk about

  • @acortis
    @acortis a month ago

    Mhmm... interesting how we can agree on the premises and yet drift apart on the conclusions. I desperately want to be wrong here, but I am afraid my human nature prevents me from trusting anyone who is trying to sell me something for which they do not have the simplest example of an implementation. And here you are going to tell me, "It is their secret sauce, they are not going to tell you that!" ... maybe, and yet I feel like I spent almost two hours of my life listening to a pitch for "Category Theory" whose only implementation is a GOAT, which does not mean the Greatest of All Theories. ... Again, nothing would make me happier than being proved wrong with the most spectacular commercial product of all time!
    ... oh, almost forgot, great job on the part of the hosts, love the sharp questions!

  • @markwrede8878
    @markwrede8878 a month ago

    We need A Mathematical Model of Relational Dialectical Reasoning.

  • @derekpmoore
    @derekpmoore a month ago

    Re: domain specific languages - there are two domains: the problem domain and the solution domain.

  • @Siroitin
    @Siroitin a month ago

    35:45 yeah!

  • @EnesDeumic
    @EnesDeumic a month ago +1

    Interesting. But too many interruptions, let the guest talk more. We know you know, no need to prove it all the time.

  • @dr.mikeybee
    @dr.mikeybee a month ago +1

    FFNNs are functionally complete, but recursive loops are a problem. They can be solved in two ways however. A deep enough NN can unroll the loop. And multiple passes through the LLM with acquired context can do recursive operations. So I would argue that the statement that LLMs can't do recursion is false.

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  a month ago +3

      LLMs can simulate recursion to some fixed size, but not unbounded depth - because they are finite state automata. I would recommend playing back the segment a few times to grok it. Keith added a pinned note to our discord, and we have discussed it there in detail. This is an advanced topic so will take a few passes to understand. Keith's pinned comment below:
      "Traditional recurrent neural networks (RNNs) have a fixed, finite number of memory cells. In theory (assuming bounded range and precision), this limits their formal language recognition power to regular languages [Finite State Automata (FSA)], and in practice, RNNs have been shown to be unable to learn many context-free languages ... Standard recurrent neural networks (RNNs), including simple RNNs (Elman, 1990), GRUs (Cho et al., 2014), and LSTMs (Hochreiter & Schmidhuber, 1997), rely on a fixed, finite number of neurons to remember information across timesteps. When implemented with finite precision, they are theoretically just very large finite automata, restricting the class of formal languages they recognize to regular languages."
      Next, according to Hava Siegelmann herself (who originally "proved" the Turing-completeness of "RNNs"), we have:
      "To construct a Turing-complete RNN, we have to incorporate some encoding for the unbounded number of symbols on the Turing tape. This encoding can be done by: (a) unbounded precision of some neurons, (b) an unbounded number of neurons, or (c) a separate growing memory module."
      Such augmented RNNs are not RNNs, they are augmented RNNs. For example, calling a memory-augmented NN (MANN) an NN would be as silly as calling a Turing machine an FSA because it is just a tape-augmented FSA. That is pure obscurantism, and Siegelmann is guilty of this same silliness depending on the paragraph. Distinguishing the different automata is vital and has practical consequences. Imagine if when Chomsky introduced the Chomsky Hierarchy some heckler in the audience was like "A Turing Machine is just an FSA. A Push Down Automata is just an FSA. All machines are FSAs. We don't need no stinking hierarchy!"
      arxiv.org/pdf/2210.01343.pdf
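The "finite precision implies finite automaton" claim in the quote can be checked mechanically. A hedged sketch with a hand-picked toy RNN (the weights are arbitrary, chosen only for illustration): quantize the hidden state and exhaustively enumerate every state reachable from the start on binary inputs. The reachable set is finite, so the quantized RNN is literally an FSA whose transition table could be read off.

```python
import numpy as np

# Arbitrary fixed weights for a 2-unit RNN with scalar input.
W_h = np.array([[0.5, -0.3], [0.2, 0.7]])
W_x = np.array([[1.0], [-1.0]])

def step(h, x, decimals=2):
    """One RNN update, rounded to fixed precision (the key assumption)."""
    h_new = np.tanh(W_h @ h + W_x * x)
    return tuple(np.round(h_new.flatten(), decimals))

# Breadth-unlimited search over hidden states reachable on inputs {0, 1}.
reachable = {(0.0, 0.0)}
frontier = [(0.0, 0.0)]
while frontier:
    h = frontier.pop()
    for x in (0.0, 1.0):
        nxt = step(np.array(h).reshape(2, 1), x)
        if nxt not in reachable:
            reachable.add(nxt)
            frontier.append(nxt)
# `reachable` is finite: a complete state set for an equivalent FSA.
```

With 2 decimal places and tanh outputs in (-1, 1), at most 201² hidden states exist, so the enumeration must terminate; that bounded state set is exactly what caps the machine at regular languages.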

    • @dr.mikeybee
      @dr.mikeybee a month ago

      @@MachineLearningStreetTalk LOL! You're funny. I love that "We don't need no stinkin'" line. I loved the movie too. Anyway thank you for the very thoughtful response. I was aware of the limitations of finite systems, but I love how you make this explicit -- also that a growable memory can give the appearance of Turing-completeness. That's a keeper. Language is tough. Because I talk to synthetic artifacts every day, I'm very aware of how difficult it is to condense high-dimensional ideas into a representation that conveys intended ideas and context. And of course decoding is just as difficult. Thanks for the additional context injection. Cheers!

  • @samferrer
    @samferrer a month ago

    The real power of category theory is in the way it treats relations, and especially functional relations.
    Objects are not first class anymore but a mere consequence ... hence the power of "yoneda".
    Yet, I don't think there is a programming language that brings the awesomeness of category theory.

  • @lukahead6
    @lukahead6 a month ago

    At 32:27, Paul's brain lights up so brightly you can see it through his skull. Dude's so electric, electrons be changing orbitals, and releasing photons

  • @robmorgan1214
    @robmorgan1214 a month ago

    This focus on abstraction, whether algebraic or geometric, is not the correct approach to this problem. Physicists made the same mistake with geometry.

  • @preston3291
    @preston3291 29 days ago

    chill with the sound effects

  • @Dr.Z.Moravcik-inventor-of-AGI

    Guys, it must be hard for you to talk about AGI that has already been here since 2016.

  • @Walczyk
    @Walczyk 28 days ago

    we need more cutaways with explainers

  • @CharlesVanNoland
    @CharlesVanNoland a month ago

    Also: Tim! Fix the chapter title, it's "Data and Code are one *and* the same". :]

  • @nippur_x3570
    @nippur_x3570 a month ago

    About NNs and Turing completeness: I don't understand why you specifically need read/write memory to have Turing-complete computing. You just need a Turing-complete language like lambda calculus. So I don't see any obstruction for a neural network, with the right framework and the right language (probably using category theory), to do it.

    • @drdca8263
      @drdca8263 28 days ago

      Well, you do need like, unbounded state?
      But I think they are saying more, “an FSM that controls a read/write head is sufficient for Turing completeness”, not “that’s the only way to be Turing complete”?
      To put a NN in there, you do need to put reals/floats in there somewhere I think. Idk where you’d put them in for lambda calculus?
      Like...
      hm.

    • @nippur_x3570
      @nippur_x3570 27 days ago +1

      @@drdca8263 Sorry for the misunderstanding, that's not my point. My point is that the read/write state is probably not the right point of view on the Turing-completeness property. Lambda calculus was just to illustrate my point. You "just" need complete symbolic manipulation in a Turing-complete language for the NN to be Turing complete

    • @drdca8263
      @drdca8263 26 days ago

      @@nippur_x3570 I think the same essential thing should still apply? Like, in any Turing complete model of computation, there should be an analogy to the FSM part.
      The NN component will be a finite thing.
      Possibly it can take input from an unbounded sized part of the state of the computation, but this can always be split into parts of bounded size along with something that does the same thing over a subset of the data,
      and there will be like, some alternation between feeding things into the neural net components and getting outputs, and using those outputs to determine what parts are next to be used as inputs,
      right?

  • @GlobalTheatreSkitsoanalysis

    In addition to number theory... any opinions about Group Theory vs Category Theory? And Set Theory vs Category Theory?

  • @hammerdureason8926
    @hammerdureason8926 26 days ago

    on domain specific languages -- hell is ( understanding ) other people's code where "other people" includes yourself 6 months ago

  • @jondor654
    @jondor654 a month ago

    Is the thread of intuition the pervasive glue that grounds the formalisms?

  • @mootytootyfrooty
    @mootytootyfrooty a month ago

    yo okay here's what's up
    Give me compact ONNX binaries
    but with non-static weights.

  • @HJ-gg6ju
    @HJ-gg6ju 29 days ago

    What's with the distracting sound effects?

  • @lemurpotatoes7988
    @lemurpotatoes7988 a month ago

    I don't see why types are more or less of a problem than values of the same type that are very far or different from one another. Suppose that every piece of data that goes down Branch 1 ends up with its output in a chaotic ugly region of a function and every piece of data that goes down Branch 2 ends up in a nice simple region. You can have a function that handles both cases, yes, but that's the exact same scenario as writing a function that takes in either lists or trees as its input.

    • @lemurpotatoes7988
      @lemurpotatoes7988 a month ago

      I know neither category theory nor functional programming and I didn't grok Abstract Algebra I, I'm just coming at this from an applied math and stats perspective.

  • @kiffeeify
    @kiffeeify 25 days ago

    There is a brilliant (and also quite different) talk, quite relevant to the stuff discussed around 16:30, from one of the Rust core devs; they call it generic effects of functions. I would love to see a language that supports stuff like this :-)
    czcams.com/video/MTnIexTt9Dk/video.html
    One effect would be "this method returns"; others could be "this method allocates", "this method has constant complexity"

  • @jonfe
    @jonfe a month ago

    maybe we should get back to analog to improve our AI.

  • @ICopiedJohnOswald
    @ICopiedJohnOswald a month ago

    The part about if statements was very confused. The guy said that if you have an If expression where both branches return type T then you need to return a union of T and T and that that is not the same as T. This is wrong.
    If you look at the typing rules for boolean elimination (if expressions) you have:
    Gamma |- t1 : Boolean    Gamma |- t2 : T    Gamma |- t3 : T
    -----------------------------------------------------------
                (if t1 then t2 else t3) : T
    In other words, an if statement is well typed if your predicate evaluates to a boolean and both branches return the same type T and this makes the if expression have type T.
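To make the rule concrete, here is a hedged toy checker (a hypothetical mini-AST, not any real language's implementation) that implements exactly the boolean-elimination rule above: the condition must be Bool, both branches must share a type T, and the whole If then has type T, with no union involved.

```python
from dataclasses import dataclass

@dataclass
class Lit:          # a literal with a known type, e.g. Lit(3, "Int")
    value: object
    type: str

@dataclass
class If:           # (if cond then `then` else `else_`)
    cond: object
    then: object
    else_: object

def typecheck(expr):
    if isinstance(expr, Lit):
        return expr.type
    if isinstance(expr, If):
        if typecheck(expr.cond) != "Bool":
            raise TypeError("condition must be Bool")
        t2, t3 = typecheck(expr.then), typecheck(expr.else_)
        if t2 != t3:
            raise TypeError(f"branch types differ: {t2} vs {t3}")
        return t2   # the If itself has type T, not Union[T, T]
    raise TypeError("unknown expression")
```

For example, `typecheck(If(Lit(True, "Bool"), Lit(1, "Int"), Lit(2, "Int")))` yields `"Int"`, while mismatched branches raise a TypeError.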

    • @user-qm4ev6jb7d
      @user-qm4ev6jb7d a month ago

      Agreed, putting it in terms of unions from the outset is rather weird. But I can see how one would arrive at a rule where boolean elimination is always in terms of unions. Specifically, if one is approaching it from the side of a language like Typescript, in which unions are already everywhere. Typescript's types are weird.

    • @ICopiedJohnOswald
      @ICopiedJohnOswald a month ago

      ​@@user-qm4ev6jb7dI can't speak to typescript other then to say yeah that is probably a bad place to get an intuition for type theory, but talking about Union types (NOT Disjoint Union types), isn't it the case that `Union T T = T`? Regardless, you don't need Union types to deal with if expressions.
      I think the interviewer generally had trouble thinking with types and also was conflating type theory and category theory at times.

    • @user-qm4ev6jb7d
      @user-qm4ev6jb7d a month ago

      @@ICopiedJohnOswald Even if we assume actual union types, not disjoint ones, claiming that Union T T *equals* T is a very strong claim. Not all type theories have such a strong notion of equality. Am I correct that you are looking at it specifically from the perspective of *univalent* type theory?

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk  a month ago +1

      Sorry, this is not an area of expertise for us but we hope to make more content and explore it further

    • @ICopiedJohnOswald
      @ICopiedJohnOswald a month ago

      ​@@user-qm4ev6jb7dSorry I was playing fast and loose in youtube comments, disregard that comment. And no I'm not taking the perspective of univalent type theory as I am woefully under read on HoTT.

  • @felicityc
    @felicityc 24 days ago

    You cannot use those sound effects
    Please find a new sound library
    I'm going to flip

  • @explicitlynotboundby
    @explicitlynotboundby a month ago

    Re: "Grothendieck's theory-building for problem solving" (czcams.com/video/rie-9AEhYdY/video.html) reminds me of Rob Pike's Rule 5: "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming."

    • @mbengiepeter965
      @mbengiepeter965 a month ago

      In the context of Large Language Models (LLMs) like me, the analogy of training data as the program and the neural network as the compiler isn't quite accurate. Instead, you might think of the **training data** as the **knowledge base** or **source code**, which provides the raw information and examples that the model learns from. The **neural network**, on the other hand, functions more like the **processor** or **execution environment** that interprets this data to generate responses or perform tasks.
      The **training process** itself could be likened to **compilation**, where the model is trained on the data to create a set of weights and parameters that define how it will respond to inputs. This is a bit like compiling source code into an executable program. However, unlike traditional compilation, this process is not a one-time conversion but an iterative optimization that can continue improving over time.

  • @rylaczero3740
    @rylaczero3740 a month ago

    Imagine imperative programming for a sec, now you know monads.
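The quip is more literal than it looks. A hedged sketch of a State monad in Python (all names mine): each action maps a state to (value, new_state), and `bind` is the semicolon that threads the state through, which is what imperative code does implicitly.

```python
def unit(value):
    """Wrap a plain value as an action that leaves the state untouched."""
    return lambda state: (value, state)

def bind(action, f):
    """Run `action`, feed its value to `f`, run the resulting action."""
    def composed(state):
        value, state2 = action(state)
        return f(value)(state2)
    return composed

# Two primitive actions over an integer state:
get = lambda state: (state, state)      # read the current state
def put(new):
    return lambda state: (None, new)    # overwrite the state

# Imperative "x = get; put(x + 1); return x" written with bind:
program = bind(get, lambda x: bind(put(x + 1), lambda _: unit(x)))
value, final_state = program(41)        # value == 41, final_state == 42
```

The nested lambdas are what do-notation in Haskell hides; the imperative reading is just the monad with the plumbing made invisible.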

  • @Wulk
    @Wulk 26 days ago +2

    Bro can build an AI but doesn't know how to turn off Twitch alerts 💀

  • @womp6338
    @womp6338 25 days ago

    If you guys are so smart why are you vaccinated?

    • @felicityc
      @felicityc 24 days ago

      I'm not a fan of tuberculosis

  • @cryptodax6922
    @cryptodax6922 28 days ago

    33:56 mind-blowing conversation so far

  • @Hans_Magnusson
    @Hans_Magnusson a month ago

    Just the title should scare you

  • @henryvanderspuy3632
    @henryvanderspuy3632 a month ago +1

    this is the way

  • @sabawalid
    @sabawalid a month ago

    Programs are data, but data is NOT a program - the data a NN gets will do nothing without the algorithms of SGD and BackProp.

  • @BuFu1O1
    @BuFu1O1 a month ago +1

    "stop manspreading" 25:00 hahaha

  • @MrStarchild3001
    @MrStarchild3001 Před měsícem +1

    Tesla trying to memorize infinity is a profoundly wrong claim. First off, ML training isn't memorization, it's learning. Second, there's a difference between producing inference results for infinitely many inputs and infinite model capacity. In short, the guy in the first few minutes of this video is arguing from a profoundly wrong premise. It's a strawman.

  • @yondamhokage1977
    @yondamhokage1977 a month ago

    use clojure