>> I'm Daan and I'm honored to host today Conal Elliott,
a former MSR colleague and a good friend of mine.
Conal used to work here at Microsoft Research until 2002,
and now he's a distinguished scientist at Target.
Back when he was here at Microsoft Research,
I was still a student, and I remember being so inspired
by his beautiful work on functional reactive animations;
at that time it was really inspirational to me.
One thing that Conal is great at
is taking something really complicated.
At that time, with animations, you would always have
matrices and frames and timers,
and he rephrased it in a beautiful way,
capturing the essence of animations as functions of time.
I think with this work he applies the same trick again,
but now to differentiation and machine learning,
which is always about all these complicated things
where you almost don't see the essence,
and this talk will hopefully shed some light on it.
So, thanks a lot, Conal.
>> Thanks, Daan. It's great to be here.
Yes, I was at Microsoft Research
in the graphics group from
'94 to 2002, I had a good time.
So, this is my first time back giving a talk.
So, anyway, we're going to talk about
automatic differentiation which has
to do roughly with computing derivatives automatically.
So, a good place to start
would be a question, what's a derivative?
All right, this is like the heart of the matter.
If we don't know what this means,
it's kind of hopeless to have
any clean, rigorous, illuminating story
about automatic differentiation.
So, if you had an education like mine,
you probably learned something like this:
I took calculus one in high school, and I learned that
the derivative of a function at a point is a number,
and that number has to do with the ratio of how fast
the output of the function is growing
relative to the input at that point.
That's the calculus one perspective.
But a little later, turns out well
it might not be a number, it might be a vector.
So, if my function yields a higher dimensional result,
not a scalar, not just reals,
we can get a vector out.
But the function could instead consume
a higher-dimensional input, in which case
we get a different kind of vector.
These two are often not
distinguished, which leads to all kinds of
confusions and bugs in
software, but they're importantly different.
They look alike but they mean
radically different things: vector versus co-vector.
Then, a little later,
if we have a higher-dimensional domain and co-domain,
input and output of the function,
then we're going to have to have
a matrix of partial derivatives
which is sometimes called the Jacobian matrix.
So, maybe we've arrived at the fully general story here,
a higher-dimensional domain and co-domain, but,
well, there are higher derivatives.
A higher derivative is
a derivative of a derivative, and so on.
So, a higher derivative is going to be even
higher dimensional still.
For instance, you get these kinds of arbitrarily
high-dimensional tensors.
So, anyway, and then,
we have all these different notions of
derivatives and every
one of those notions has its own chain rule,
they're all different from each other.
Of course, when you see this many variations,
there are probably going to be other variations as well.
So, we get a complicated incomplete story,
and this isn't going to go well for a calculus student,
it's not going to go well for
computer scientists or software engineers.
So, we need a better story. So, let's start over.
Instead of saying that the derivative of a function is
a number, a vector, a co-vector,
a matrix, or some kind of higher-dimensional beast,
there's one general, beautiful, unifying story,
which is that the derivative is a linear transformation.
That's the general story.
So, I'm going to talk about D. D is the differentiator.
So, differentiation is a function,
and I'm going to show you some notation.
If it's unfamiliar or you're confused,
please interrupt me and ask
questions, about this or anything else.
I enjoy interaction during the talk,
and there'll be time for questions at the end as well.
So, this is a bit of Haskell notation;
it means that D has the following type.
So, differentiation consumes a function from a to b,
and it yields a function from a to
linear transformations from a to b.
That's the general notion of a derivative.
All of these other notions are
representations of linear maps.
All of the chain rules are
implementations of the same thing,
which I'll show on the next slide.
So, here's the general definition of a derivative.
We're going to take a function f at a,
and we're going to perturb the input a little bit
and sample f, and we're also going to take
the value at a and perturb it a little bit linearly.
That perturbation is D f a:
it's going to be a linear transformation,
and we feed epsilon into that linear transformation.
So, this is the general definition of the derivative.
It works in all dimensions,
and it works for things other than
the usual vector kinds of things.
They just need to be vector spaces,
and a little bit more: limits have to exist.
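As a formula, this is essentially the Fréchet derivative; a reconstruction of the definition being described, assuming a norm on the spaces involved: D f a is the linear map satisfying

    \lim_{\varepsilon \to 0}
      \frac{\lVert f\,(a + \varepsilon) - (f\,a + \mathcal{D}\,f\,a\ \varepsilon) \rVert}
           {\lVert \varepsilon \rVert} = 0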
So, I learned this perspective as an undergrad
from a book called Calculus on Manifolds,
Michael Spivak, 1965.
It's a beautiful little gem of a book.
It gets right to the heart
of what differentiation is about.
Now, as I mentioned there were
all these different representations of derivatives,
each of them has its own chain rule.
But really those chain rules all mean the same thing.
They all have some form of multiplication:
you can multiply two scalars together;
you can multiply two vectors
in one sense and you get a
dot product, or in a different sense
and you get what's called the outer product,
so inner and outer products;
or if they're matrices, you get matrix products.
But all of those notions of products
represent the same thing which is
composition of linear maps.
So, the one unifying
general chain rule is the following: it says,
the derivative of a composition of two functions,
g compose f, at a is, well,
we're going to take the derivative of f at a,
and the derivative of g at f of a,
and we're going to compose them.
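Written out as an equation, a transcription of the rule being described:

    \mathcal{D}\,(g \circ f)\,a \;=\; \mathcal{D}\,g\,(f\,a) \circ \mathcal{D}\,f\,a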
So, normally you would see instead of composition,
you'd see some kind of multiplication here.
But the multiplication is there because
it correctly implements the composition.
But the real one single story is composition. Yes.
>> Could you go back one bit, to when
you showed the linear function?
>> Yes.
>> You're saying that,
normally, what you'd get is, or how we'd implement
a linear map, is a matrix
or some other representation of that function.
>> Yes, exactly. So, abstractly
we have a linear transformation from a to b;
concretely, there might be
a matrix that represents the linear transformation.
But the important thing is, if you think in terms of
representations: is that a matrix?
Is that a vector? Is it this kind of vector
or the other kind of vector, and so on?
Normally, linear algebra packages,
software libraries, will have
some matrix package, and it has
all these unsafe operations:
you multiply two matrices
and, well, they have to conform,
the width of one has to be the
same as the height of the other and so on,
and the type systems usually don't give that to you.
But if you say linear transformations,
the type system tracks the domain and co-domain.
You cannot have a type correct program
that tries to multiply matrices that don't fit.
Moreover, by talking about
it abstractly, and this is really key to this talk,
we can innovate and use different representations,
not a matrix representation, because
the matrix representation actually turns
out to be a bad idea, even
though it's just about universally adopted.
So, this form
of composition is sequential composition.
But there's another kind which is parallel composition.
So, if you have two functions and
the output of one
matches the input of the other you
can sequentially compose them.
But there's another form, which is: you
have two functions and they have
the same domain, the same input type.
So, one takes you from a to c,
the other takes you from a to d. You can fuse
those functions together into one that
gives you the pair of results.
It's a very simple definition; we call this fork.
So, the fork of f and g, when you apply it to a,
you give a to each of the functions
and you pair up the results.
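As a sketch in Haskell, this is the (&&&) combinator from Control.Arrow, specialized to ordinary functions (the paper writes it as a triangle); I'll call it fork here:

    fork :: (a -> c) -> (a -> d) -> (a -> (c, d))
    fork f g a = (f a, g a)   -- give the argument to both and pair the results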
So, logically, this is parallel.
When I say parallel, I don't mean operationally,
I don't mean that runs in parallel.
It's a good idea for it to run in
parallel but that's a separate issue.
I mean semantically, there's no data dependencies.
Whereas sequential composition is
all about data dependencies,
parallel composition is free of them.
So, these two forms of composition are both important.
It's often said that differentiation, especially
automatic or symbolic differentiation,
that these
methods are all about the chain rule.
They're not all about the chain rule;
the chain rule is only about sequential composition.
There are other, clumsier ways in which they
typically talk about parallel composition.
But we can talk about both really elegantly
in terms of this one notion,
and we get a lot of mileage out of
making this kind of single general definition.
Going back to sequential composition, the chain rule,
which you probably learned in
high school, except here in this generalized form.
So, there's a problem with the chain rule
if we're thinking about how to implement something.
We want to do an implementation, we want to
be able to reason about it,
we want it to be correct
and to optimize it, all that kind of thing.
Well, it's a huge help
for the algorithm to be compositional.
What I mean by compositional
is: whatever operation I'm performing,
like differentiation of a composition,
I form it by taking the derivative of each of
the two functions and then somehow combining them.
All right? Well, the chain rule doesn't work that way.
The chain rule isn't compositional:
if you look at the definition here,
the derivative of a composition,
well, it involves the derivatives of f and g,
but it also involves f itself.
So, this is going to be kind of awkward to
work with in terms of
reasoning and implementing compositionally.
But there's a simple fix.
The fix is: instead of computing just the derivative,
so starting with this function from a to b,
we give back a function from a
that not only computes the derivative
but also the original value, and that's it.
This enhanced specification is compositional,
because when you write out the implications of it,
computing the function and its derivative for g compose f
involves just f and g and their derivatives.
>> So you mean the b on the right side is
the same as the b computed above the function?
>> Exactly.
>> The left-hand side?
>> Exactly. The left-hand side?
>> Both that function.
So, taking g and b,
is that b the same b?
>> Yes.
>> Okay.
>> Exactly. So all we're doing,
and here's the definition, the specification,
is I'm using this fork operator,
though I didn't need to.
It says that this enhanced derivative of
f is f itself forked with
the derivative of f. That's all;
just don't forget f, that's all this is saying.
This one fix makes differentiation compositional,
and automatic differentiation does this.
You'll find this in any automatic differentiation,
because you have to, to make it compositional. All right.
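As a sketch in Haskell, with names I'm choosing to match the talk, 'der' for the mathematical derivative and plain functions standing in for linear maps, the enhanced specification is roughly:

    type LinMap a b = a -> b        -- placeholder: a linear map viewed as a function

    der :: (a -> b) -> (a -> LinMap a b)
    der = undefined                 -- the mathematical derivative; no general algorithm

    dHat :: (a -> b) -> (a -> (b, LinMap a b))
    dHat f a = (f a, der f a)       -- "don't forget f": value paired with derivative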
So, there's one other really
lovely result from this perspective,
this calculus-on-manifolds view,
of derivatives as linear maps,
and that's the following: the derivative of
a linear function is itself, everywhere.
That may sound weird.
But if you think about what derivatives really mean,
derivatives are all about linear approximations.
So, when you have the derivative of a function
at a point, what you're saying is,
well, we know what the value of the function is,
but we also know the first
derivative, which is like the tangent at that point;
that tangent is linear
and it's a local linear approximation.
At that point, it's exactly correct.
As it moves away from the point,
it's slightly incorrect and further you get
away it's more and more incorrect.
If you want to get a closer approximation
you could use a first and second derivative,
then you get some kind of local quadratic approximation.
It's going to be a much better approximation and so on.
But for differentiation, first derivatives,
it's all about
making linear approximations.
So, when I say the derivative of
a linear function is itself everywhere,
all I'm saying is that linear functions are
their own perfect linear approximations.
It's really saying almost nothing, and yet it's not completely obvious.
Now, this is something you know without knowing it.
If you say like what's the derivative
of x with respect to x,
you'd say it's one,
or 2x with respect to x, you say it's two.
But I just told you
the derivative is the function itself.
That's because when we say the derivative of
the function x goes to x is one,
it's not really one; it's the linear transformation
that multiplies by one,
which is the same as the identity function.
So, the derivative of the identity is the identity,
that's the general way of saying it.
Sometimes it looks different: if
you have scalar functions you might say it's one,
and if you have non-scalar functions
you might say it's the identity matrix.
The identity matrix, the
matrix with ones on the diagonal and
zeros elsewhere, is just a
representation of the identity function.
So, similarly, first and second,
these are functions that take a pair
and give you the first or second element.
Those are linear also, so those functions are
their own derivatives, also everywhere.
In general, if you take the enhanced derivative
of a linear function f at a,
you get f of a and f itself;
that's the derivative. All right.
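In symbols, the claim is just:

    f \text{ linear} \;\Rightarrow\; \mathcal{D}\,f\,a = f \ \text{ for all } a,
    \quad\text{e.g.}\quad \mathcal{D}\,\mathrm{id}\,a = \mathrm{id},\;\;
    \mathcal{D}\,\mathrm{fst}\,a = \mathrm{fst},\;\;
    \mathcal{D}\,\mathrm{snd}\,a = \mathrm{snd}.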
So, now I've really shown you
automatic differentiation, it's all here.
You take primitive operations: addition,
multiplication, sine, cosine and so on,
and we know what their derivatives are;
just look them up, or remember them,
or prove them analytically.
Then, you start composing things in
parallel and in sequence.
Then, we have all the linear functions,
which are the easy case.
That's all there is to automatic differentiation.
So, if you read the literature,
especially on what's called reverse mode,
which is what's used in machine learning,
it's extremely complex.
But it's because it has all this historical baggage.
It's because it was invented a
long time ago in the '50s or '60s,
and then reinvented by Machine Learning people,
rediscovered in terms of
something called back propagation,
but it didn't come from this kind of heritage of
crisp mathematical specifications
and rigorous development.
So, if you get rid of
all the kind of historical accidental baggage,
graphs, and variables, and layers,
and things like that, the essence
really is just what I've shown you,
and we can now put these pieces together
into automatic differentiation, and then we're done.
Well, not really, because I'm going to
show you some improvements.
So, what I want to talk about here now
is a nice way to package up this information.
So, I said we have linear functions,
and there'll be some primitives,
sine, cosine and so on, and
then sequential and parallel composition.
Now, those are the basic building blocks,
and those building blocks have
a mathematical history to them.
We can take those operations,
identity and composition;
there's an algebraic notion called a category.
I'm not expecting anybody to have a category theory background;
it's just the idea that I've got
some thing which you can think of
as a function-like thing.
It means there's a domain and co-domain.
It doesn't really have to be a function-like
thing, but you can think of it that way.
So, that's going to be this guy, so this
will be something like functions,
and it has two things, identity and composition.
The identity maps something to itself;
composition takes something from b to c
and something from a to b, and gives you an a to c. Okay.
So, this is a generalization of the idea
of the identity function and composition on functions.
There are certain laws; this is algebra,
so there are certain laws, which here are
just that composition is associative,
and the identity is the right and left
identity for composition. That's all.
So, anything that
has these kinds of operations and
satisfies the laws is a category;
that's what category means.
But now, there are some additional operations.
So, in addition to category,
there is this more specialized notion
of a Cartesian category,
and all that means is it has a notion
of products, of pairs.
There are three operations: one is extract left,
the other is extract right;
that's what I was calling first and second before.
So, given a pair, you can get the first element;
given a pair, you can get the second element;
and then there's fork, which takes two of
these funny arrows with the same domain,
and you get another one with
that domain but pairs as the result.
So, I've talked about
these operations on functions and I'm just saying,
there's actually a wider context,
not just mathematical functions or
computable functions as we use them
in computer science and software engineering.
This will be really important,
because automatic differentiation is
all about one of these things
that's not exactly a function,
it's a differentiable function:
a function that computes with derivatives as well.
Then, there'll be some other interfaces.
So, this is Haskell-looking stuff,
but I hope you get the idea.
It's talking about two different interfaces.
One specializes the other,
so it makes additional restrictions. They have laws.
The laws here say things like: if you take two of
these arrows, you combine them with fork,
and then you compose with, say,
extract left, then you get back the original one,
exactly. I think nothing here is surprising at all.
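A minimal Haskell sketch of the two interfaces, close to the paper's (I write fork as (&&&); the paper uses a triangle):

    import Prelude hiding (id, (.))

    class Category k where
      id    :: k a a
      (.)   :: k b c -> k a b -> k a c

    class Category k => Cartesian k where
      exl   :: k (a, b) a
      exr   :: k (a, b) b
      (&&&) :: k a c -> k a d -> k a (c, d)       -- fork

    -- Ordinary functions are one instance:
    instance Category (->) where
      id      = \a -> a
      g . f   = \a -> g (f a)

    instance Cartesian (->) where
      exl         = fst
      exr         = snd
      (f &&& g) a = (f a, g a)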
So, now I've shown
you all the building blocks of automatic differentiation,
and then I've shown you an interface,
which is a common interface, not
specific to automatic differentiation.
It's one that came out of algebra,
in the '50s I think, the 1950s that is.
Now, I'm just going to put all the pieces together,
and this is a Haskell program;
if you don't understand what something means,
please interrupt. All right.
So, we're going to go back to our specification
and I've tweaked it a little bit.
So, I said we're going to define this type.
It's a type of differentiable functions from a to b.
So, that's D a b, and this is Haskell.
It just says that one of these guys,
a differentiable function from a to b,
has this representation:
it's a function from a to a pair of
b and a linear map from a to b.
Exactly what I showed you before,
but now I've made this
little D thing here, which is just a wrapper.
It's just something that says
the type of this thing is
not the underlying representation;
it's a more abstract notion.
Now, here's our specification.
So, this d hat,
we're going to take a function from a to
b and give you a D from a to b.
What does it do? Just what it did before:
it takes f and its derivative,
combines them into one,
but now we wrap it up with a D. So,
there's just a little bit of
programming-language wrapping.
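In Haskell, the representation and the specification look roughly like this, a sketch again using plain functions in place of linear maps and the earlier 'der' for the mathematical derivative:

    newtype D a b = D (a -> (b, a -> b))

    dHat :: (a -> b) -> D a b
    dHat f = D (\a -> (f a, der f a))   -- specification only; 'der' is not computable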
I want to emphasize here,
the specification is not computable.
So, given an arbitrary computable function,
even if it's continuous,
even if it's differentiable in a mathematical sense,
it's not computable to construct its derivative.
So, that's not a shallow result.
So, one of the things that I really like about this,
what I'm sharing with you today,
is that we start with an extremely simple,
compelling specification.
I'm saying we just take the function and its derivative,
mathematical derivative, not computable derivative.
Simple specification, and then,
we're going to go through a process of, well,
I'll show you the outline of it,
you can read the paper for details,
of transforming the specification into an implementation.
So, we start with something that's not computable,
very simple and compelling,
and we end up with something that's efficiently computable,
and not only that, but the transformation
itself is automatable.
So, here's the specification.
So, what we're going to do is automatic differentiation.
I'm going to organize the code
as instances of these classes,
category, and Cartesian, and a few other things for sine,
and cosine, and so on.
The specification is to say that
this non-computable function D hat
preserves the category and
Cartesian structure. Here's what I mean.
It's just these five equations.
Two from the category interface
and three from the Cartesian interface.
Cartesian is just about products;
category is just about identity and composition.
So, what we're saying is that this D hat
of the identity is the identity.
D hat of g compose f is
the composition of the D hats of g and
f. The left extractor goes
to the left extractor in
the other category; that's the important thing.
On the right-hand side, we see this id function;
that id is on differentiable functions, on D.
This composition is on functions,
that one is on differentiable functions, on D, and so on.
So, the D hat of the fork
is the fork of the D hats, and so on.
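Spelled out, the five equations, written Haskell-style with the names sketched earlier; these are laws, not executable definitions, and the operations on the left are on ordinary functions while those on the right are on D:

    dHat id        == id
    dHat (g . f)   == dHat g . dHat f
    dHat exl       == exl
    dHat exr       == exr
    dHat (f &&& g) == dHat f &&& dHat g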
So, this is an extremely simple specification
of a very compelling form.
What that means is, you can really
think about differentiable functions
just in terms of the specification.
Any intuition you
form about what it means when I'm composing
these two things is captured
precisely and elegantly by these five requirements.
So, these requirements are going to be the specification,
and then the trick is to derive from them
a correct implementation that's
computable, the actual implementation.
That's the game. Yes.
>> So, [inaudible] is a functor.
>> Yes.
>> It's a functor?
>> Yes it is a Cartesian functor.
>> Okay.
>> Yes. So, more generally it's a homomorphism.
So, if we have these abstract notions,
for instance, category and Cartesian,
they have a vocabulary,
identity, compose, extract left and right, fork.
What we're saying is that D is
compositional over that structure
or distributes over that structure.
It's a homomorphism over that structure.
It's also called a functor, Cartesian functor.
>> [inaudible] is this a to a pair of a
and a linear map on a,
which is your definition of the identity on the right-hand side?
>> Yes. So, this is the regular identity function,
and that's the differentiable identity function, which
means it knows how to compute
a value and its derivative.
>> Yes.
>> Yes, and so, on all of them are like that.
This is the general game,
we'll play this game over and over,
so it's important to
pause and appreciate what's going on.
A simple specification, very regular:
a simple definition of a function,
and a specification that's always of this form,
namely that this simple, compelling definition
distributes over the structure, or preserves
the structure; it's a functor, in category language.
So, if we take these five equations and we solve them,
it's an algebra problem.
So, we solve these five equations we get this result.
Now, how do we get this? The derivation
is in a paper called
The Simple Essence of Automatic Differentiation.
It really just follows from the chain rule,
the fork rule, which I haven't shown you,
and the linearity property,
the property of linear functions
that they are their own derivatives everywhere,
and then a few special things:
we know the derivative of negation, addition and so on.
So, solve those equations and you get this form.
I'm just going to show you what this looks
like operationally.
So, for instance, this
is the one that says: if f is a linear function,
the differentiable form, the D,
is a function that takes a and gives you f of a and then
f itself. I could have written that differently as
just f fork const f, where const f is the thing that,
whatever argument you give it, gives you f back.
So, for instance, if you look at how composition works,
there's really two things in here because the meaning of
the specification D hat is we're going to have
the original function and
then simultaneously compute its derivative,
so that's exactly what we do here.
So, we compose the differentiable versions of
g and f. We apply the resulting differentiable function at a.
Well, what do we do at a?
We take f of a and we get its value b and its derivative.
Remember, each of these guys computes a function value and
its derivative, for f and for g. Then,
we take b and put it into g,
and we get c. That's going to be
the final result of the function,
but we also get a derivative there.
Here we are using composition of the functions
for the result, and then we compose
the derivatives, and this is the chain rule.
So, I just took the chain rule,
the definition, did inlining,
wove the code together.
Then fork is quite
similar, but there are no data dependencies,
so the two derivatives and
values are computed completely
independently, and then combined.
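A sketch of the solved instances in Haskell, using the D and dHat sketched above; since derivatives are represented as functions here, composing and forking them uses the function instances:

    linearD :: (a -> b) -> D a b
    linearD f = D (\a -> (f a, f))        -- a linear function is its own derivative

    instance Category D where
      id = linearD id
      D g . D f = D (\a -> let (b, f') = f a
                               (c, g') = g b
                           in  (c, g' . f'))       -- chain rule: compose derivatives

    instance Cartesian D where
      exl = linearD exl
      exr = linearD exr
      D f &&& D g = D (\a -> let (b, f') = f a
                                 (c, g') = g a
                             in  ((b, c), f' &&& g'))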
Then, for the basic operations,
negation and addition are linear,
so we can just use the general linear case,
as we did for identity and the extract left and right;
that's why those ones were simple.
Multiplication is not linear,
so there's a little more here.
For multiplication, well, we're
going to take multiplication itself as the
operation, but we also need its derivative.
So, the multiplication function at the argument a and b,
as well as multiplying a by b,
is going to give you a linear map,
and that linear map is going to
take a delta a and delta b,
and then it's going to scale b by
delta a and a by delta b,
and add them. This is what's called the Leibniz rule,
the product rule for differentiation.
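For the numeric primitives, a sketch using the linearD above; the paper packages these as class instances, here I just write standalone differentiable values:

    negateD :: Num a => D a a
    negateD = linearD negate                  -- negation is linear

    addD :: Num a => D (a, a) a
    addD = linearD (uncurry (+))              -- addition is linear

    mulD :: Num a => D (a, a) a
    mulD = D (\(a, b) -> (a * b, \(da, db) -> b * da + a * db))
      -- Leibniz rule: the derivative at (a, b) is the linear map
      -- (da, db) |-> b*da + a*db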
So, that's really, that's
it, that's automatic differentiation.
>> Where do you get the terms
da and db from in this?
>> So, think about linear maps.
There are different ways to think about linear maps,
and that's kind of the heart of the story, which we'll get to.
A simple way to think about linear maps
is that they're just functions that happen to be linear.
So, the derivative at
a point (a, b) is going to give
you a function that is linear.
That function is a function of delta a and delta b.
So, this whole thing is the linear map.
If you look, yes, it's linear in da and db.
>> Yes, I am
thinking how you would normally represent this.
>> Yes.
>> Like it would be a matrix, basically,
filling in all the possible values in there, yes.
>> Yes, so there are different ways
to represent linear maps
and we're going to talk about them.
There are some traditional ways and
some non-traditional ways that work really well.
>> So, you gave instantiations
that satisfy the equations from the previous slide?
>> So, you're asking about this?
>> So, these instances satisfy the equations?
>> Exactly. So, what
was provided here was these definitions,
these definitions organized as instances;
they satisfy the specification,
they are a solution to these equations, yes.
>> But if you have
polymorphic functions, are these solutions somehow unique?
>> Oh, you're asking whether there are
maybe some deep type-theoretic
or parametricity properties.
>> Free theorems that say, okay, there's
only one function that-
>> So, the remark is: sometimes,
if we're doing pure functional programming,
strongly typed, but more than that,
in a parametrically polymorphically typed setting,
there are certain inferences you can make that are
valid and that depend only on the type information.
Whether this is one of them, whether these could have
been automatically determined just based on the types,
I don't know. They are
not fully polymorphic, because
they require vector space structure,
and so this property, which is called parametricity,
a deep property,
is not as evidently applicable,
but that doesn't mean it's not applicable.
>> So, it doesn't apply to the lollipop?
Is that what you're saying?
>> It doesn't apply to what?
>> The lollipop.
>> Oh, yes.
The lollipop, the linear map type,
is not a fully general construction; it requires a notion of
vector spaces, or at least modules,
let's call that a generalization.
So, sometimes in the presence
of certain algebraic constraints,
there are still interesting free theorems.
I have no idea whether there are free theorems
that you know that relate to automatic differentiation.
Even to the extent that you're suggesting,
which is maybe this whole thing is a free theorem.
That would be very cool, and I really don't know.
There's something that's to me very
compelling and inevitable about this development,
and that may be a clue that
there's actually a free theorem.
Maybe it's easier, in a sense, than
I think it is; simpler.
All right, now I want to show you some examples,
and then we're going to get into
thinking about the performance characteristics
of this simple algorithm,
and how we can improve them.
So, I started out with
these three simple running examples.
This is Haskell code.
I'll talk you through it. If I
don't explain something that
you'd like explained, please ask.
So, here is a squaring function.
So, we're saying this squaring function,
square of a is a times a,
just defining this function.
Well, its type is: it goes from some type a to a.
Sorry, this is a common pun we do in
Haskell programming: we name
the variables and the types the same things.
That may be a little confusing,
and a just has to be numeric for this to make sense.
So, it could be ints and floats and doubles
and other things, complex numbers,
and maybe higher-dimensional things,
even functions and so on.
All right, but you can just think of this as being
a real number, a here.
A slightly more
sophisticated function is magnitude squared.
So, we have a pair, a cross a,
where this is pairing, and magnitude squared
is just the sum of the squares of
the components. So what do we do?
We take the pair a, b and we
square a and square b and add them.
This is a little bit less trivial example,
and here's another one, which is
the cosine and sine of a product of two numbers.
So, given x, y we're going to compute the sine and
the cosine of z, and what is z?
It's the product of x and y.
Okay? Now, why these three examples? Because they
get at scalar versus
non-scalar in different interesting ways.
This is a purely scalar function.
This one takes a non-scalar, so it has a non-scalar domain.
This one has a non-scalar domain and codomain,
both, okay? All right.
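The three examples, roughly as they appear on the slides and in the paper:

    sqr :: Num a => a -> a
    sqr a = a * a

    magSqr :: Num a => (a, a) -> a
    magSqr (a, b) = sqr a + sqr b

    cosSinProd :: Floating a => (a, a) -> (a, a)
    cosSinProd (x, y) = (cos z, sin z)  where z = x * y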
Now, we can take these expressions, these simple,
pure functional programs,
and rewrite them in a categorical vocabulary.
Okay, well, this is a big step
and it's kind of an important step.
You may or may not be familiar with
this, but there's a deep,
deep connection among the foundations of mathematics,
particularly this approach via category theory on
the one hand, and logic,
and not just one logic, but logics in general,
and, another way to think of it, computation.
Okay? At heart, computation,
logic, and the foundations of
mathematics are all the same thing.
Okay? But that connection, particularly between
computation and this foundational approach to
mathematics, category theory, is,
well, it's not so obvious in mainstream languages.
So, if you have an imperative language like
C# or Java or, I don't know,
JavaScript, or worse, one that doesn't even
have static typing, that kind of thing,
well, this connection to the essence
of mathematics, and category theory in
particular, is quite obscured.
It's there, it's just hard to see in all the noise.
But if you take a very elegant, pure, simple,
rigorous notion of programming,
namely the typed Lambda calculus,
then there's a very clear connection between the two.
Okay? So, I'm not saying that only functional programming
corresponds to logic and category theory,
I'm just saying that in that setting
it's most clearly evident, okay?
Functional programming is equally expressive
to imperative programming,
object-oriented, whatever flavor you want.
Okay? But this connection is
quite simple in this setting of
a typed lambda calculus, and this
is just a sugaring of the typed lambda calculus.
Haskell is just the typed lambda calculus, sugared.
It's a little more than that, but it's essentially that.
So, there's a result from 1980 by Joachim Lambek,
which says that if you take the typed lambda calculus,
there are different models of it,
different interpretations of it.
Okay? The one that I'm used to, because I've been
doing computer programming for a while,
is computable functions.
So, when I see this code,
lambda calculus, I think it means computable functions.
But what Lambek showed in 1980 is that
that's just one interpretation.
There are a lot of interpretations
that are consistent
with the laws of the lambda calculus,
and they're exactly the models that
have a certain mathematical property,
namely, they are the Cartesian closed categories.
Okay? So, I haven't said what closed means,
closed means you have first-class functions.
Okay? Categories means you have identity and composition,
Cartesian means you have
the first and second projections,
the left and right projections, and this fork operation.
Closed means you have a notion of first-class functions.
Anything that has those,
that satisfies those interfaces,
and I'm only showing you what two of them look like,
and satisfies the laws,
every one of those things is
a consistent model of the typed lambda calculus.
So, the way you can think of that is,
every one of those is
a valid sensible interpretation
of a purely functional language like Haskell.
Moreover, we know how to automatically convert
the Lambda calculus into this language.
So, the idea
here is that we're going to write in Haskell,
and then automatically, so not the programmer,
the compiler automatically translates it to this vocabulary.
Then, instead of interpreting
this vocabulary in the usual sense,
computable functions, I want to
interpret it in some unusual sense.
In our case, that's differentiable functions.
That's the game.
>> In the transformation
from Haskell to category form,
it looks like you've inlined
the square function in there.
Do you need to do that in
the Haskell-to-categories transformation and
break module boundaries, or?
>> No. So, the question is, well, first of all,
it's an observation that when I translated square here,
I didn't just use square,
which I could have. This is
purely just where the implementation is at this point.
It's a little awkward at this point for me
to preserve that modularity.
But there's nothing.
>> There's no deep reason.
>> There's nothing, yes.
So, at some step, this would
look like, sorry, it would be
square composed with exl,
and that would be square composed with exr.
That'll be more readable and more modular, yes.
Just not quite there yet.
All right, so for squaring, what do we do? Well,
we take the identity function and itself,
fork them together, and that gives you a function
where you give it a and it gives you a and itself.
It just duplicates, so that's a duplicating function.
Then, compose that with multiplication.
So, you take a value, duplicate it,
and then you take the product of
that pair; that's squaring.
Okay, what do we do here? Well,
we take the left extractor,
which gives us x out of x, y,
and we fork it with itself so we get x, x, right?
We multiply that, so that's x squared,
this is y squared, we do
those in parallel, and then we add the results.
You see this is composed,
so you read from right to left.
Cosine-sine-of-product is even simpler: you take the product,
and then, in parallel, the cosine and the sine.
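The translations, as a sketch using the class vocabulary from earlier; I use the paper's names mulC and addC for the uncurried categorical multiplication and addition, which on ordinary functions are just:

    mulC, addC :: Num a => (a, a) -> a
    mulC = uncurry (*)
    addC = uncurry (+)

    sqrC :: Num a => a -> a
    sqrC = mulC . (id &&& id)

    magSqrC :: Num a => (a, a) -> a
    magSqrC = addC . (mulC . (exl &&& exl) &&& mulC . (exr &&& exr))

    cosSinProdC :: Floating a => (a, a) -> (a, a)
    cosSinProdC = (cos &&& sin) . mulC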
So, what I'm saying is,
we already know how to take
derivatives if we write our programs in this vocabulary,
and that's what the previous slide shows.
So, if we take those definitions
of the operations like add,
compose, fork, and so on,
and we use those interpretations here,
we'll get versions of
these functions that compute not just the regular values,
but their derivatives as well, all right.
But this is a terrible language
for a person to have to write in.
It's a fine language for a compiler,
or for a mathematician to reason about, but it's
a terrible language for a programmer to write.
Okay, so fortunately we have
this automatic translation from
a kind of human-friendly language,
maybe that's debatable, to a mathematically
austere and easy-to-reason-about language.
So, now I want to show you some examples of
this interpretation applied
for automatic differentiation.
So, here again is the magnitude squared
function, the sum of the squares,
and here's its categorical translation;
I just mean this is exactly the same thing but rephrased,
and this step is automatic.
Now, if we interpret it,
what have we done here?
Oh yes, so this isn't differentiation,
this is just the function itself.
So, this is what the category expression looks like,
this function we can visualize in terms of a graph,
and this is what it looks like.
So, this kind of emphasizes the parallel nature
of fork here and that's this fork right there.
Okay, so what have we got?
Here's the extract left,
and it's getting fed in,
both copies, to multiply, that's this one.
Also, the right extractor gets fed
into another multiply, these two in parallel;
they take the same argument and
their results get fed into addition.
So, what do I want to say here?
Oh yes, all right, now in fact
there are three different categories going on here.
Okay, one category is the one
that we just apply unconsciously,
at least I do, which is just
to think that this means functions,
plain mathematical functions or computable functions;
that's the usual one we have in programming.
This display here, this categorical expression,
the expression in terms of this funny category language,
is another category itself.
So, I have a category that
interprets composition,
multiplication, and all the operations
in a syntactic way, and then pretty-prints the result;
okay, that's just for my convenience.
Another category builds a graph,
okay, and then I display that graph here.
Then another category is the derivative one,
the differentiable one which I'm going to show you next.
Okay, but important thing I want to say here is,
automatic differentiation is almost
always talked about in terms of graphs,
especially in Machine Learning,
especially in reverse mode.
I think that's always a mistake.
Okay, because differentiation has
nothing to do with graphs.
Differentiation, automatic or otherwise,
is about functions;
there are no graphs there.
So, think about any domain of interest.
Anytime you're doing programming, you've got, well,
some compositional thing
going on, something algebraic,
something rather expressive,
and of course we want our APIs to be expressive;
it means you use them multiple times,
you feed the output of one operation into
the input of another.
Every API you can think of as being about graphs;
arithmetic is an API, okay?
Addition, multiplication, and literal numbers
form an API that you learned as a young child,
and you can think of it as being
about graphs, but it's not.
No. Well, unless you're
really doing graph theory or something like that,
it's not about graphs,
it's about its domain,
and that domain has a vocabulary.
It hopefully has a type system that
mediates what gets composed with what.
You can think of those things as
graphs, but you shouldn't;
a compiler writer thinks about them as graphs,
A graph theorist maybe
might want to think about them as graphs,
but differentiation has nothing to do with graphs.
So, I want to make
that point because it distinguishes this work
from almost all other AD work
that I know of, especially reverse mode.
But I also want to say, you're going to see
a lot of graphs, and you might be tempted to
think I'm doing something like
these systems, TensorFlow or something,
that are all hung up on graphs. No,
these graphs are just about visualization;
they have nothing to do with automatic differentiation.
Even when I show you the result
of my differentiation as a graph,
the graph came second, not first.
Meaning it's a way for me to show you the result,
not a way to compute the derivative.
Okay, so here's an AD example,
a little simpler: this is the squaring function,
this is the expression
in this category language and this is a graph that
just shows you what
the computation looks like, all right?
Now, what I've got here is
the differentiable-function version,
so I take this expression and interpret it in
this category
of differentiable functions that I showed you.
Then I take the resulting function,
the representation inside it,
and I turn it into a graph,
I reinterpret it as a graph, and this is the result we get.
So, this is the original function, so that's a,
that's a squared, a times a,
and that's the first result.
The second result is going to be
a function that's linear,
and the way I'm showing functions
here as this little green box,
and the input is, in
this input and out is the output of the function.
So, what do we have here? This is a,
and this is the linear function that takes,
I'm going to call it da, okay, it's just a little delta.
So, we get a times da times 2, all right.
So, this is the linear map that maps da to 2ada.
You're used to thinking of the derivative as 2a,
the derivative of a squared is 2a.
It's not; it's da times 2a.
Okay, that's a linear function,
it's a function from da, that multiplies da by 2a.
When we say the derivative of a
squared is 2a, that's what we mean,
it means multiply by 2a because it's linear.
Okay, here's a more sophisticated example,
the sum of two squares.
okay this is the categorical
expression and when we interpret
these operations in
the differentiable function category, we get this result.
Again, we're going to compute the original function here,
so this takes a and b,
so that's a squared, b squared, and we're adding them.
That's the first result, and
the second one is the derivative.
So, it's going to take a da and a db,
and here we get da times a,
that's a times da, and this is, what,
b times db,
times db... that's not right.
Okay, so we get... oh yes! Of course, that's exactly right.
Never mind what I said;
this is correct, my brain was incorrect.
This is da times a,
doubled, so that's 2a da,
this is 2b db, and we're going to add them.
That's the linear map, right.
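In symbols, what the graph is showing:

    \widehat{\mathcal{D}}\ \mathit{magSqr}\ (a, b) \;=\;
      \bigl(\, a^2 + b^2,\ \ (da, db) \mapsto 2a\,da + 2b\,db \,\bigr)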
>> So, there is no real magic here,
right? You define the primitives,
like making cosine a special one you include once,
and then you just compose the-
>> Exactly.
>> The D guys, and that's it.
>> Exactly.
>> Right, yes, there's no magic here.
>> No.
>> Yes.
>> Exactly.
>> Except for the derivatives.
>> Yes.
>> That you need to feed in.
>> Yes.
>> But the rest is all composition.
>> Yes, so we have to know the derivatives of our primitives,
and then for every composing form,
of which there are two main ones,
sequential and parallel,
we know how to compose
these differentiable functions, and that's it.
That's all there is to automatic differentiation.
Okay. Yes, thank you,
I want to emphasize the simplicity of
what's going on here particularly in
contrast to the complexity of what goes on in
the literature about automatic differentiation,
particularly what's called reverse mode,
which is the mode that's needed for machine learning and
other high-dimensional optimization problems.
>> A question.
>> Yes.
>> How would you deal with sums,
and I'm thinking about like discontinuities and so on.
>> Yes, so sum types.
>> Yes.
>> Yes, so the question is, I've shown you how
we can deal with scalars,
and we can deal with products.
So, what about when we have
sum types, disjoint unions, that kind of thing?
There are a few angles on the question,
but the first and most fundamental question
is whether differentiation is even meaningful,
because once you have sum types,
you no longer can have purely continuous functions.
Well, you can, but none of them are interesting, right?
They would have to always stay in the left or
the right disjunct of the sum. All right.
So, I don't know,
I'm going to show you something that is similar
but interestingly different from sums,
but your question, yes, it's really about sums.
It's something I've thought about;
I think there's probably a really lovely answer,
and I don't know what it is.
>> I suppose in some way you could encode
sums and fit them inside your system with what it does have.
>> Yes.
>> What does it look like at the end of the day.
>> Yes, so you can Church encode it.
You can represent,
so a sum type is like a disjoint union:
it's either this or that, plus a tag,
something that tells you which it is.
So, even with something like real plus real,
you still get to know which real it is,
the left one or the right one;
that's why we call it a disjoint union.
I forgot what your remark was.
>> What do you get if you, I mean, Church encode it?
>> Oh yes, Church encoding. Yes, so
these types can be encoded in terms of functions,
actually even pairs can be encoded in terms of functions,
but actually that's not enough,
you need universals as well,
universal quantification, yes.
With universal quantification and functions,
you can get pairs and sums.
But I don't know: if you encode
even products that way, before we get to
sums, if you encode products in that way,
what happens to the notion of differentiability?
Does it just ride along in
a sensible way with the Church encoding?
That's a really interesting question.
If the answer is yes,
then it will be encouraging to try
the same thing with sums.
>> It might get to the same round of [inaudible].
>> Yes, yea, it might.
>> Yes, these are interesting questions.
>> The question is also, basically,
whether the Church encoding is an instance of continuations.
>> Yes, okay, and we'll get into that.
>> So, I'm sure we'll get to continuations.
>> Yes, we'll get to continuations. All right.
>> For Church encoding you need rank-two types-
>> Yes.
>> Which correspond to cones in category theory;
would they extend here in your theory?
>> Yes absolutely, yes.
What I love about this framework is, if you think about
automatic differentiation, and not
just that but a lot of different problems,
problems a lot of people work on, we can re-examine them
in terms of this language
of category theory.
What it leads us to is specifications that are so simple.
All right?
If I was trying to start from back propagation,
as it's described in the machine learning
literature or the AD literature,
I wouldn't have a prayer.
I don't even understand what this algorithm is,
I don't understand its meaning
or whether it's correct or not,
with all these complications.
But when we get to this really simple foundation,
then we can revisit these questions:
yes, what if you do Church encoding,
what if you do sums,
what if you aren't happy with the performance
you get out of it and you want to reason about that.
>> So, your objective is, you're given a circuit,
an arithmetical circuit, and you want to differentiate it?
>> I'm given a function, not a circuit.
>> Okay.
>> Or whatever somebody might say.
>> You don't want to use that word here.
>> Okay, yes.
>> You're given the representation
of a mathematical function.
>> Yes, you're right.
>> Your objective is to compile it,
compile the derivative and optimize it.
>> Yes.
>> That's the main goal.
>> Yes.
>> Its derivative.
>> To repeat the question: is it my goal
here to take a description of a mathematical function,
and generate code that computes its
derivative correctly and efficiently?
Is that the question, right?
>> Yes.
>> The answer is yes.
Yes that's exactly my goal.
Now, there is an important distinction:
we're given a representation of a mathematical function.
Given a mathematical function
itself, it's not computable;
we can't compute its derivative. But
given a representation, yes.
So, there are different representations you might choose.
A popular one is graphs.
What I'm saying is that's always a bad idea
compared with this much simpler setting of saying,
"Well, we're going to take the Lambda calculus."
Moreover, that's just a starting point;
what's really much more convenient
is this categorical expression,
to which we can automatically translate.
That step has nothing to do with differentiation.
>> There is also this type of symbolic differentiation.
>> Yes.
>> Is it different? I mean, I remember
the old numerical Fortran programs.
>> Yes.
>> But the high-level goal seems identical.
>> Yes, it is.
>> Yes.
>> So, if you read the AD literature,
maybe some of the machine learning AD literature,
a lot of papers start out by saying there are
three different ways to compute
derivatives on a computer.
One is what they call numeric, or approximation;
that's also called the secant
method, or the method of differences,
those are all different terms for the same thing,
and it's just: you take a couple of points,
hopefully near each other,
you draw a line,
and you say that's roughly the derivative.
Okay, well it's incorrect.
>> Isn't that what's called numerical?
>> Yes, people will call it
numerical, but that's not a good description,
because automatic differentiation is numeric too.
If you look at it one way it's numeric,
if you look at it another way it's symbolic.
So, calling just the first
one numeric is a little misleading.
>> Okay.
>> Yes, because people say the difference between AD
and symbolic differentiation is that
AD is numeric and symbolic is symbolic.
But that's a funny perspective;
I think it's a quite popular perspective,
but I don't think it's a very good one.
>> Well.
>> Well because-
>> Also, it seems it's all symbolic;
you're just working with symbols, right?
>> Yes, I think so. It's one of
these things where I would agree with you,
except that what you're saying is true but misleading.
>> Okay.
>> In the following sense,
so it may be factual but it's
misleading in the following sense:
everything that is done by a compiler is symbolic.
>> Sure.
>> Okay, so to single something
out and say it's symbolic,
that's not distinctive; if you're in a compiler, everything is symbolic.
Once you're at runtime,
you can say everything's numeric.
Well, okay, if your program is doing
symbol manipulation, like symbolic differentiation,
then, okay, it's not
a very useful way to think of it as numeric;
of course it is, because it's
using bit encodings of things,
but that's not a very helpful perspective, yes, exactly.
So, everything is symbolic when done by a compiler.
For automatic differentiation, the story is:
there are three ways to compute derivatives,
one is incorrectly by numerical approximation,
that's called the secant method or
method of differences, okay?
That's very popular, but it's
terrible performance-wise and accuracy-wise.
Another is what's called symbolic differentiation,
which is you write code
that does what you did in high school,
which is to manipulate symbols, and we'll say
that the derivative of sine u is cosine u du,
so you're writing something that is manipulating
representations of mathematical expressions, okay?
That's typically called symbolic,
and then there's automatic differentiation,
okay, which is what I'm showing you,
which uses the chain rule and so
on. But from my perspective,
I don't think that's a very helpful description.
I don't think it's true.
>> Because it's higher level, you're saying,
it's more mathematical; that's why it's not just numeric?
>> What I'm saying is that-.
>> It has algebraic rules.
>> Yes.
>> Like the rules you use when you
reorganize code in a compiler.
>> What I'm really saying is this,
automatic differentiation is symbolic differentiation
performed by a compiler.
That's my perspective on
the relationship between the two.
So you won't find that statement in the literature;
you'll find these treated as two
very different things, and people will
look at symbolic differentiation and say it
has all kinds of problems.
I think what is true is symbolic differentiation
done badly has all kinds of problems,
but that's nothing to do with symbolic differentiation,
that has to do with doing things badly,
all right? So, don't do it badly.
>> Symbol.
>> In the context of a compiler,
it's convenient to do it well.
So, I'm doing symbolic differentiation,
I'm tricking a compiler into doing
symbolic differentiation.
No big deal, because a
compiler does everything symbolically.
I want it to have good performance;
no big deal, because compilers are good
at maintaining performance, okay.
>> Can I say that one difference is that
with the usual symbolic differentiation,
it may not work, right?
You may write expressions that
cannot be handled by the code;
in your case there's a guarantee that if I
write within these limits and it type-checks, it works.
>> Yes.
>> So, now [inaudible] yes.
>> It's factual and misleading, yes.
So, if you do
things symbolically, you can run into problems;
if you do it here, you can't run into that problem.
But actually you can run into
the same problems in either
setting, which is: if you try to
express something outside of
the vocabulary for which we know how to
symbolically or automatically
differentiate, then we're going to run into problems.
So, maybe in one setting,
it's a little more common
to run past that expressiveness;
maybe, I don't really know if it is.
Okey-dokey, so here's the third example,
the sine and cosine of the product, okay?
This is the undifferentiated version,
and here's the differentiated version.
Okay, and what I want to show here is that there's a lot
of sharing of computation between the regular value,
which is sometimes called the primal, and the derivative.
Okay, well this is a small example,
so there can't be a whole lot of sharing because
there isn't a whole lot of computation going on.
But this is crucial,
and I forgot to mention this at first,
this one step early on in
the talk: when I said sequential composition,
the chain rule is not compositional,
ironically, since it's entirely about composition.
I made it compositional by
augmenting the function with its
derivative, instead of just computing the derivative.
That augmentation not only makes it compositional,
it makes it possible to compute
efficiently, and that's because
the original definition of just the derivative
uses the derivatives of functions
and the regular values of functions, and usually there's
a lot of shared computation
between a function and its derivative.
So, by putting them both into one specification,
it makes it possible to optimize and share computation.
Okay? Not only possible, it's quite easy, because I
showed you the code that shares, okay?
With what's called symbolic differentiation,
sharing is a big deal, and
the main criticism of symbolic differentiation is that,
well, you don't get sharing;
you end up doing exponentially large
work to compute, or
exponentially large work to optimize
away the redundant work,
so you usually do a huge amount of work at
compile time or at runtime.
All right, that's just
because people are used to
doing symbolic differentiation badly,
that's all; it's not
a problem with the symbolic approach, and as I said,
this is symbolic but the compiler's doing it.
I don't even have to help the compiler
be smart about sharing, because
compilers are smart about sharing;
that's part of their job.
>> That really helps, to turn the tree into a DAG.
>> Yes, the compilers do that,
when you say let. But on
the remark about wanting to turn trees into DAGs,
what I would say is, stop thinking about graphs, okay.
So, if you think about graphs,
then you're going to run into this thing like, "Oh my God,
it's easier to manipulate trees
compositionally than graphs," graphs are very awkward,
trees are super simple, it's like the simplest thing.
So, oops! Some trees are
exponentially larger than the graphs they represent.
In other words, trees don't have sharing;
you have to do duplication to represent the same thing.
So, you could think in terms of graphs and trees,
and manipulations of graphs that refactor big trees
into graphs with
more sharing, but that's a lot
of work and it's really complicated.
Just don't create the problem in the first place,
don't think about trees and graphs because
differentiation has nothing to do with trees and graphs.
Okay. So I go back to
the Haskell code that
is derived by solving these five equations,
the structure preserving nature
of the d-hat operation, okay?
So, this is the code I showed you a few slides ago,
and I want to note
that there is something that jumped out at me
when I first was working on this differentiation
in this category language.
I started noticing a pattern,
and when I start noticing patterns,
I like to write pretty code.
So, I tried to pull
those patterns out so I could
really see what was going on.
I noticed the first one that really caught my eye is
the composition of differentiable functions involves
the composition of derivatives, okay?
The identity differentiable function involves
the identity linear map.
So, the differentiable
identity function involves the identity linear map,
which is its derivative everywhere.
The composition of differentiable functions
involves the composition of
the derivatives, of linear maps.
So, not only that, but that's all that's needed.
These definitions use nothing about linear maps
other than that they have identity and composition,
and that they satisfy the laws.
So, that was interesting.
Then we get into Cartesian: oh, interesting.
The left extractor involves
the left extractor for the linear map,
the right extractor the right extractor,
and the fork involves the fork, and that's all it used.
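As a concrete reference point, here is a minimal Haskell sketch of the instances being described; the class and constructor names are my approximations, not the paper's exact code:

    import Prelude hiding (id, (.))
    import Control.Category (Category (..))

    -- Differentiable functions: the original function paired with its
    -- derivative at each point, the derivative here just a linear function.
    newtype D a b = D (a -> (b, a -> b))

    -- A small Cartesian vocabulary (illustrative names).
    class Category k => Cartesian k where
      exl  :: k (a, b) a
      exr  :: k (a, b) b
      fork :: k a c -> k a d -> k a (c, d)

    instance Category D where
      id = D (\a -> (a, id))
      D g . D f = D (\a -> let (b, f') = f a
                               (c, g') = g b
                           in (c, g' . f'))   -- chain rule: compose the derivatives

    instance Cartesian D where
      exl = D (\(a, _) -> (a, fst))
      exr = D (\(_, b) -> (b, snd))
      fork (D f) (D g) =
        D (\a -> let (b, f') = f a
                     (c, g') = g a
                 in ((b, c), \da -> (f' da, g' da)))

Nothing on the right-hand sides uses anything about the derivatives beyond identity, composition, projections, and fork.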
So, this code is not only pretty and poetic,
which is really how I got to it,
but it says something very powerful,
which is that linear maps,
there's almost nothing you need to know about them.
You need to know that they form
a Cartesian category also.
That means that every function-like thing
that forms a Cartesian category,
and there are a lot of them,
can play the role of linear maps.
So, this observation, and writing
the code with purely aesthetic motivations,
led to what I think is
a deep and very powerful insight, which is that
automatic differentiation can be expressed
not only elegantly,
but much, much more
generally than the original specification.
So, that's the point here.
Every one of these operations on
differentiable functions just uses
the corresponding operation on linear maps,
and therefore we can generalize:
we replace linear maps by any
other thing that has identity, composition,
projections and fork, and satisfies the laws.
So, all I've done from this slide to this slide,
is I've parameterized this notion.
So, D here, it had linear maps built in.
This D, the subscript says,
it's going to take a parameter,
it's going to take any old Cartesian category,
for instance linear maps, and use them here.
The definitions are unchanged.
Well, I made one little change.
I'm making a clear distinction: linearD takes
a function and its derivative,
and they're always going to be the same,
like id and id, or exl and exl.
But one will be on functions and the other will
be on the linear-map-like thing.
So, the code really is unchanged,
except for this duplication,
just to make that distinction clear.
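A sketch of that parameterization, with the category made into an explicit parameter (again, names are approximate rather than the paper's):

    import Prelude hiding (id, (.))
    import Control.Category (Category (..))

    class Category k => Cartesian k where
      exl  :: k (a, b) a
      exr  :: k (a, b) b
      fork :: k a c -> k a d -> k a (c, d)

    instance Cartesian (->) where
      exl = fst
      exr = snd
      fork f g = \a -> (f a, g a)

    -- Generalized AD: the derivative component lives in an arbitrary
    -- Cartesian category k (for instance, linear maps).
    newtype GD k a b = GD (a -> (b, k a b))

    instance Category k => Category (GD k) where
      id = GD (\a -> (a, id))
      GD g . GD f = GD (\a -> let (b, f') = f a
                                  (c, g') = g b
                              in (c, g' . f'))

    instance Cartesian k => Cartesian (GD k) where
      exl = GD (\(a, _) -> (a, exl))
      exr = GD (\(_, b) -> (b, exr))
      fork (GD f) (GD g) =
        GD (\a -> let (b, f') = f a
                      (c, g') = g a
                  in ((b, c), fork f' g'))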
Well, there's one awkward thing here:
negate and add are linear,
so we're going to take just the linear versions
of them. So here we are.
See, this is the left extraction on functions, and this is on,
I want to say, linear maps,
but it's not just linear maps;
it's arbitrary, this notion here.
But multiplication, that's awkward.
I don't really have a way to talk about it,
because originally I wrote it this way,
but this is functions.
This linear map here is literally a function,
so it's not an arbitrary Cartesian category.
Well, so, let's write this in another way.
I'm going to suggest a couple of pieces of vocabulary.
I'll tell you what they mean on functions,
and then we can generalize them also.
Then we get another expression
of a derivative of multiplication.
So, there are two things going on.
One is scaling and
the other is where we take a pair and we break it up.
We do some computations and we add the results.
Those are the two patterns that I want.
So, one of them is called scale.
So, scale of u,
is the linear function that multiplies its argument by u.
So, multiplication isn't linear.
That's what's called bi-linear.
But when you provide one argument to multiplication,
the result is linear.
Then there's this operation, I'm calling it join.
So, remember fork: it took
two functions, or two arrows, with the same domain,
took an argument, gave it to both,
and then paired up the results.
Join is the dual of that.
Sorry. Yes, that's right.
So, with Join, we have two of
these maps with the same co-domain:
the same result type, and possibly different domains.
So, we're going to take a pair
rather than generating a pair.
So, it's just a dual,
a reverse of a fork. What do we do?
Well, when you do join f and g, you take a pair.
You parcel out the arguments,
give one to f and one to g, and add the results.
So, that I'm calling join.
With this vocabulary, we can just
refactor this expression and now we get this form.
lambda ab, we're going to join
scale by b, with scale by a.
So, why is b first? Because it's b times da, a times db.
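As a small sketch of that rule, with plain functions standing in for the linear maps (illustrative names, not the paper's code):

    scale :: Num s => s -> (s -> s)
    scale u = (u *)                       -- the linear map "multiply by u"

    joinF :: Num c => (a -> c) -> (b -> c) -> ((a, b) -> c)
    joinF f g (a, b) = f a + g b          -- parcel out the pair, then add

    -- Multiplication with its derivative at (a, b): (da, db) goes to b*da + a*db.
    mulD :: Num s => (s, s) -> (s, (s, s) -> s)
    mulD (a, b) = (a * b, joinF (scale b) (scale a))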
So, now with this vocabulary,
I haven't really made this anymore general.
I've just described it differently but
I've introduced new vocabulary that does generalize.
So, here's a new vocabulary.
We have the category and Cartesian forms,
and now we have this co-Cartesian one.
It gives three operations that are the duals,
the reversals, of the Cartesian operations.
So, instead of having two projections or extractors,
we have two injections.
Left and right. Instead of
taking two arrows with the same domain,
we take two with same co-domain.
So, this is the new vocabulary.
Then one more thing: a class for things
that know how to do scaling. All right.
So now, given this vocabulary,
now we can write linear transformations as functions.
So, all I'm doing is repackaging what I said before.
So, this is a little newtype.
This is just saying we're going to represent
these additive functions
as just regular old functions,
but in a context of additivity:
all the types involved are additive.
Now, where that additivity comes in,
I'm omitting from this slide, but it's in the paper.
The interesting new thing here
is that we have these co-Cartesian operations.
To inject from the left,
we pad with a zero on the right;
to inject from the right, we pad with a zero on
the left; and the join operation does the addition,
parceling out and then adding,
with scaling just as multiplication.
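A minimal sketch of those definitions for ordinary functions on additive types; the class and operator names are mine, and the paper wraps this in a newtype that I omit here:

    class Additive a where
      zero  :: a
      (^+^) :: a -> a -> a

    instance Additive Double where
      zero  = 0
      (^+^) = (+)

    instance (Additive a, Additive b) => Additive (a, b) where
      zero = (zero, zero)
      (a, b) ^+^ (c, d) = (a ^+^ c, b ^+^ d)

    inlF :: Additive b => a -> (a, b)
    inlF a = (a, zero)                    -- inject left: pad with a zero on the right

    inrF :: Additive a => b -> (a, b)
    inrF b = (zero, b)                    -- inject right: pad with a zero on the left

    joinF :: Additive c => (a -> c) -> (b -> c) -> ((a, b) -> c)
    joinF f g (a, b) = f a ^+^ g b        -- the join does the addition

    scaleF :: Double -> (Double -> Double)
    scaleF s = (s *)                      -- scaling is just multiplication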
So, with these definitions,
we get back to where we were before:
we're essentially using
functions to represent linear maps,
but now we have this more elegant vocabulary.
Now we can do something more interesting:
we can change our representation.
Instead of using functions to represent
linear maps, we can use other representations.
Why would we want to?
What's wrong with using these functions?
It's the simplest way to do it, but it has
a problem, and here's the main one.
It's that sometimes we need to
extract a data kind
of representation from the linear map.
So, if we get a linear map,
it's represented by a function.
I might need to do something like, well,
a couple of examples are,
one is, if you're doing computer graphics,
and you're doing a lighting and shading,
so you got some 3D model and
the light source, surface properties,
and so on, to light a surface at a point,
we need to know the normal vector.
So you got the surface,
and you can think of the tangent plane or normal vector,
and we're going to take a dot product between
the normal vector and the vector from the eye to
that point, or from the light to the point, I guess.
So, how do we get the normal vector?
Well, we take the cross product of
the two partial derivatives.
So for that, we need a data
representation, we need vectors.
How do we get from a function that we know is linear,
to a compact data representation?
A vector for it?
Well, there's a simple way to do it,
which is: you take the function,
and as long as it's linear, you can sample it on a basis.
The results you get are the vectors.
So, you take the function
that you know is linear and
sample it on a basis.
What's a basis? For instance,
take the identity matrix, all right?
The rows or columns, however you think about it,
of the identity matrix form a basis.
So, you apply the function to each of those vectors,
then you accumulate the results,
and that is the Jacobian matrix,
the data representation.
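Here is a small illustration of that basis-sampling idea with naive list types (my own sketch, not code from the talk):

    type Vec = [Double]

    -- The standard basis of R^n: the columns of the n-by-n identity matrix.
    basis :: Int -> [Vec]
    basis n = [ [ if i == j then 1 else 0 | i <- [1 .. n] ] | j <- [1 .. n] ]

    -- Columns of the matrix of a function f : R^n -> R^m that we trust to be
    -- linear, obtained by applying f to each basis vector.
    materialize :: Int -> (Vec -> Vec) -> [Vec]
    materialize n f = map f (basis n)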
But it's terribly expensive in some settings.
If the domain of that function,
which is what we need a basis of, is high-dimensional,
say a million, and that's not atypical,
that's quite typical for machine learning,
so it's in the millions,
then I'm going to make a million-by-a-million
identity matrix.
I've got a million of what are called one-hot vectors,
which are zero everywhere with a one in a single position,
and I pump them all through this function.
It's extremely expensive; it's quadratic work.
So, this is not a viable approach.
This is why we wouldn't want that representation.
If we're doing optimization problems,
particularly if we have high-dimensional domains,
we would not want to use functions as our
representation, because it's just
too expensive to extract a data representation.
Another motivation besides computer graphics
is just search:
machine learning, deep learning in particular.
We want to do gradient descent; what do we need?
We need a gradient vector,
because that's what tells us in what direction to
adjust the input while searching for
a local minimum or maximum.
So, for an n-dimensional domain we could make n passes,
each on a one-hot vector,
which is very expensive, and particularly
expensive for gradient-based optimization,
which is like the most
valuable application of these ideas.
So, ironically, we have a very low-dimensional co-domain;
the result is one-dimensional
if we're doing optimization.
We have a function that takes
some high-dimensional input and gives
you one objective-function value.
Unfortunately, the cost of
converting the function representation depends on
the dimension of the domain;
it relies on the domain being small.
But the domain is large and the co-domain is small.
So, this is a serious problem.
So, there's a solution which is
use matrices in the first place.
Don't use functions to represent
linear maps. Just use matrices.
That's what everybody really does. All right?
Well, in a nicely typed setting,
there's a lovely and general way to think about that,
which is that a matrix,
with its rows and columns, is a two-dimensional rectangular thing.
Think of it as a structure of structures:
it's got a row structure and a column structure,
a structure of structures, and those structures
depend on the domain type and the co-domain type.
Okay, so that's what's going on here.
I'm not going to go into this in detail.
But if you define a notion of a matrix-like thing
in a more general setting,
and you define what it means to apply one of
these matrices to a vector,
and also to multiply one matrix by another,
that gives us an algebra problem
of the same form, these five equations:
say that this apply operation preserves structure.
We can solve those equations,
and what comes out of them
is matrix multiplication and so on.
Matrix multiplication is
a solution to an algebra problem.
I think I will skip over this part.
This is just about these three pieces of vocabulary:
scale, fork, and join,
that is the vocabulary of matrices.
If you think about matrices,
every matrix is either a one-by-one,
which means it's a scaling,
or it's a horizontal
or vertical juxtaposition of matrices.
That's not how I was taught to think
about matrices, but I think it is a very nice way.
If you have those three building blocks, one-by-ones,
vertical juxtaposition of matrices with the same width,
and horizontal juxtaposition of matrices with the same height,
you get all matrices, and those operations are
exactly scale, fork, and join.
The types guarantee that when you stack matrices,
they have the same width,
and if you juxtapose them
horizontally, they have the same height;
that's in the types. Where does that come in?
Join requires that the co-domains are the same;
fork requires that the domains are the same.
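A rough sketch of those building blocks, using an untyped list-of-rows representation (the talk's version uses the types to enforce the width and height conditions; these names are mine):

    type Matrix = [[Double]]            -- list of rows

    scaleM :: Double -> Matrix
    scaleM s = [[s]]                    -- every one-by-one matrix is a scaling

    -- "fork": stack two matrices with the same width vertically.
    forkM :: Matrix -> Matrix -> Matrix
    forkM f g = f ++ g

    -- "join": put two matrices with the same height side by side.
    joinM :: Matrix -> Matrix -> Matrix
    joinM = zipWith (++)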
Now.
So, at this point we're talking about matrices.
I would say, we tried functions,
it works out very nicely.
But, for high-dimensional domains,
it's too expensive to extract the data representations.
So, why don't we use matrices in the first place?
We can do that, but now
an interesting question arises,
which is that a lot of matrix multiplication happens:
everywhere we said composition,
that gets implemented by matrix multiplication.
Matrix multiplication is associative.
Why is matrix multiplication associative?
It's maybe a funny question,
why is it associative?
You might say, "Well,
here's a proof that it's associative."
But, I want to say,
"No, that's the hard way."
Matrix multiplication is associative
because function composition is
associative and matrix multiplication
correctly implements function composition,
when the functions are linear.
That's a much more direct way to think
about compositionality and about matrix multiplication.
Matrix multiplication is solution of a problem,
which is how do you implement correctly,
composition of functions that are
represented by matrices, and then,
answer to that or the problem is matrix multiplication,
the associativity follows from
the original motivation for making
it the first place, not after the fact.
So, matrix multiplication is associative,
but different ways of associating give you
different performance, very different performance.
So, that means if we're going to use matrices,
we want to be clever about how to associate.
The definitions that I showed you are not
at all biased as to how
to associate the compositions of linear maps.
Whatever composition structure you give originally,
it's going to maintain that structure.
Some of those compositions will be efficient,
and other ways you might have composed it,
but didn't, would be more or less efficient.
So, what we'd like to do, though, is not rely on
the programmer to write the compositions
in a way that's efficient for
automatic differentiation, because the programmer
has a different thing he's trying to achieve,
which is clarity and modularity.
The compiler should then
figure out how to make it efficient,
not the programmer, because those two are at odds with
each other: efficiency and
modularity are at odds with each other.
So, fortunately there's a simple way to,
well, I don't think it's simple actually,
there's a way to take any composition
of matrices and re-associate it
into something that's computationally optimal.
While it's not simple, there are a couple of algorithms.
One is the n-cubed algorithm that
does dynamic programming, and then
there are these much cleverer,
subtler algorithms that do it in n log n time.
Where is this smarts going to go?
Here's the cool thing: the cost of
matrix multiplication depends only on
the dimensions of the matrices, not the values.
Unless you want to be really clever and talk about
special matrices where, in a way,
you can predict where the zeros
are going to be;
let's say we're not talking about special matrices.
So, the optimal way to associate
the multiplication of matrices
depends only on the dimensions.
The dimensions depend only on
the types in this setting.
So, the dimensions are part of
the types, and that's compile-time information,
because it's a statically typed language.
So, we can do the optimal
re-association at compile time,
not at run time; that's a huge win.
There are two common associations that are made.
If you associate all the compositions to the right,
you get what's called forward-mode AD;
if you associate to the left, you get what's called reverse-mode AD.
For any kind of
gradient-based optimization,
reverse-mode AD is significantly better.
How much better? It depends on the dimension of the domain.
For machine learning these days it's huge,
it's several millions.
So, how can we do this optimal re-association?
Well, as I said, there are these algorithms,
but there's another way, which is:
if we just assume that the final codomain,
the final result type, is low-dimensional,
particularly if it's one-dimensional,
then I think it's the case,
and I don't know this for a fact,
I think it's the case that full left association,
in other words reverse mode, is always the best strategy.
I'm not sure that's true;
I am sure that it corresponds
exactly to what's done in machine learning,
which is what backpropagation does.
So, how can we think about getting
this left association automatically?
There's this nice technique which
is related to something
called continuation-passing style.
So, the idea is the following:
If I have a function going from a to b,
and these can be linear maps or regular functions and
so on, I can represent it by something that says,
"If I know how to consume b,
I can give you a way to consume a."
So, these two are equivalent representations,
you can transform either one to
the other without loss of information.
So, if I have a direct way of going from a to b,
there is an indirect representation that says,
"If I know what I wanted to do with b,
then I would know what I wanted to do with a,
which is apply this thing and then consume the b."
So, given an a, we're going to compute
the b and then consume it.
So, that's the meaning: given a function f,
we interpret f as something
that composes with f on
the right and with some consumer
of the resulting b value on the left.
That's where continuations come in.
If you make this transformation
and package it up as a category,
then no matter what composition style
you used in the original expression,
it's going to turn into fully
left-associated compositions in the representation.
You need a way to kick it off, and that's just
by supplying the identity continuation.
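A minimal sketch of that representation with plain functions (illustrative names; r is the final result type):

    newtype Cont r a b = Cont ((b -> r) -> (a -> r))

    -- The meaning: interpret f as "compose the consumer after f".
    cont :: (a -> b) -> Cont r a b
    cont f = Cont (\k -> k . f)

    -- Kick it off by supplying the identity continuation.
    runCont :: Cont r a r -> (a -> r)
    runCont (Cont t) = t id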
Then, one point I want to make is that I think there's a very,
very big performance win
here from this perspective, which is:
remember, we're going to be computing derivatives
of a function at a point, its input,
and when f is higher-dimensional,
say in its domain and co-domain,
its input and output, then that derivative
is going to be something large;
represented as a matrix it's going to be
two-dimensional, quadratically large. A lot of times, these
internal matrices are quite sparse; it could be
the zero matrix, which actually happens fairly frequently.
So, very sparse; the density is how many elements are non-zero.
The zero matrix is the sparsest you can get.
Also quite common is the identity matrix;
that's also extremely sparse, since its number of
non-zero elements is only the square root of
the total number of elements.
So, these matrices can be very
sparse, and that means
you're wasting a lot of computation.
If instead of just computing them,
we compute them composed
with the continuation that we're given,
then we can bypass
the construction of the matrices entirely,
and we get very small computations.
I'm not sure, but I think this is probably a very large
win performance-wise, and it's also very easy to implement.
>> So, Conal, there are only 15 minutes
left. Make room for questions.
>> Yes, absolutely, yes, thank you.
So, this slide and then a couple of
more is the punchline.
>> Yes.
So, this is a continuation category.
As I say, we've got
some underlying category, denoted by this arrow,
and we've got some ultimate result type r. So,
this r is going to be the ultimate codomain.
We're just going to package up
this notion of continuation,
and now we set up another algebra problem,
which is: how do we interpret,
say, a function here, or a linear map,
in this category? Well, we just do what I said before:
we're going to compose with
the function on the right,
and the continuation is on the left.
Now, we say cont has to preserve structure,
meaning those same five equations,
always the same five equations:
it maps compositions to compositions and so on.
When we solve this algebra problem
with the help of some isomorphisms,
we get this definition.
So, this definition is
reverse-mode automatic differentiation;
this code is reverse mode.
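For the Category part of that definition, here is a sketch of what solving the "cont preserves structure" equations gives; the Cartesian and co-Cartesian cases work out analogously, and the names are approximate:

    import Prelude hiding (id, (.))
    import Control.Category (Category (..))

    newtype Cont k r a b = Cont (k b r -> k a r)

    -- The specification: cont f means "compose with f on the right".
    cont :: Category k => k a b -> Cont k r a b
    cont f = Cont (. f)

    instance Category k => Category (Cont k r) where
      id = Cont id
      Cont g . Cont f = Cont (f . g)   -- composition flips, so every chain ends up
                                       -- fully left-associated: reverse mode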
It's much, much simpler
than anything I've seen in the literature.
The literature talks about graphs and talks about mutation,
partial derivatives; none of that is necessary.
Okay. All that's necessary is
this little continuation trick,
and that trick is not tied to differentiation.
It's just something that would be in a library.
You're programming with this vocabulary;
we're going to have a library.
It's going to have continuations, and they become
useful for all kinds of things.
So, this is reverse-mode AD.
It's the general AD category parameterized by
a continuation-transformed version of matrices.
>> Triple M subscripts.
>> Yes. So, matrices are indexed by the scalar.
The continuation is indexed by, well,
in this case, matrices and the final result type,
and you plug that into the general AD thing,
which is no more complicated than the specialized AD.
So, this is reverse-mode AD without tears.
Every time I read a paper
on reverse-mode AD, I want to cry.
So I think that's how I came up with this title.
Okay. All right.
Now, this is the final essential trick.
This is what gets us backprop.
It's a more specialized version
of reverse-mode AD that's a little more efficient.
Here's the key idea.
Look at linear transformations
from a vector space to its scalar field.
So here, u might be R to the million and s will be R,
the underlying scalar field.
In general, you can
have linear transformations between
any two vector spaces over the same scalar field,
but if the codomain is the scalar field itself,
then a linear map from u to the scalar field
is called a dual vector,
and there's this lovely duality
property in linear algebra.
You will see it in terms of matrix transposition.
Matrix transposition is all about
duality. Here's the cool thing.
If you're interested in
linear maps that result in the scalar field,
they're isomorphic to u itself.
If u is finite-dimensional.
Every one of these linear maps
has a particular form which is
the dot product with a vector,
and then it's waiting for its second argument.
Now, here's the cool thing.
We use that continuation representation.
So, this is the punch line.
We have linear maps from a to b.
We use the continuation transformation
and we get b-consumers
to a-consumers, linear consumers.
But these guys are isomorphic to b and a themselves.
So, if we represent linear maps from a to b in
this continuation form with the scalar field s
as the result type, then we can
represent that continuation form
by just linear maps from b to a.
Again, there's a very simple specification
that relates the continuation to the dual.
Again, we set up the algebra problem
with five equations and we solve it.
Now, we get this incredibly lovely form.
This gives you the implementation of a dual category.
Now, we can see, wow,
the identity is the identity,
composition is the flip composition,
the extractors are the injectors,
the injectors are the extractors,
fork is join and join is fork,
and scale is itself.
If you think of these operations,
fork, join, and scale, as matrix-building operations,
then what we've done is we've derived
the transposition of matrices.
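A sketch of that dual construction (the constructor name is mine; only the Category part is spelled out, and the rest behaves as just described):

    import Prelude hiding (id, (.))
    import Control.Category (Category (..))

    -- Represent a linear-map-like arrow from a to b by one going from b to a.
    newtype Dual k a b = Dual (k b a)

    instance Category k => Category (Dual k) where
      id = Dual id
      Dual g . Dual f = Dual (f . g)   -- composition flips
    -- With the Cartesian and co-Cartesian classes in scope, the remaining cases
    -- swap exl/exr with inl/inr and fork with join, while scale maps to itself;
    -- read on matrices, that is exactly transposition.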
So, here's backpropagation.
So, no mutations,
no variables, no graphs, no partial derivatives.
It's just the general AD
where we use the duals of additive functions,
and there are no matrix compositions here.
It's extremely efficient.
Then, I just have a bunch of examples,
but I think I'll pause and take questions.
If there are questions about examples,
I'll show them, and otherwise not.
>> Why is the dual of scale just scale?
>> Yes. That's funny.
So, the dual is not the inverse.
For the inverse, you'd want
to scale by the reciprocal.
So, it's really-
>> I think of it as the transpose.
>> I can give you a variety of answers.
If you think about matrices and you take
the transpose of a one-by-one, you get the thing itself.
But I'm not satisfied with that answer, because it
leaves the question of why you use
transposition in the first place. What does it mean?
What does transposition have to do with linear maps?
The duality and
the continuation constructions that I showed
you are exactly what motivates transposition.
So, I can say it's because
transposition preserves one-by-one matrices,
but that's a backwards answer from my perspective.
Transposition is about duality,
not duality about transposition.
>> So, I learned about automatic differentiation through
the framework of the commutative ring
of dual numbers over R.
I was wondering if you could relate the ideas of
your work to the language of dual numbers.
It seems to me that your idea of combining
a function with its derivative is exactly
appending a dual component to a function.
>> Yes. So, as I said there,
commonly AD is done in
two flavors, forward and reverse-mode.
Forward-mode is fairly easy and
convenient to express in a lot of
programming languages especially if
you have operator overloading.
A way to do that is: everywhere you have numbers,
instead use something that's
sometimes called dual numbers,
which is a pair of a number and a second number,
where the second number represents the derivative.
So, it's very much like what you see in my original type,
the function that gives you a pair
of a number and a linear map.
>> Exactly, yes.
>> Okay. So, yes,
you can explain the dual number idea
in terms of this categorical form.
But of course, if it's a pair of numbers,
then you can only talk about
differentiation of scalar functions.
All right. So, it's much, much less general.
It's also wrong the way that forward-mode AD is done.
Well, it's correct only because people are careful.
If you look at forward-mode,
usually you end
up taking functions from dual numbers to dual numbers:
a value and derivative to a new value and derivative.
But of those transformations from
a number and derivative to a number and derivative,
only some are sensible,
because the value out cannot depend on the derivative in,
and yet the type says it can.
So that to me is a fundamental flaw in this whole idea
of dual numbers and the whole way that forward-mode AD is usually done.
Meaning, you have to be careful:
of these functions from dual numbers to dual numbers,
some of them represent sensible differentiation,
and some of them don't.
If you refactor it and say, "No,
the value out cannot depend on the delta in,"
you get my representation,
in which you take a single number in,
and you get out a single number as
the value, plus another function from a delta to a delta.
So, in that representation,
the value out cannot
depend on the derivative in. Yes.
>> Makes sense.
>> So, that's something I didn't appreciate until I asked
myself exactly the same question because I did
this forward mode as well,
and I wondered, "Do these two relate to each other?"
That's when I realized, "Oh, this is
a really unfortunate aspect of the forward mode."
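For reference, here is the usual scalar dual-number trick sketched in Haskell (a standard construction, not code from the talk), including the caveat just mentioned: nothing in the type stops the value out from depending on the derivative in.

    data Dual = Dual Double Double          -- value and derivative

    instance Num Dual where
      Dual a a' + Dual b b' = Dual (a + b) (a' + b')
      Dual a a' * Dual b b' = Dual (a * b) (b' * a + a' * b)   -- product rule
      negate (Dual a a')    = Dual (negate a) (negate a')
      fromInteger n         = Dual (fromInteger n) 0
      abs    = error "abs: not handled in this sketch"
      signum = error "signum: not handled in this sketch"

    -- Differentiate a function written against Num by seeding the derivative 1.
    deriv :: (Dual -> Dual) -> Double -> Double
    deriv f x = let Dual _ d = f (Dual x 1) in d
    -- e.g. deriv (\x -> x * x) 3 == 6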
>> Let's see.
If there are no other questions, I'd like one more.
Could you go back to your slide
with the axioms for a category
and a Cartesian category.
Immediately after, you say that your implementation of
your D-hat functor satisfies these laws.
>> Yes, it does maybe.
>> Yes, okay. So, on the next slide, right?
You have your implementation of D-hat, right?
>> Yes.
>> I was wondering,
which axiom on the previous slide
does the Leibniz rule come from?
>> Oh, that's a different one.
>> It doesn't, then. Well,
that's the numeric category.
>> Yes, exactly. Exactly. So, it's
an unfortunate thing about
the way differentiation is taught in calculus classes,
the way that symbolic differentiation is done,
and now the way automatic differentiation is done.
There's a really unfortunate thing, which is:
you look at the rules you learned,
the rules I learned anyway, in differential calculus,
and you'd think it said the derivative of sine is cosine, but it didn't say that.
It said that the derivative of sine u is cosine u du,
and the derivative of cosine u is minus sine u du;
the du means times the derivative of u.
And then multiplication:
the derivative of a product uv is the derivative of
u times v, plus the derivative of v times u. All right.
Every one of those rules
has the Chain rule embedded in it.
Every one of those rules is
the Chain rule plus another rule.
So, by using this categorical approach we say: no,
it's not that the derivative of sine u is cosine u du;
the derivative of sine is cosine,
and separately we have the chain rule.
So, we don't need to have
these kinds of complicated rules,
every one of which has the chain rule embedded in it.
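A tiny sketch of what that factoring looks like: each primitive carries only its own derivative, and the chain rule appears exactly once, in composition (names are illustrative):

    newtype D a b = D (a -> (b, a -> b))

    sinD, cosD :: D Double Double
    sinD = D (\x -> (sin x, \dx -> cos x * dx))            -- derivative of sin is cos
    cosD = D (\x -> (cos x, \dx -> negate (sin x) * dx))   -- derivative of cos is -sin

    -- The chain rule, stated once and for all in composition.
    compD :: D b c -> D a b -> D a c
    compD (D g) (D f) = D (\a -> let (b, f') = f a
                                     (c, g') = g b
                                 in (c, g' . f'))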
>> Can you say anything about whether you exploited
these insights for, say,
better circuits, or was there [inaudible]?
>> Yes, yes.
So, well, I'll tell you what I do, and you can tell
me whether I'm exploiting them, if you don't mind.
So, there's this plugin I have;
it's a plugin for the Haskell compiler, GHC.
>> All right.
>> So there's GHC, it's a Haskell compiler.
I wrote a plugin for it that transforms
the internal representation, which is
a typed lambda calculus called Core.
It transforms the regular Haskell code into
categorical form and then
instantiates it with respect to
whatever category you want,
for instance, differentiable functions.
There are rewrite rules,
most of which are just specified in
Haskell code, in the rewrite-rule sub-language.
Those rewrite rules are true of all categories
if they only involve the category vocabulary,
of all Cartesian categories if they involve the Cartesian vocabulary, and so on.
So, there are a lot of optimizations in there that work for
all interpretations, including differentiable functions.
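As a hand-written illustration of the kind of translation involved (not actual plugin output), here is a small lambda term re-expressed in the categorical vocabulary and read back as a function:

    fork :: (a -> c) -> (a -> d) -> (a -> (c, d))
    fork f g a = (f a, g a)

    mulC :: Num s => (s, s) -> s
    mulC = uncurry (*)

    -- \x -> x * x, written point-free in the categorical vocabulary:
    square :: Num s => s -> s
    square = mulC . fork id id
    -- Reinterpreting the same combinators in another category (differentiable
    -- functions, circuits, ...) is what gives the other "compilations".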
>> Do you have any examples where you've tried it?
Say, taken circuits and compiled them from Haskell.
>> Yes. So, what I actually did-
>> I mean actually tried it, keeping it as a circuit.
>> Yes, well, I don't mind saying circuits.
This whole work was motivated
by compiling Haskell to hardware.
That was my job a few years ago.
I worked at a hardware company that had
this really cool dynamically reconfigurable hardware.
So, it's like an FPGA, but it reconfigured itself
two billion times a second rather than, like, once a year.
So, it reconfigured as it ran.
It ran at two billion times a second,
two gigahertz, and it
reconfigured itself at two gigahertz.
There's a huge win from doing
this kind of runtime optimization.
So, my job was-
>> Where do we see all this coming in?
>> Originally for things like FPGAs it's useful,
like running in a router,
checking your packets for a Russian virus or whatever.
>> This would be the reconfiguration.
What would require the reconfiguration
in circuits?
>> Reconfiguration in this setting:
three-dimensional circuits are much
more efficient than two-dimensional circuits;
can be, I mean.
A three-dimensional circuit can be much
more efficient than a two-dimensional circuit.
Why? Because most of the power and time
taken up during computation is moving electrons.
They have to move from one place to another.
Paths in three dimensions are much,
much shorter than in two dimensions.
So, if you can make a three-dimensional FPGA,
you can get much better performance.
But it's really hard to make a three-dimensional FPGA.
So, this fellow, Steve Teig,
had a wonderful idea, this insight;
this man is brilliant, and I knew within like
30 seconds that I wanted to work with him.
He realized that you can implement
a three-dimensional FPGA by
implementing one dimension in time instead of space.
That's the point of dynamic reconfigurability.
So, you get much, much shorter pathways in space-time,
and the architecture was called
Spacetime, and it really,
honest to God, is based on
Minkowski's mathematics of relativity,
the math that Minkowski used to clean up what Einstein did.
So, my job was: initially
it would be FPGA applications,
but eventually it would be all computation.
How are we going to program this thing?
It's not going to be in any sequential language;
every sequential language is going to be deeply
inadequate, because this is massively parallel,
and we want to be doing
millions of operations simultaneously.
Meaning, two billion times millions of operations per second.
So, it's not going to be any sequential language.
But a purely functional language
doesn't have this sequential bias.
So, how do we compile the purely functional language into
a three-dimensional FPGA? That was my job.
I hit on a solution:
I realized one day that you
could describe circuits using
this language that I've shown you.
This language is a language
that can be used to describe circuits.
Hardware engineers don't do that but you could.
Then I remembered, wow, this 1980 result that says
the typed lambda calculus is
equivalent to this vocabulary,
and Haskell is equivalent to the typed lambda calculus.
So, I go from Haskell, put it through GHC,
and get to its internal language,
which is a very small lambda calculus;
I take that internal language and
transform it into the language of categories,
and then, instead of instantiating it with
the usual category of computable functions,
I use the category of circuits,
massively parallel circuits.
Then I render that representation
in terms of graphs,
and then I generate Verilog code.
Eventually we were going to do things that are
better than Verilog, but that's where I started.
That really was my original motivation.
But then one day, because I'd
worked on AD a while ago,
I realized that the categorical approach is a much,
much better way to think
about automatic differentiation.
All right. I think that's a good stopping place.
>> Yes, thank you.
>> I'll be around.