AI Day Prep Study Series #2: James Douma (Machine Learning Expert)
This is James Douma.
From the early versions of Autopilot through several versions of FSD, he is probably the only person outside Tesla who has hacked into the internals and can discuss them from a machine-learning perspective.
There are no fewer than 16 conversation videos between these two (Dave Lee and James Douma).
Even allowing for some overlap, every one of them is packed with substance, so they are not easy viewing.
The goal in watching this video is to get a rough mental picture of
the bird's-eye-view network (BEV-net).
BEV-net is a key element of the inference NNs actually deployed on Tesla vehicles.
The NNs implemented for each individual camera on the vehicle,
the NN that fuses them together (one function of BEV),
BEV as a development environment inside Tesla,
and so on: things get confusing unless you keep what happens at the vehicle level separate from what happens inside the company, in the data center.
Incidentally, how is it that Mr. Douma can speak Japanese…
The individual topics of this conversation are as follows.
・Steady improvement in AP
・Tesla's approach to NNs
・How the NNs keep growing larger
・Per-camera vision NNs (backbones)
・A large increase in target variables (dependent variables)
・A large increase in on-screen outputs
・Driving logic executed on the basis of those outputs
・Driving logic = control code
・The processing NNs perform on raw camera data
・Perception by NNs
・Three steps:
・Perception
・Planning
・Control
・Control code is written by humans
・Planning is a mix of human-written code and NN-generated code
・Planning includes decision-making
・The yellow spline displayed in FSD represents the planning function
・Planning is the stage after sensing (perception)
・Most vehicle-level NN inference is spent on perception
・Vehicle-level NN inference is mostly understanding and perceiving the external environment
・Growth in the number of target variables
・Subdivision of target variables
・New features are constantly developed for each problem
・The leap at the point FSD was introduced
・A larger NN means more processing is being applied to the input data
・Scaling up NNs makes a larger number of target variables possible
・Scaling up usually raises the accuracy of the output data
・The correctness of the output data is measured probabilistically
・The closer a target variable's probability gets to 100%, the easier it is for humans to write control code
・For the control code, probabilities close to 100% are desirable
・There used to be one NN per camera
・The constraints of the NVIDIA GP106 era
・Now there are multiple NNs per camera
・They presumably switch between inference NNs to suit the situation: stopped at a light, driving at highway speed, and so on
・Dynamic-object perception NNs
・Static-object perception NNs
・Multiple specialized NNs also run simultaneously
・Dynamic vs. static is judged relative to the road as seen from the camera
・HW3 made it possible to increase the number of NNs dramatically
・Explanation of the fusion architecture
・Execution at the individual-camera level and at the fusion level
・Each individual camera NN produces two kinds of output data
・Fusion net = common net
・Data output toward the common network
・Fusion by the common network
・That fused data then goes to the bird's-eye-view network (BEV-Net)
・Fusion → temporal integration → BEV
・BEV has acquired a kind of "imagination"
・You can pose questions to BEV as if it had imagination
・BEV as the platform that makes such questions possible
・A simple camera NN cannot perceive what it cannot see
・The two roles given to BEV
・Synthesizing
・Guessing at the existence of things it cannot see (guess, not see)
・You can watch FSD doing this too
・FSD guessing at things it cannot see
・BEV can also take the passage of time into account
・The individual images are integrated into one large single frame; lining those frames up enables analysis over time, and that is projected into BEV
・The hacker community's findings and what Andrej Karpathy said at Scaled ML are largely consistent
・NN scale-up → development of top-down-view functionality in BEV → temporal integration
・Restating BEV's two roles
・1. Reconciliation of the individual camera images
→ a way of asking the car to reconcile all the different cameras
・Instructions to a machine must be specific
・You prepare an environment and a format in which the machine can generate an answer, and then you ask it the question
・Determining a (provisional) ground truth → a loss function can be supplied
・Feeding back the error information → improving the network
・2. BEV's other role → a domain easy to write code or software in
・BEV as the platform where NNs and programmers converse
・Projecting each camera image into 3D space rather than into BEV would also have been conceivable (at the vehicle level)
・That is, not a projection into BEV space
・But direct projection into 3D space is very hard
・Getting an NN to understand directly that the world is 3D requires a great deal of complex processing
・An NN is a blank slate
・An NN has to produce all of its outputs from the input data alone
・What you want is not a detailed model of reality but a simple, powerful representation of reality
・Projection into BEV space is enough to make an NN recognize the positional relationships between the cameras and objects
・Projection into BEV space is simpler than projection into 3D space, and better suited to FSD's purposes
・Projection into BEV abstracts away height and volume in real space to some extent
・(Crudely put: "tall object or short object, the car must not hit it either way, so throw them all into cuboids of the same height"; "what matters for avoiding collisions is width and depth, not height")
・Unlike before, directly cracking the code has become very difficult, so the NN architecture has to be inferred from the outputs, but there are several ways to do that
・Growing trust in BEV
・Tesla is not throwing its past NNs away
・It gradually improves the old NNs while adding new ones
・The programmers who wrote the planning code, as they added BEV, left the old NNs and their outputs in place and integrated BEV alongside them
・The field of view or POV (the pre-BEV approach)
・BEV turned what was 2D into pseudo-3D
・Lining up pseudo-3D frames yields a pseudo-4D that takes time into account
・The need for labeling
・Tesla does use maps
・But they do not need to be HD 3D maps
・Accelerometers
・1. Large-scale automation of labeling and of the training process that follows
・2. Self-supervised training
・An NN approach is not necessarily required inside DOJO
・When building 3D scene models inside DOJO, geometric priors and geometric analysis are sometimes sufficient
・First, humans label manually inside the reconstructed 3D scene
・Then, inside the 3D scene model DOJO has built, a program auto-generates labels based on those original labels
・It is entirely plausible that NNs are used in that label generation and in the training that improves its accuracy
・Labels can serve many purposes: labels for the individual camera-view NNs, labels for BEV, and so on
・Inside DOJO you can move around freely in the reconstructed 3D space, including through time
・NNs trained in DOJO are then deployed on the fleet and tested
・2. The other DOJO training method → self-supervised training
・This lets you ask, "How would the scene from one camera look when seen from another camera?"
・The "error between the inferences" from the different cameras' viewpoints becomes the error function that drives learning
・This works because the cameras' fields of view overlap
・Alternatively, inside DOJO you can ask: based on the frames up to now, predict what the scene will look like one frame later
・Since the time-series 3D data has already been loaded into DOJO, the answers can be checked
・The 3D data outputs generated in DOJO can also be used by BEV
・Higher accuracy requires a more nuanced understanding of the environment
・Urban driving requires understanding even pedestrians' body language, pose, and motion
・DOJO exists to make enormous amounts of computation possible
・But it is a system specialized for FSD-related computation
・Ordinarily, computers have no imagination
・They cannot imagine things that are not present in the image frame
・Inside DOJO's 4D space, you can ask questions as if the NN had imagination
・"Imagine the pedestrian who was visible until a moment ago and is now presumably hidden behind the bus stopped at the roadside"
・No frame exists for labeling the pedestrian hidden behind the bus, precisely because they are hidden and cannot be seen
・But frames from before they went behind the bus, and from after they emerged, do exist
・Imagination in FSD ("imagining" itself is already realized to a considerable degree in FSD)
・DOJO is what enables the leap in that imagination
・Scene-by-scene prediction is already being done in today's FSD
・When approaching an intersection, the vehicle clearly behaves differently depending on whether a cyclist is approaching or stopped at the roadside
・Video auto-labeling as the holy grail
・Even DOJO will need human labeling at first
・Once DOJO can generate labels by itself, the humans' job will become verifying those labels
・Eventually even that verification will no longer be needed
・DOJO can "predict" the future perfectly, but the fleet actually on the road can only "infer"
・what three-dimensional dynamic model of that scene is most consistent with all the sensors saw all the way through the scene from beginning to end
→ DOJO knows this (naturally)
・NNs can be trained against this ground truth
・The goal is to give NNs inference abilities that surpass humans
・DOJO is not the agent doing the training; it is the environment inside which NNs are trained
・What is constructed inside DOJO is a time-series 3D space built solely from camera input data, optimized for training NNs
・Current progress on DOJO
・Probably at the stage of building the hardware and software of the DOJO infrastructure?
・NNs are empirically driven, not theory driven
・An NN's outputs cannot be predicted accurately before the experiment
・DOJO prototypes are probably already complete, and experiments are presumably being run on them
・The labeling tools Tesla is developing in-house probably have thousands of features and are no doubt boosting labeler productivity, but that is an ongoing process
・FSD is not a product that is "done" at some point; it will keep being upgraded indefinitely
・No other company is attempting what Tesla is attempting at Tesla's scale
・Many companies develop labeling tools, but probably none can handle Tesla's scale
・Tesla probably develops almost all of its labeling tools in-house, because the data volume is too large for existing labeling tools to handle
・They have probably completed the first cut of the DOJO chip
・Even with the chip done, many challenges remain
・Power design, cooling design, communications design: building a large data center is an entirely different set of problems
・That said, with just the chip done, they could run it at some scale within an existing data-center framework
・But that would not really deserve to be called DOJO.
・Once large-scale DOJO operation begins, NN computation costs should drop dramatically
・(Lidar is also discussed briefly)
Revealed: Inside Tesla's FSD Neural Nets w/ James Douma (Ep. 255)
DOUMA
the first version of the first Autopilot had a lot more code and very little in the way of NNs in it and over time that has expanded
over the last couple years I’ve seen little snapshots of the NN architectures
looked at what's the state-of-the-art for this particular architecture
like what are they able to achieve with a certain size network
What Tesla is doing is different from what researchers are doing with these networks
so from the beginning I was looking at them from that standpoint
and up until FSD beta came out there's a pretty solid steady evolution we could see
the networks they would get bigger occasionally
they would change the structure of the way they were doing
the inputs
the system itself
you can think of it a couple different ways
one way of thinking about it is that
they have a vision network on each camera
what that vision network does is
it takes the video stream coming in from the camera
and it analyzes it
and then it produces a bunch of outputs for each camera
and the outputs might be
where are the stop signs in this frame
or where are the pedestrians in this frame
or where are the cars
or how far away are the cars that you can see
where are the stop lines
where are the markings
where are the curves
so a single NN they started out with a small number of variables
but over time as it's become more sophisticated
the number of variables that they get out of each camera has grown and grown and grown
and now there are literally thousands of them
that they're asking for all these networks
for some things multiple cameras get used and the data between the cameras kind of interact
these NNs they're outputting a bunch of these variables
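(To make "a bunch of these variables" concrete, here is a minimal sketch of what a per-camera multi-task output might look like. Purely illustrative: the head names and values are invented, not Tesla's actual format.)

```python
# Hypothetical per-camera output: one vision net, many named heads.
frame_outputs = {
    "stop_signs":      [(410, 122, 0.97)],                   # (u, v, probability)
    "pedestrians":     [(250, 300, 0.91), (600, 310, 0.62)],
    "car_distances_m": [12.4, 33.0],
    "lane_lines":      [[(0, 470), (320, 380), (640, 300)]],
}
print(len(frame_outputs["pedestrians"]))  # 2 pedestrian candidates this frame
```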
now there's more beyond those outputs
for the front camera, one of the NN outputs is where all the pedestrians are in the frame
so the output of that is like a frame with little boxes
around all of where all the pedestrians are
say that's what comes out
so another piece of code has to take and make a decision on the basis of that
this is the driving logic
there's a couple of layers to this
there's the sensors themselves what comes out of them
and then what NNs are doing most is perception
which is taking the raw sensor outputs
and turning it into the kind of information
that you could meaningfully use in a human-written program
for instance where are the pedestrians or
where are the center lines or
where is the car located in the lane
that's what i'm calling perception here
so taking the sensor input and turning it into something that's usable
after that there's planning which is given these inputs
what actions should I be taking
to pursue the goal that the car is trying to do
like if it's trying to drive down the street or if it wants to make a right-hand turn
it has to decide okay what are the things I need to do from this point forward
to achieve that goal
and then at the end there's
this layer called control which is
it takes one action at a time
it gets the car to do that
so the control stuff is it actually turns the steering wheel or activates the brakes
and so control is all written by people
the planning part is kind of a mix
at this point probably
i've seen some outputs from the cameras which are clearly intended for planning
they're not just what i'm seeing
the simplest one of those to understand is one of the things that NNs do is
they guess at a path through the scene that's ahead of it
the car is almost always moving forward
so in addition to here are the cars
and here are the lane markings and that kind of stuff
the NN makes a suggestion and it draws this three-dimensional spline
which is just a curve with a couple of bends in it
through the scene ahead of it and that's a recommendation for
where the car probably wants to go
this is the NN looking at this saying
this is probably the way forward
so on a curving road the spline would follow the curve
if you're in a lane and there's a car ahead of you
the spline might go around that car for instance
it'll make some suggestions along those lines so that's a planning function
it's not just a sensing function and
there are other things going on
that we can see in the camera networks
that are planning related
but the overwhelming majority of stuff is just answering questions about
what's the situation outside the car right now
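(Here is a minimal sketch of the perception → planning → control split described above. All names, thresholds, and outputs are invented for illustration; this is not Tesla's code.)

```python
# Illustrative three-layer pipeline: perception (NN), planning
# (mix of NN suggestions and human rules), control (human-written).
from dataclasses import dataclass

@dataclass
class Detection:
    kind: str            # "pedestrian", "car", "stop_sign", ...
    distance_m: float    # estimated distance ahead, in meters
    probability: float   # NN confidence

def perceive(camera_frames: list) -> list:
    """Perception: raw sensor data -> usable variables (NN stand-in)."""
    return [Detection("pedestrian", 8.0, 0.97)]

def plan(detections: list) -> str:
    """Planning: decide what to do given the perceived scene."""
    if any(d.kind == "pedestrian" and d.distance_m < 10.0 for d in detections):
        return "slow_down"
    return "follow_spline"

def control(action: str) -> None:
    """Control: actually actuate steering/brakes (stub)."""
    print("braking" if action == "slow_down" else "steering along spline")

control(plan(perceive(camera_frames=[])))  # prints: braking
```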
the cars have been doing that for a long time
over time we've seen the networks get bigger and there have been more variables
some of the variables get broken up into smaller ones
you can tell that as they bump into problems in the development of particular features
they'll add additional variables to help them refine their understanding of some phenomenon
that they're trying to break down in a way that's easier for the code
that decides what the car is going to do
we saw a really big change when it went to FSD
first of all the network's got a lot bigger
there were a lot more networks and they were a lot bigger than they were before
now when a NN gets bigger that means you're applying a lot more processing to the input
because you're trying to generate more outputs
but a lot of times you just want better accuracy
so the bigger a NN is and the more you train it on
and the more computation you spend on a particular NN
the higher the accuracy of the output could be
so the outputs of NNs they're inherently probabilistic so
when it gives you that little screen and
it's telling you where the pedestrians are
it's drawing little boxes where it thinks it sees pedestrians
and each one of them has a probability associated with it
there's a 60 percent probability there's a pedestrian here
and I think there's a 90 percent here or a 95 percent.
of course what you want is for the NN to be as close to perfectly accurate as possible
especially when it's a really important question
and the bigger you make the network and the more training data
the more accurate those numbers will become
those numbers those probabilities will get a lot closer to 99 percent
and you won't see a lot of 50 60 70 percent things
because that's a problem for the people writing the code
what do you do if the car says I think there's a 60 percent chance
there's a pedestrian in front of you
do you brake or do you not
you want the probabilities to get closer to 100 percent and zero percent as the choices
because that makes the programming easier
and it also means that you're going to have fewer overreactions or under reactions that
the vehicle does
and the bigger you make the network and the more data you train it on
the closer you get to that ideal of perfection of always seeing a pedestrian when they're there
always seeing the lane markings exactly right and so on
so that's one thing we saw
we saw more networks and we saw them get a lot bigger
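(The point about probabilities can be shown in a few lines. A hypothetical sketch: the control code ends up gating decisions on a threshold, which is only comfortable when the net's outputs sit near 0 or 1.)

```python
BRAKE_THRESHOLD = 0.9  # invented cutoff a programmer might pick

def should_brake(pedestrian_prob: float) -> bool:
    # With a well-trained net most outputs are ~0.99 or ~0.01,
    # so this single threshold rarely faces the awkward middle.
    return pedestrian_prob >= BRAKE_THRESHOLD

print(should_brake(0.60))  # False: the hard 60% case Douma describes
print(should_brake(0.99))  # True: easy for the control code
```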
now previously I described it as if there was one NN per camera
and in the early days it was one NN per camera when they were running on the GP106 the NVIDIA GPU
they had a limited amount of processing power
so they didn't have the luxury of being able to run completely independent networks on every camera
for the purpose of getting different kinds of things
and now what we see is they do run multiple networks on different cameras
and probably the networks are somewhat context dependent
like they might switch networks
depending on whether they're stopped at a light
or driving on a highway or trying to maneuver through an intersection or something
but we also see the networks get kind of specialized
like there are networks
that are looking for moving objects
because the moving objects have certain things in common
and if you build a network that just looks at moving objects
for instance the time domain aspects of that kind of stuff are important
then there might be another network for static objects which is to say
things that aren't moving relative to the road
like stop signs and trees and the road itself and curbs they don't move
so we see a proliferation of networks
where a single camera might have two, three, four
or more networks, and all these networks run in real time
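(A sketch of what that proliferation might look like in code. The context names and network names are invented; Douma is inferring this structure from the outputs, so treat it as illustrative only.)

```python
# Hypothetical dispatch: which specialized nets run in which context.
NETS_BY_CONTEXT = {
    "stopped_at_light": ["traffic_light_net", "pedestrian_net"],
    "highway":          ["moving_object_net", "lane_net"],
    "intersection":     ["moving_object_net", "static_object_net",
                         "pedestrian_net"],
}

def nets_to_run(context: str) -> list:
    """All selected nets would run in real time on the same camera."""
    return NETS_BY_CONTEXT[context]

print(nets_to_run("intersection"))
```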
DAVE
I’m wondering what the state of camera fusion is
and are the NNs being applied to that fusion view
or is it still being applied to individual cameras
DOUMA
they have a fusion architecture
where they take a bunch of cameras
so the individual networks are still producing the outputs
they also have a really big output
that feeds into a common network
that brings all these cameras together
to create like a single fused view of that
and then the fused networks go into a bird's eye view sort of network after that
and what the bird's eye view network does is
it asks the car to imagine what would the world look like
if I were looking down on the car from a great height
like give me a map of the car and its surroundings
so when you look at the display in the car
if you've watched the FSD videos
that's pretty close to what's coming out of the bird's eye view networks
for instance if you're driving down a road the bird’s eye view network would show the curb next to you
and it would show things that were in the median or on the sidewalk next to you
and things on the other side of them also
the bird’s eye network it'll also guess
if you're driving past a wall or if you're driving past another car
the bird’s eye network doesn't see
what's on the other side of the car which is occluded
the car can't actually see what's on the other side of the car
it'll guess based on what it sees in front of and behind the car
if it sees a curb extending up to and past a car it'll guess that the curb extends through behind it
and the bird’s eye network is asked to bring together the top-down view from all these different cameras
to synthesize a unified view
and it's also asked to guess about the things that you can't see
and you can see this on some of the FSD videos
where it'll guess sometimes incorrectly about things that it can't see
so when the vehicle is driving past an obstacle
the things on the far side of the obstacle they might vary
or you might see a pedestrian will walk behind a car
and then the network will guess that the pedestrian continues for a little while
and then the pedestrian will vanish because at some point AP is no longer sure
if the pedestrian is still there
maybe the pedestrian stopped walking
or maybe they turned and they went some other way
the bird’s eye networks also incorporate time
the camera networks all come together to create a unified view of one frame
and that those get fed into the bird’s eye network
then the bird’s eye network looks back over multiple frames
now so this is something
that is pretty hard to see in the networks
by the nature of how this stuff goes
so I don't get to see a lot of it
but Karpathy's talked a number of times about this at Scaled ML
where he talked in some significant amount of detail
about the architecture they were using in the vision networks and
how they were doing it in the bird’s eye networks
so everything I’ve seen is consistent with what he talked about
so my sense is that
what he talked about at Scaled ML a year ago
is a pretty accurate representation of what they're doing on the car
so to give a short answer to your question
the network's got a lot bigger
they added a lot of this bird's eye top-down stuff to the systems
and they've added temporal integration
so they're looking across time in addition to just static frames
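(The data flow Douma just summarized, per-camera nets → fusion net → BEV net with temporal integration, as a toy numpy sketch. Every shape and operation here is a placeholder, not the real architecture.)

```python
import numpy as np

def camera_backbone(frame: np.ndarray) -> np.ndarray:
    """Per-camera vision net: image -> coarse feature map (placeholder)."""
    return frame.mean(axis=2)[::12, ::12]            # 96x96x3 -> 8x8

def fusion_net(features: list) -> np.ndarray:
    """Common net: combine all camera features into one fused view."""
    return np.stack(features).mean(axis=0)

def bev_net(fused: np.ndarray, history: list) -> np.ndarray:
    """BEV net: fused view plus past frames -> top-down style grid."""
    return np.stack(history + [fused]).mean(axis=0)  # temporal integration

frames = [np.random.rand(96, 96, 3) for _ in range(8)]  # 8 cameras
fused = fusion_net([camera_backbone(f) for f in frames])
top_down = bev_net(fused, history=[np.zeros((8, 8))])
print(top_down.shape)  # (8, 8) map with the car at the center
```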
DAVE
how do you think the FSD software
for example the bird's eye view seems like it's giving a broader view
but then you also have this forward facing view
let's say the main view of the forward-facing cameras
let's say something is happening
when you're going through a turn
there's some type of obstacle
the forward facing cameras sees that obstacle
the bird's eye view might see something a little different
how does the software reconcile those two different views
and what priority does it give
DOUMA
the bird's eye view is a product
the car of course it can't see down from the top
it's got no way of directly perceiving that
so bird's eye view it accomplishes kind of two independent things
one of them is that
it's a way of asking the car to reconcile all the different cameras
because if you're looking down from the top
no individual camera can see everything around the car from the top
so if you're going to generate a bird's eye view
essentially you've got a little square map and the car's in the center
that's one way of asking the NN to fuse all the camera views
because the front camera looks forward
and the other cameras look to the side
and if you ask it okay now put all that together
and tell me what the whole picture looks like
you can train NNs to take almost any kind of input
and generate almost any kind of output
but you have to have a way of asking a question that's relevant
so that you're challenging the network to come up with outputs that make sense
so that when you train it
you're training it to make sense
in the context of what you're trying to accomplish
maybe the most important thing the bird's eye view network approach does is
it asks the car to synthesize to put everything together into a picture
that makes sense
you're asking a question that forces BEV to reconcile the multiple overlapping camera views
because if you don't challenge it to do that
it won't learn to do it
so you want to ask a question that's the simplest question you can ask
that at least includes the thing you want
if what you want is to integrate all the camera views into a holistic understanding of what the car's environment is
one thing you can do is
asking the network what would it look like
given all these camera inputs
what do you think it would look like
if I was looking down on the car from the top
so they're asking that question
and that's something they can answer
so that they can determine a ground truth
and provide an error function to feedback to the network
to challenge it to get better
so that's one thing
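(A sketch of the training signal just described: because a ground-truth top-down map can be determined, an error function can be computed and fed back. Real training would backpropagate this loss through the network; the grid here is invented.)

```python
import numpy as np

def bev_loss(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Mean squared error per cell of the top-down grid."""
    return float(((predicted - ground_truth) ** 2).mean())

truth = np.zeros((8, 8)); truth[2, 3] = 1.0        # one object, top-down
good_guess, bad_guess = truth.copy(), np.zeros((8, 8))
print(bev_loss(good_guess, truth))  # 0.0    -> nothing to correct
print(bev_loss(bad_guess, truth))   # ~0.016 -> error fed back to the net
```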
the other thing that a bird's eye network does is
it's a domain that's easy to write software in
so imagine that you have to write the software to control the car
if I have a camera view forward
and you've got a pedestrian in front of the car
you have to guess how far the pedestrian is
just having a pedestrian in front of the car isn't sufficient to make a decision
about what you should do
you're driving on a curve and that pedestrian's on a sidewalk
when you ask the NN to create a bird's eye view
you're also generating an output that's an easy output
for a programmer to write rules on
because a programmer can look at the bird's eye output
and he can say okay tell me where the road is
so here's the road
and you can ask the question
is this pedestrian in the road or are they not on the road
it's not like “Are they in front of me or not”
once you've asked the NN to create this map of the environment
now your programmers have a map to work with
to make decisions about how they want to control the car
so you're getting two things out of the bird’s eye network
one of them is you're getting a straightforward framework
for fusing all these cameras together to get a kind of holistic view
which is a way of asking the network to reconcile what it sees on different cameras
and then you're also getting an output that's actually useful
because the people who are writing the planning code and the dynamic control code
they now have a representation that they can work with
it's an easy framework for a human programmer to work with rules inside
so bird's eye networks are a very clever solution to that
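(Why BEV is "a domain that's easy to write software in", as a sketch: in top-down coordinates, "is this pedestrian in the road?" becomes plain geometry. Coordinates and road edges are invented.)

```python
def pedestrian_in_road(ped_x_m: float, road_left_m: float,
                       road_right_m: float) -> bool:
    """BEV coordinates in meters, vehicle at the origin."""
    return road_left_m <= ped_x_m <= road_right_m

print(pedestrian_in_road(1.0, -3.5, 3.5))   # on the roadway  -> True
print(pedestrian_in_road(6.0, -3.5, 3.5))   # on the sidewalk -> False
```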
you could imagine another thing
I imagined that they were trying to make a full three-dimensional sort of virtual model
the world is three-dimensional
a vehicle that's in front of you it has a height a width and a depth and
it occupies some position relative to you
and so as a human being when you're sitting in a driver's seat
you see another vehicle
you have this sense of that thing in space out ahead of you
that you're in a volume of space and
because that's a simple and accurate representation of the reality that we're in
it's a good framework to be able to understand everything
and to work in
but the thing is
it's really challenging
you could ask the NN to give you this completely all-encompassing depth map of everything
which is a complete description of the world
if you ask for that and if you challenge a NN to do that
eventually it'll be able to do it
but it's not the simplest thing you can ask that forces it to figure that out
the simpler thing is just asking the NN
well what would this scene look like from a different perspective than the one I’m at
that also requires a NN to understand that the world is three-dimensional
and that other objects occupy space separate from the vehicle
a NN it doesn't know any of this stuff
it's a complete blank slate
it doesn't even know what a child knows
when you start with it every single little aspect of what it learns about reality
is something it has to figure out from the data that you're giving it
so you have to ask it questions
that challenge it to come up with simple and powerful representations of the world
that you can also build on to write code to control the vehicle
getting a NN to understand that the world is three-dimensional is actually really challenging
we're giving NN a bunch of 2D images
though they're 2D projections of a 3D world
but it's got to make decisions in 3D and somehow we have to stimulate it to understand that
it's looking at a three-dimensional world
it's not looking at four or five dimensions I guess it's four in a sense
it's a moving three-dimensional world and that's not at all obvious
in the way that we build the NNs
we have to challenge them to figure that out
and the bird's-eye view is a simple clever solution to stimulating the network to figure that out
because you only have a bird's eye view in a three dimensional universe
DAVE
let's say the planning module or the planning code in FSD
are they relying more and more on the bird's eye view for planning
because before the bird's eye view
you didn't have that so you're relying more on just the camera view
are you seeing a shift over to more planning on the bird's eye view at all
DOUMA
so I’m mostly looking at NN architectures and
I have to infer what they're doing
based on the outputs
the bird's eye view outputs are going to be a lot easier to work with
than the field of view outputs are
I’m sure that they're making very heavy use of that
one of the things that we see in the evolution of these things over time is
we might expect that when they had a new piece of capability come out
we were going to see this discontinuous change in how they were doing things
it's never really been that way
they add new networks
they gradually transform the old ones
they consolidate old ones
I haven't seen them really get rid of anything so they still have all the
outputs that they had before
my guess is that
when the people writing the planning code
suddenly had bird's eye top-down stuff
they didn't immediately abandon the way that they'd been doing stuff
they started integrating it into the way they were doing planning
and over time they'll probably rely more and more on it
as they know they can trust it
and they figure out how to use it effectively
and then gradually the ways they were using the field of view or POV representations
will just kind of gradually go by the wayside
but bird's eye view is super powerful
and with the temporal integration they get the 4D stuff
they were in 2D before where they had snapshots
and now they're challenging the system to understand that it's a 3D world
and bird's eye view is an important component of how they're doing that
and they're asking NN to understand that
things evolve over time
if you see multiple frames in a row
if a box is traveling through a scene
and it's labeled truck at 90, 90, 90, 80, 90 across five frames in a row
well you can be a lot more confident that the truck is actually there
if you see it in multiple frames
and so little variations in the probability
it doesn't affect your confidence that it's actually a truck
if you have to make decisions based on a single snapshot they do affect your confidence.
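(Douma's five-frame truck example as a sketch: the per-frame probabilities wobble, but an aggregate over the track is steady, which is what temporal integration buys you.)

```python
def track_confidence(frame_probs: list) -> float:
    """Average confidence across a tracked object's frames."""
    return sum(frame_probs) / len(frame_probs)

print(track_confidence([0.90, 0.90, 0.90, 0.80, 0.90]))  # 0.88: clearly a truck
```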
DAVE
how do they test and train like for example
for more stationary objects
the neural net will output different boxes or identify different objects
a cat, a car, a truck etc
then you could train it
you could show them the correct things and
you could go back and correct the incorrect object etc
to make the neural net improve but
with the bird's eye view
do you think some type of training is going on
in terms of correcting incorrect type of things
and how is that training being done
DOUMA
it's not nearly as straightforward as ordinary deep neural net training
like show me where the pedestrians are in this photo
if I have a bunch of photos and I draw boxes around all the pedestrians
I can challenge the NN like this
just give me this output like
here's a pedestrian draw box around it
so I have a bunch of pictures with pedestrians which is pretty straightforward
but if I ask the NN like show me a top-down view
and now put boxes around all the pedestrians
that's a lot harder
so you can do a certain amount of labeling
by pulling in other sources that are naturally top down
such as maps for instance
and this is where it starts getting really interesting
so you can try to synthesize a true top-down view of the environment
and this is when Elon was talking about
video labeling and using DOJO to train in video
and the way that it works is you have a car drive through a scene
all right you capture all the output
from all the cameras the accelerometers
all the other sensors and that kind of stuff
you take all the footage from all those cameras
you put it in DOJO
you put in a really big computer
and that computer walks that data back and forth
and figures out what the ground truth must be
and you don't have to use neural networks to do this
you can use geometric priors and other sort of more straightforward geometric analysis
to figure out what the three-dimensional scene must be in that situation
then you can have a human being look at that three-dimensional scene
on a computer in 3D and say
this is a pedestrian
this is a fire hydrant
these are the lane lines
once the computer's got those labels
it can go back to all the frames that were used to make that scene
and it can label all of those inputs and
it can tell you because it's got the whole 3D scene built inside
if I was looking straight down on the car from the top
this is what I would see at each instant in time
and you can create this three-dimensional model of the thing
then you can automatically generate all the labels
that you need for training
not just the cameras but also the bird's eye view
for instance DOJO can do a bunch of geometric back-end work on a stream of data
where it knows exactly what happened from the beginning to the end
and it can go back and forth over it a bunch of times
and throw a lot of computation at it
and eventually figure out what the 3D scene is
and generate all the labels the car has to work with
we're training a NN to figure out what DOJO can do
with a great deal of computation in the back end
and then DOJO can go figure out all this stuff to create the labels
and then we challenge the NN to do this on the fly while it's driving
so that's one way of doing it
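(A toy sketch of that auto-labeling idea: once the offline system has reconstructed the 3D scene and a human has labeled an object once, the label can be projected back into every camera frame automatically. The pinhole projection and all numbers are invented.)

```python
import numpy as np

def project(point_3d: np.ndarray, cam_pos: np.ndarray, f: float = 500.0):
    """World point -> pixel (u, v) for a camera at cam_pos (toy pinhole)."""
    p = point_3d - cam_pos                   # camera-centered coordinates
    return (f * p[0] / p[2], f * p[1] / p[2])

pedestrian = np.array([2.0, 0.0, 15.0])      # labeled once, in 3D
for t, z in enumerate((0.0, 5.0, 10.0)):     # car advancing along z
    cam = np.array([0.0, 0.0, z])
    print(f"frame {t}: auto label at pixel {project(pedestrian, cam)}")
```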
there's another technique which Karpathy also talked about at Scaled ML
which is a self-supervised training
in self-supervised training
you do a thing where you challenge a NN to tell you
what a scene looks like from one camera
when it's seen from another camera
or you challenge a NN to tell you
say the car is driving down a road at 30 miles an hour
and you look out one of the side cameras at the side of the road
and you see a scene the cameras take 36 frames a second
so 1/36 of a second forward in time the scene
will have shifted slightly
and I can ask the NN tell me what the next frame looks like
you can ask the NN to predict what a different camera would see
or what it will see at some point in the future
those techniques they're called self-supervised because nobody has to label the data
the system supervises itself
it generates its own inputs now
the inputs you're testing against aren't quite as meaningful to a programmer
because I’m not asking it tell me where the pedestrians are
that you're seeing out the side camera
I’m asking it tell me
what the camera will see in 30 milliseconds in the future
but the thing is
in order to be able to do that trick of predicting what it's going to see a moment in the future
or predicting what another camera will see at the same moment in time
it has to figure out a lot of stuff
about what the scene really looks like
and so that's a different trick that you can use
it produces kind of different outputs
it'll give you some of the geometric understanding of the scene
which the bird's eye view also requires
and which the bird's eye uses in a different way
so these things work together
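(The self-supervised trick in miniature: no human labels, because the next frame itself is the answer key. `predict_next` below stands in for the NN; the identity baseline is just to make the sketch runnable.)

```python
import numpy as np

def self_supervised_error(predict_next, frames: list) -> float:
    """Average error between predicted and actual next frames."""
    errs = [((predict_next(frames[t]) - frames[t + 1]) ** 2).mean()
            for t in range(len(frames) - 1)]
    return float(np.mean(errs))

frames = [np.random.rand(8, 8) for _ in range(5)]
print(self_supervised_error(lambda f: f, frames))  # "scene doesn't change" baseline
```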
when Tesla started doing this
you could just do scene labeling and you could just ask it
where the pedestrians were in the frame
as the systems become more complicated
and they're looking for greater and greater levels of accuracy
more nuanced understanding of environment is needed
an important thing when you're driving down the road
is what's that pedestrian going to do
so you see a pedestrian standing at a curb
this is a big problem when you drive down the street in San Francisco
you're constantly driving about two feet from a bunch of pedestrians
and there's a lot of difference to how you behave with a pedestrian
who's walking along the sidewalk towards you
and one who's looking at their phone standing on a curb
and you're wondering
if they're going to step out in front of you
so eventually the NNs will have to understand all that stuff in real time too
they're going to have to be able to read the body language and pose and motions
of pedestrians
as well as other vehicles and cyclists and that kind of stuff
so that they can predict what that person's going to do
and take appropriate action in San Francisco
you have to predict what pedestrians are doing and that's a pretty hard challenge
so as the things the network is being challenged to do get harder
as the predictions become more and more challenging
we have to get more and more clever
and not just ask it for the data one way
but ask it a bunch of different ways
so eventually when they've got DOJO working
they can really throw a lot of computer power at this
they'll be able to do a lot of that drive-through scene video
three-dimensional labeling, backpropagation, all that stuff
that's really computationally intensive
DAVE
correct me if i'm wrong
with video labeling
let's say my model 3 goes through a scene
and right now it's using the NN
in my car to identify objects giving it to planning and then control
but with DOJO let's say
we take that video of the scene
we give it to DOJO
DOJO takes it through a super computer
crazy amount of computation doesn't necessarily have to be neural net
or machine learning
but it could be geometric as well and comes up with pretty much a 3D picture of that scene
and then a human labeler can go through that 3D picture scene
label the key objects
the car moving through
the person crossing the street
the dog going
then we can take those labels and then train back the neural nets
so that neural nets are more accurate in how they perceive that scene
so we're using the 3D construction through DOJO
to create a more accurate picture of the environment
as we label it but then using that to train
the neural nets in the car
so in a sense you're giving these neural nets like super power meaning
they're on a different level now because they're using not only
what DOJO has constructed as the environment
but also you do this over tens of millions of scenes
and you train the neural nets to identify the objects as well as, let's say, DOJO would
you're giving NN an immense amount of increased accuracy
is that kind of the gist of kind of DOJO and video training
DOUMA
you get a lot more accuracy
because essentially you're gonna have a lot more data that's a lot more accurately labeled
so that's one thing that you get out of it
but another thing that you get is
if I just label the pictures
and I’m just asking it okay tell me
a human says this is where all the pedestrians are
and I give the unmarked image to a NN
and okay you tell me where the pedestrians are
and then I compare that to reality
I compare that to what the human said
and I generate an error function
I propagate back to improve the network
but one of the things you can't do with that approach is
I can't ask the system to imagine something that's not in the frame
and require it to do a good job of doing that
when we start fusing the networks together
and we build the whole three-dimensional scene
I can start asking
I've got a pedestrian that walked behind an occlusion
a tree a bus or something
and I want the NN to understand that the pedestrian is still there
it's still walking along
I can ask it to imagine that the pedestrian is there
even though I don't have a picture to label
so DOJO can build the whole three-dimensional scene
including objects that are temporarily occluded
and I can ask the NN
tell me what's behind that bus right now
or tell me what's behind that car
because if it's a moving object a human will know this
if you see a car parked across the street from you
and another car transitions through the intersection
you understand that when the car's in the intersection in front of you
the other car is still at the stop light on the other side
humans know that
but neural networks that are trained in a simple way, they don't know that
but with the DOJO approach they can do that
because I’m asking the NN to tell me the whole three-dimensional scene
including the stuff it can't see right now
I was super excited the first time I saw the FSD videos
you could see that the network was labeling stuff it couldn't see on the other side of obstacles
because we were giving them imagination
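(A toy version of how the offline system can label the occluded pedestrian: it knows where they were before the bus and after, so it can fill in the hidden frames, here with a crude constant-velocity interpolation.)

```python
def fill_occluded(x_before: float, t_before: int,
                  x_after: float, t_after: int, t: int) -> float:
    """Interpolate the hidden position between two sightings."""
    frac = (t - t_before) / (t_after - t_before)
    return x_before + frac * (x_after - x_before)

# Seen at t=0 (x=0 m) and t=10 (x=8 m); occluded by the bus in between.
for t in range(1, 10):
    print(f"t={t}: labeled at x={fill_occluded(0.0, 0, 8.0, 10, t):.1f} m")
```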
DAVE
so with the whole 3D creation through DOJO
you're mapping these objects and where they're going to be in the future
where they're headed
in a sense predicting
where they're headed through that whole scene but then teaching the neural nets
that movement
you're teaching them how to see ahead
DOUMA
that's the next step
understanding the situation instantaneously
which might include that
where the car is right now and here's its velocity
that's also an instantaneous thing because the vehicle has a velocity
another thing is tell me where the car is going to be in 100 milliseconds or 200 milliseconds
where is that car going to be when I’m 50 feet forward right
that's another level
beyond what we're talking about right now
they are doing that already
you can already see in the FSD videos
the car behaves differently like when
if it approaches an intersection and there's a cyclist coming
it behaves very differently than if the cyclist is stopped at the side of the road
there's already evidence that AP is looking at a scene
and it's predicting what the various dynamic objects in the scene are going to do
and responding to that
in the long run it has to be really good at that we expect that of humans
if you're going to pull out in front of another car
you need to have a sense of where
that car is going to be
when you can get done accelerating up to speed
is my path going to cross that other car's path
at any point in the future
if I do this maneuver so you have to be able to project both your path forward
into the future and the other thing and
understand if there's going to be any undesirable interaction there
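(That path-projection idea as a sketch: extrapolate both paths forward and check the closest approach. Constant velocity and all the numbers are invented; real prediction is far richer.)

```python
import numpy as np

def min_separation(p_ego, v_ego, p_other, v_other, horizon_s=5.0):
    """Closest distance between two constant-velocity paths (toy model)."""
    ts = np.linspace(0.0, horizon_s, 51)[:, None]
    return float(np.linalg.norm((p_ego + ts * v_ego) -
                                (p_other + ts * v_other), axis=1).min())

# Ego pulling out at 6 m/s; another car approaching from the left at 15 m/s.
d = min_separation(np.array([0.0, 0.0]),    np.array([0.0, 6.0]),
                   np.array([-40.0, 10.0]), np.array([15.0, 0.0]))
print(f"closest approach: {d:.1f} m")  # small value -> don't pull out yet
```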
DAVE
Elon Musk was saying that video auto labeling is the Holy Grail what does he mean by that
DOUMA
where you have a car drive through a scene
you take all the data that comes off of the car
you stuff that into DOJO
so DOJO recreates the 3D scene
you can auto label probably almost all that stuff
DOJO also has access to the trained NNs of the previous version of the car
so it can run through that scene with those networks and it can do a first pass guess
at where all the stop signs are
and as the networks get better
it's going to be doing a really good job of that
it's going to be 99.999% right so
when they first build DOJO and they're first building these 3D scenes
they'll have to label a lot of stuff
there'll be a lot of details that DOJO isn't getting
but as it gets better DOJO will be able to build this three-dimensional scene
and pre-label thousands of things in the scene and
so then the human labeler's job will be mostly just verifying that DOJO is right
and of course at the tail end of that you don't even need a human in that loop
DOJO can create vast volumes of labeling data
and then you feed that into the NNs
and you close the loop with fewer humans in the process
right now their labelers are limited
they've only got so many labelers
and it's really labor-intensive
when DOJO is labeling it also knows the future
because DOJO's got the whole clip
the whole 10 or 15 seconds
so DOJO knows what the pedestrian was doing in the future
after your car passed the bus and looked back and saw the pedestrian was there
so whereas the NN on the fly
obviously can't know the future
DOJO gets the whole complete thing
DOJO can run it backwards and forwards and can figure out
what three-dimensional dynamic model of that scene is most consistent with all the sensors saw all the way through the scene from beginning to end
and then we ask the NN at any point in time
to guess at the things the NN doesn't know
and eventually like people it'll get good at those guesses
say you have a pedestrian walk behind a bus
and you imagine the pedestrian keeps walking but maybe the pedestrian stopped
you can't know there's an inherent sort of uncertainty to that
DOJO can know what the pedestrian ultimately did
because it'll know after the car drove past the bus
that the pedestrian did in fact emerge from the back of the bus
and the pedestrian's movement was consistent with walking
then DOJO knows what the pedestrian must have been doing
when they were occluded by the bus
of course the real fleet car will never be able to do that on the fly
because it just can't know the future
all the car can do is make a good guess about that but then that's just a limitation of reality
the networks will eventually get really good at predicting the things which can be predicted
but there will always be things that can't be predicted
you don't know if the pedestrian tripped behind the bus and fell over
that's just hard to predict
DAVE
how much of this effort do you think is Tesla
is already doing right now at this moment
creating 3D scenes using that as training is that
something that they're
venturing into right now or is this
something they're kind of waiting for until they get DOJO really up and running
DOUMA
They might be building the infrastructure to do that
and it'll be a work in progress for a long time
NN technologies are new enough
anything you want to do is pretty complicated
there's a good chance nobody's done it before
NNs they're very empirical it's not a theory driven domain
we have theories about NNs and why they do what they do
and they're not very good
so you can't use a theory to predict
if I build the NN three times bigger
and I give it this data and I include other data at the same time
well now what will my accuracy be? we can't do that
what we do is we build a rough sketch
and we test the idea to see if it makes sense
you can build a prototype that might be crude
the prototype will help you understand the benefit of doing it
so I think they did that
and they've probably got fairly sophisticated prototypes
now and they're probably building their way up the stack
their tools are going to constantly get better
the tools aren't just one or two key features
there are thousands of small features that make the labelers more productive
and that improve the quality and quantity of the output
that you have available for training the networks
FSD is not a product like a toaster where it's just done one day
it'll just keep getting better for a long time
all the way along that every single tool in their arsenal
they'll keep refining as they go
they've probably already prototyped tools that they won't be using in production
for five years or three years or something
and they have other tools that they've been using
for a long time that they're still refining
DAVE
how cutting edge is this
Are other people other companies doing stuff like this
Is there anyone else doing this at scale
DOUMA
I don't think there's anybody doing what Tesla's doing at Tesla scale
there are certainly other people who train NNs
and use lots and lots of labeled data
and there are companies that are in the business of just making labeling tools
you have 500 labelers
and here's a tool that they can sit down at their desk
and it'll make them productive and help them avoid errors
so there's a market for those tools there are plenty of companies that are doing that
I kind of doubt anybody else is doing it at the scale that Tesla's doing it right now
I think they probably are building most of the tools that they're using
Because probably none of the commercial tools that are out there
can handle the scale that they're working at
so yes and no
other people are doing it
But I don't think anybody's doing it at the same scale that Tesla are
DAVE
do you think by the end of this year
Tesla DOJO will be up and running in some form or fashion
that it will make a significant difference to FSD and how accurate it is
DOUMA
so where is DOJO right now
they probably could have the first cut of DOJO silicon done
but DOJO is more than just a silicon chip
they want to make a chip
that supports a particular computational architecture that they want
and Elon's already talked about the numerical format that they want to use
which is a numerical format that nobody else builds in silicon
so they're building their own silicon to do this
but to build a system at scale
that uses lots and lots of these chips requires a lot of power design
a lot of cooling design
communications for these kinds of things is very complicated
and it takes a lot of work to get the communication networks to tile these things together
to build a big machine
and that is a bigger effort than making the silicon
on the other hand you can start using the silicon
if they've got their first version of their chip
they can run off a thousand of those
put them on motherboards, go in the back room, and pull a Google
and use regular computer racks and get that thing working and
they'll want to do that to start understanding how these things work together
and verify that the chip works and that kind of stuff
is that DOJO?
I think their aspirations are high enough
they want enough sophistication out of this thing
that there's a good chance that they haven't built a full up DOJO at this point
like a full rack of the final design
but they have early versions
my guess is that right now they can probably buy so much computing capacity
that the hardware that they've built probably isn't moving the needle on it
will they do that this year?
maybe if they wanted to they could
I don't know if they'll have a final version of DOJO
when they get to where they start scaling DOJO then I think it'll matter
they'll very quickly get to a point
where DOJO drops their cost of computation by an order of magnitude
like out of the gate
so as soon as they get it for the same amount of money they're spending
they get 10 times as much back-end processing
and that'll move the dial on it for them
when they get that
maybe that'll be this year