AI Day Study Series (2): James Douma (Machine Learning Expert)


 

This installment is about James Douma.

 

From the early versions of Autopilot through several versions of FSD, he is probably the only person outside the company who has hacked into the internals and can talk about them from a machine learning perspective.

 

There are no fewer than 16 videos of conversations between these two.

 

Even though there is some overlap, every one of them is packed to the brim with content, so getting through them is no simple task.

 

The goal in watching this video is to get a reasonably clear mental picture of

the Bird's Eye View Network (BEV-net).

 

BEV-net is a key element of the inference NNs that actually run on Tesla vehicles.

 

the NNs running on each of the vehicle's individual cameras

the NN that fuses them together (one function of BEV)

BEV as a development environment inside Tesla

DOJO

It is easy to get confused by items like these unless you keep what happens at the vehicle level separate from what happens inside the company and in its data centers.

 

Incidentally, I still wonder how Douma-san is able to speak Japanese...

 

 

The individual topics of this conversation are listed below.

 

・Steady improvement in AP

・Tesla's approach to NNs

・How the NNs keep getting bigger

・The per-camera vision NNs (backbones)

・A large increase in the number of output (target) variables

・A large increase in what is output on screen

・The driving logic that runs on top of those outputs

・Driving logic = control code

・What the NNs do with the raw camera data

・Perception by NNs

・The three steps:

・Perception

・Planning

・Control

・The control program is written by humans

・Planning is a mix of human-written code and NN output

・Planning includes decision-making

・The yellow spline shown in FSD reflects the planning function

・Planning is the stage after sensing (perception)

・Most vehicle-level NN inference is spent on perception

・Vehicle-level NN inference is mostly about understanding and perceiving the external environment

・The growth in the number of output variables

・The subdivision of output variables

・New capabilities are constantly developed for each new problem

・The leap that came with the introduction of FSD

・A bigger NN means more processing is being applied to the input data

・Scaling up the NNs makes it possible to increase the number of output variables

・Scaling up usually also increases the accuracy of the output data

・The correctness of the output data is measured probabilistically

・The closer the output probabilities get to 100%, the easier it is for humans to write the control code

・For the control code, probabilities close to 100% are what you want

・In the past: one NN per camera

・The constraints of the era when they were using the NVIDIA GP106

・Now: multiple NNs per camera

・The inference NNs are presumably switched by situation, e.g. stopped at a light versus driving on a highway

・NNs for perceiving dynamic objects

・NNs for perceiving static objects

・Multiple specialized NNs also run at the same time

・Dynamic versus static is judged relative to the road as seen from the camera

・HW3 made it possible to dramatically increase the number of NNs

・Explanation of the fusion architecture

・Execution at the individual-camera level and at the fusion level

・Each per-camera NN produces two kinds of output data

・Fusion net = common net

・Data output toward the common network

・Fusion performed by the common network

・That fused data then feeds into the Bird's Eye View Network (BEV-Net)

・Fusion → rotation (temporal) → BEV

・BEV has acquired a kind of "imagination"

・You can pose questions to BEV as if it had imagination

・BEV as a platform that makes such questions possible

・A simple camera NN cannot perceive what it cannot see

・The two roles given to BEV:

・Synthesizing

・Guessing at what cannot be seen (guess, not see)

・You can watch FSD doing exactly this

・FSD guessing at things it cannot see

・BEV can also take the passage of time into account

・The individual images are merged into one big single frame; lining those frames up makes time-series analysis possible, and that is projected into BEV

・The hacker community's findings are largely consistent with what Andrej Karpathy described at Scaled ML

・Scaling up the NNs → developing the top-down-view functionality in BEV → temporal integration

・The two roles of BEV, restated:

・1. Reconciliation of the individual camera images
      → "a way of asking the car to reconcile all the different cameras"

・Instructions to the machine must be specific

・You set up an environment and a format in which the machine can produce an answer, and then you ask it the question

・Determining a (provisional) ground truth → a loss function can be supplied

・Feeding the error information back → improving the network

・2. BEV's other role → a "domain easy to write code or software in"

・BEV as a platform where the NNs and the programmers talk to each other

・Projecting each camera image into 3D space instead of into BEV would also have been conceivable (at the vehicle level)

・That is, instead of projecting into BEV space

・However, projecting directly into 3D space is much harder

・Getting a NN to understand directly that the world is 3D requires a lot of complex processing

・A NN is a blank slate

・A NN has to produce every one of its outputs from the input data alone

・What you want is not a detailed model of reality but a simple, powerful representation of it

・Projection into BEV space is enough to let the NN grasp the spatial relationship between the cameras and objects

・Projection into BEV space is simpler than projection into 3D space, and for FSD's purposes it is the better fit

・Projection into BEV abstracts away, to some degree, real-world height and volume

・(Put crudely: "Tall object or short object, the car must not hit it either way, so just throw everything into cuboids of the same height"; "what matters for not hitting things is width and depth, not height")

・Unlike before, cracking the code directly has become very difficult, so the NN architecture has to be inferred from its outputs, though there are several ways of doing that

・Growing trust in BEV

・Tesla is not throwing its older NNs away

・It gradually improves the old NNs while adding new ones

・Even as BEV was added, the programmers who had been writing the planning code left the older NNs and their outputs in place and integrated BEV alongside them

・The field-of-view or POV representation (the pre-BEV approach)

・BEV turned what was 2D into pseudo-3D

・Lining up pseudo-3D frames gives a pseudo-4D view that takes time into account

・The need for labeling

・Tesla does use maps too

・But they do not need to be HD 3D maps

・Accelerometers

・The two kinds of training that DOJO enables:

・1. Heavily automating labeling and the training process that follows it

・2. Self-supervised training

・An NN approach is not necessarily required inside DOJO

・When building the 3D scene model inside DOJO, geometric priors and geometric analysis are sometimes sufficient

・First, humans label manually inside the reconstructed 3D scene

・Then, within the 3D scene model DOJO has built, a program automatically generates labels based on those original human labels

・NNs may well be used for that label generation and for the training that improves its accuracy

・The labels can serve many purposes: labels for the individual camera-view NNs, labels for BEV, and so on

・Inside DOJO you can move freely back and forth through the reconstructed 3D space, including through time

・NNs trained in DOJO are then deployed to the fleet and tried out

・2. The other DOJO training method → self-supervised training

・This lets you ask: "How would the scene from one camera look if captured from another camera?"

・The "error between inferences" made from the different camera viewpoints becomes the error function that drives learning

・This works because the cameras' fields of view overlap

・Alternatively, inside DOJO you can pose the question: based on the frames so far, predict what the scene will look like one frame later

・Since DOJO already holds the time-series 3D data, the answers to such questions can be checked

・The 3D data outputs generated in DOJO can also be used for BEV

・Higher accuracy requires a more nuanced understanding of the environment

・In city driving, even pedestrians' body language, pose, and motion need to be understood

・DOJO exists to make enormous amounts of computation possible

・But it is a system specialized for FSD-related computation

・Normally, computers have no imagination

・They cannot imagine something that is not in the image frame

・Within DOJO's 4D space, you can ask the NN questions as though it had imagination

・"Imagine the pedestrian who was visible until just a moment ago and is presumably now hidden behind the bus stopped at the roadside"

・There is no frame in which to label that pedestrian hidden behind the bus, precisely because they are hidden and cannot be seen

・But frames from before they went behind the bus and after they came out from behind it do exist

・Imagination in FSD ("imagination" itself is already realized to a considerable degree in FSD)

・DOJO will let that imagination take a further leap

・Scene-level prediction is already being done in the current FSD

・When approaching an intersection, the car clearly behaves differently depending on whether a cyclist is approaching or stopped at the roadside

・Video auto-labeling as the holy grail

・Even with DOJO, human labeling will probably be needed at first

・Once DOJO can generate labels on its own, the human job will shift to verifying those labels

・Eventually even that verification will no longer be needed

・DOJO can make perfect "predictions" about the future, but a fleet car actually out driving can only make "inferences"

・"what three-dimensional dynamic model of that scene is most consistent with all the sensors saw all the way through the scene from beginning to end"
  → DOJO knows this (of course it does)

・NNs can then be trained against that ground truth

・The goal is to give the NN inference ability that surpasses humans

・DOJO is not the thing being trained; it is the environment in which the NNs are trained

・What gets built inside DOJO is a time-sequenced 3D space constructed solely from camera input data, optimized for training NNs

・The current state of DOJO's progress

・Probably at the stage of building out the DOJO infrastructure, hardware and software?

・NNs are empirically driven, not theory driven

・You cannot accurately predict a NN's output before running the experiment

・A DOJO prototype is probably already finished, and experiments are most likely being run on it

・The labeling tools Tesla is building in-house probably have thousands of features and are probably boosting the productivity of its labeling staff, but that is an ongoing process

・FSD is not a product that is ever "finished"; it will keep being upgraded indefinitely

・No other company is attempting what Tesla is attempting at Tesla's scale

・Plenty of companies build labeling tools, but probably none of them can handle a scale like Tesla's

・Tesla most likely builds nearly all of its labeling tools itself, because its data volume is too large for existing labeling tools to handle

・The first cut of the DOJO chip has probably been completed

・Even with the chip done, many challenges remain

・Power design, cooling design, communications design: building a large-scale data center is an entirely separate problem

・That said, with the chip finished, they could operate it at some scale within an existing data-center framework

・But that would probably not be something you could call DOJO

・Once large-scale operation of DOJO begins, the cost of NN computation should drop dramatically

・(Lidar is also discussed a bit)

 

 

 

Revealed: Inside Tesla's FSD Neural Nets w/ James Douma (Ep. 255)

 

DOUMA

the first version of the first Autopilot had a lot more code and very little in the way of NNs in it and over time that has expanded

over the last couple years I’ve seen little snapshots of the NN architectures

looked at what's the state-of-the-art for this particular architecture

like what are they able to achieve with a certain size network

What Tesla is doing is different from what researchers are doing with these networks

so from the beginning I was looking at them from that standpoint

and up until FSD beta came out there's a pretty solid steady evolution we could see

the networks they would get bigger occasionally

they would change the structure of the way they were doing

the inputs

the system itself

you can think of it a couple different ways

one way of thinking about it is that

they have a vision network on each camera

what that vision network does is

it takes the video stream coming in from the camera

and it analyzes it

and then it produces a bunch of outputs for each camera

and the outputs might be

where are the stop signs in this frame

or where are the pedestrians in this frame

or where are the cars

or how far away are the cars that you can see

where are the stop lines

where are the markings

where are the curves

so a single NN they started out with a small number of variables

but over time as it's become more sophisticated

the number of variables that they get out of each camera has grown and grown and grown

and now there are literally thousands of them

that they're asking all these networks for

for some things multiple cameras get used and the data between the cameras kind of interact

these NNs they're outputting a bunch of these variables

now there's more beyond those outputs

for the front camera, one of the NN outputs is where are all the pedestrians in the frame

so the output of that is like a frame with little boxes

around where all the pedestrians are

say that's what comes out

so another piece of code has to take and make a decision on the basis of that

this is the driving logic

there's a couple of layers to this

there's the sensors themselves what comes out of them

and then what NNs are doing most is perception

which is taking the raw sensor outputs

and turning it into the kind of information

that you could meaningfully use in a human-written program

for instance where are the pedestrians or

where are the center lines or

where is the car located in the lane

that's what i'm calling perception here

so taking the sensor input and turning it into something that's usable

after that there's planning which is given these inputs

what actions should I be taking

to pursue the goal that the car is trying to do

like if it's trying to drive down the street or if it wants to make a right-hand turn

it has to decide okay what are the things I need to do from this point forward

to achieve that goal

and then at the end there's

this layer called control which is

it takes one action at a time

it gets the car to do that

so the control stuff is it actually turns the steering wheel or activates the brakes

and so control is all written by people
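
To make the three layers Douma describes easier to picture, here is a minimal, hypothetical Python sketch of how a perception → planning → control loop could be organized. The function names, thresholds, and data shapes are my own illustration, not Tesla's actual code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str         # e.g. "pedestrian", "car", "stop_sign"
    distance_m: float  # estimated distance from the vehicle
    probability: float

def perceive(camera_frames) -> List[Detection]:
    """Perception: NNs turn raw camera frames into usable quantities
    (where the pedestrians, cars, lane markings are). Stubbed out here."""
    return [Detection("pedestrian", 12.0, 0.97)]

def plan(detections: List[Detection]) -> str:
    """Planning: given the perceived scene, decide the next action.
    In the real system this is a mix of human-written code and NN outputs."""
    if any(d.label == "pedestrian" and d.distance_m < 15 and d.probability > 0.9
           for d in detections):
        return "slow_down"
    return "keep_lane"

def control(action: str) -> None:
    """Control: human-written code that turns one action at a time
    into steering / brake / throttle commands."""
    print(f"actuating: {action}")

# one tick of the loop
control(plan(perceive(camera_frames=None)))
```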

the planning part is kind of a mix

at this point probably

i've seen some outputs from the cameras which are clearly intended for planning

they're not just what i'm seeing

the simplest one of those to understand is one of the things that NNs do is

they guess at a path through the scene that's ahead of it

the car is almost always moving forward

so in addition to here are the cars

and here are the lane markings and that kind of stuff

the NN makes a suggestion and it draws this three-dimensional spline

which is just a curve with a couple of bends in it

through the scene ahead of it and that's a recommendation for

where the car probably wants to go

this is the NN looking at this saying

this is probably the way forward

so on a curving road the spline would follow the curve

if you're in a lane and there's a car ahead of you

the spline might go around that car for instance

it'll make some suggestions along those lines so that's a planning function

it's not just a sensing function and

there are other things going on

that we can see in the camera networks

that are planning related
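
As a rough illustration of the path-proposal output described above (a 3D spline, i.e. a curve with a couple of bends drawn through the scene ahead), here is a small sketch that evaluates a cubic Bézier curve through a few hypothetical control points. The control-point values are invented purely for illustration.

```python
import numpy as np

# Hypothetical control points in vehicle coordinates (x forward, y left, z up), metres.
P = np.array([[ 0.0, 0.0, 0.0],
              [10.0, 0.5, 0.0],
              [20.0, 2.0, 0.1],
              [30.0, 2.5, 0.1]])

def cubic_bezier(p, t):
    """Evaluate a cubic Bézier curve at parameter t in [0, 1]."""
    b = np.array([(1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t ** 2 * (1 - t), t ** 3])
    return b @ p

# Sample the proposed path, the sort of curve shown as the yellow spline on screen.
path = np.array([cubic_bezier(P, t) for t in np.linspace(0.0, 1.0, 20)])
print(path[:3])
```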

but the overwhelming majority of stuff is just answering questions about

what's the situation outside the car right now

the cars have been doing that for a long time

over time we've seen the networks get bigger and there have been more variables

some of the variables get broken up into small

you can tell that as they bump into problems in the development of particular features

they'll add additional variables to help them refine their understanding of some phenomenon

that they're trying to break down in a way that's easier for the code

that decides what the car is going to do

we saw a really big change when it went to FSD

first of all the network's got a lot bigger

there were a lot more networks and they were a lot bigger than they were before

now when a NN gets bigger that means you're applying a lot more processing to the input

because you're trying to generate more outputs

but a lot of times you just want better accuracy

so the bigger a NN is and the more you train it on

and the more computation you spend on a particular NN

the higher the accuracy of the output could be

so the outputs of NNs, they're inherently probabilistic so

when it gives you that little screen and

it's telling you where the pedestrians are

it's drawing little boxes where it thinks it sees pedestrians

and each one of them has a probability associated with it

there's a 60 percent probability there's a pedestrian here

and I think there's a 90 percent here or a 95 percent.

of course what you want is for the NN to be as close to perfectly accurate as possible

especially when it's a really important question

and the bigger you make the network and the more training data

the more accurate those numbers will become

those numbers those probabilities will get a lot closer to 99 percent

and you won't see a lot of 50 60 70 percent things

because that's a problem for the people writing the code

what do you do if the car says I think there's a 60 percent chance

there's a pedestrian in front of you

do you brake or do you not

you want the probabilities to get closer to 100 percent and zero percent as the choices

because that makes the programming easier

and it also means that the vehicle is going to have fewer overreactions or underreactions

and the bigger you make the network and the more data you train it on

the closer you get to that ideal of perfection of always seeing a pedestrian when they're there

always seeing the lane markings exactly right and so on

so that's one thing we saw

as we see more networks when we see them get a lot bigger
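
As a concrete illustration of the probabilistic outputs described above, and of why the control code prefers probabilities near 0% or 100%, here is a toy numerical example. The detections and the threshold are made up for illustration only.

```python
# Each detection comes with a probability; the control code has to pick a policy.
detections = [
    {"label": "pedestrian", "prob": 0.60},  # awkward case: brake or not?
    {"label": "pedestrian", "prob": 0.97},  # easy: treat as present
    {"label": "lane_line",  "prob": 0.05},  # easy: treat as absent
]

BRAKE_THRESHOLD = 0.9  # hypothetical threshold a programmer might choose

for d in detections:
    if d["label"] == "pedestrian":
        decision = "brake" if d["prob"] >= BRAKE_THRESHOLD else "no brake"
        print(f'p={d["prob"]:.2f} -> {decision}')

# Bigger networks and more training push these probabilities toward 0 or 1,
# so fewer detections fall into the ambiguous middle band.
```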

now previously I described it as if there was one NN per camera

and in the early days it was one NN per camera when they were running on the GP106 the NVIDIA GPU

they had a limited amount of processing power

so they didn't have the luxury of being able to run completely independent networks on every camera

for the purpose of getting different kinds of things

and now what we see is they do run multiple networks on different cameras

and probably the networks are somewhat context dependent

like they might switch networks

depending on whether they're stopped at a light

or driving on a highway or trying to maneuver through an intersection or something

but we also see the networks get kind of specialized

like there are networks

that are looking for moving objects

because the moving objects have certain things in common

and if you build a network that just looks at moving objects

for instance the time domain aspects of that kind of stuff are important

then there might be another network for static objects which is to say

things that aren't moving relative to the road

like stop signs and trees and the road itself and curbs they don't move

so we see a proliferation of networks

where a single camera might have two three four

or more networks that all run on it, and all these networks run in real time
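
A minimal sketch of the idea of several specialized networks running per camera, with the set of networks possibly depending on driving context. The contexts and network names here are hypothetical; only the structure (one camera, multiple specialized nets, context-dependent selection) is the point.

```python
# Hypothetical registry: which specialized networks run for a camera in a given context.
NETWORKS_BY_CONTEXT = {
    "highway":      ["moving_objects_net", "static_objects_net", "lane_geometry_net"],
    "intersection": ["moving_objects_net", "static_objects_net", "traffic_light_net"],
    "stopped":      ["moving_objects_net", "traffic_light_net"],
}

def run_camera(frame, context: str):
    """Run every specialized network selected for this context on one frame."""
    outputs = {}
    for net_name in NETWORKS_BY_CONTEXT[context]:
        outputs[net_name] = f"<outputs of {net_name}>"  # placeholder for real inference
    return outputs

print(run_camera(frame=None, context="intersection"))
```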


DAVE
I’m wondering what the state of camera fusion is

and are the NNs being applied to that fusion view

or is it still being applied to individual cameras


DOUMA
they have a fusion architecture

where they take a bunch of cameras

so the individual networks still are still producing the outputs

they also have a really big output

that feeds into a common network

that brings all these cameras together

to create like a single fused view of that

and then the fused networks go into a bird's eye view sort of network after that

and what the bird's eye view network does is

it asks the car to imagine what would the world look like

if I were looking down on the car from a great height

like give me a map of the car and its surroundings

so when you look at the display in the car

if you've watched the FSD videos

that's pretty close to what's coming out of the bird's eye view networks

for instance if you're driving down a road the bird’s eye view network would show the curb next to you

and it would show things that were in the median or on the sidewalk next to you

and things on the other side of them also

the bird’s eye network it'll also guess

if you're driving past a wall or if you're driving past another car

the bird’s eye network doesn't see

what's on the other side of the car which is occluded

the car can't actually see what's on the other side of the car

it'll guess based on what it sees in front of and behind the car

if it sees a curb extending past a car it'll guess that the curb is extending through

and the bird’s eye network is asked to bring together the top-down view from all these different cameras

to synthesize a unified view

and it's also asked to guess about the things that you can't see

and you can see this on some of the FSD videos

where it'll guess sometimes incorrectly about things that it can't see

so when the vehicle is driving past an obstacle

the things on the far side of the obstacle they might vary

or you might see a pedestrian will walk behind a car

and then the network will guess that the pedestrian continues for a little while

and then the pedestrian will vanish because at some point the AP is no longer sure

if the pedestrian is still there

maybe the pedestrian stopped walking

or maybe they turned and they went some other way

the bird’s eye networks also incorporate time

the camera networks all come together to create a unified view of one frame

and that those get fed into the bird’s eye network

then the bird’s eye network looks back over multiple frames

now so this is something

that is pretty hard to see in the networks

by the nature of how this stuff goes

so I don't get to see a lot of it

but Karpathy's talked a number of times about this at scaled ML

where he talked in some significant amount of detail

about the architecture they were using in the vision networks and

how they were doing in the bird’s eye networks

so everything I’ve seen is consistent with what he talked about

so my sense is that

what he talked about on scaled ML a year ago

is a pretty accurate representation of what they're doing on the car

so to give a short answer to your question

the network's got a lot bigger

they added a lot of this bird's eye top-down stuff to the systems

and they've added temporal integration

so they're looking across time in addition to just static frames
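
To keep track of the dataflow Douma and Karpathy describe (per-camera backbones, a common fusion network, a bird's-eye-view head, and temporal integration over several frames), here is a shape-only numpy sketch. All layer contents are random placeholders; only the structure (per-camera features → fused features → BEV grid → stack over time) is meant to be illustrative, and none of the sizes are Tesla's real ones.

```python
import numpy as np

N_CAMERAS, FEAT, BEV_H, BEV_W, T = 8, 256, 200, 200, 4

def camera_backbone(frame):
    """Per-camera vision network: frame -> feature vector (placeholder)."""
    return np.random.rand(FEAT)

def fuse(per_camera_feats):
    """Common network: bring all camera features together into one fused feature."""
    return np.concatenate(per_camera_feats)             # shape (N_CAMERAS * FEAT,)

def bev_head(fused):
    """Bird's-eye-view network: fused features -> top-down grid around the car."""
    return np.random.rand(BEV_H, BEV_W)                 # placeholder BEV map

frames = [None] * N_CAMERAS
bev_over_time = np.stack([
    bev_head(fuse([camera_backbone(f) for f in frames]))
    for _ in range(T)                                   # temporal integration: look back T frames
])
print(bev_over_time.shape)   # (4, 200, 200)
```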

 

DAVE
how do you think the FSD software

for example the bird's eye view seems like it's giving a broader view

but then you also have this forward facing view

let's say the main view of the forward-facing cameras

let's say something is happening

when you're going through a turn

there's some type of obstacle

the forward facing cameras sees that obstacle

the bird's eyes view might see something a little different

how does the software reconcile those two different views

and what priority does it give

 


DOUMA
the bird's eye view is a product

the car of course it can't see down from the top

it's got no way of directly perceiving that

so bird's eye view it accomplishes kind of two independent things

one of them is that

it's a way of asking the car to reconcile all the different cameras

because if you're looking down from the top

no individual camera can see everything around the car from the top

so if you're going to generate a bird's eye view

essentially you've got a little square map and the car's in the center

that's one way of asking the NN to fuse all the camera views

because the front camera looks forward

and the other cameras look to the side

and if you ask it okay now put all that together

and tell me what the whole picture looks like

you can train NNs to take almost any kind of input

and generate almost any kind of output

but you have to have a way of asking a question that's relevant

so that you're challenging the network to come up with outputs that make sense

so that when you train it

you're training it to make sense

in the context of what you're trying to accomplish

maybe the most important thing the bird's eye view network approach does is

it asks the car to synthesize to put everything together into a picture

that makes sense

you're asking a question that forces BEV to reconcile the multiple overlapping camera views

because if you don't challenge it to do that

it won't learn to do it

so you want to ask a question that's the simplest question you can ask

that at least includes the thing you want

if what you want is to integrate all the camera views into a holistic understanding of what the car's environment is

one thing you can do is

asking the network what would it look like

given all these camera inputs

what do you think it would look like

if I was looking down on the car from the top

so they're asking that question

and that's something they can answer

so that they can determine a ground truth

and provide an error function to feedback to the network

to challenge it to get better

so that's one thing
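
A sketch of that training idea: because a ground-truth top-down map can be determined (offline), an error function can be computed against the network's BEV prediction and fed back to improve it. The sketch uses a plain binary cross-entropy on a toy occupancy grid; both grids are random placeholders, so only the loss construction is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

pred_bev  = rng.uniform(0.01, 0.99, size=(200, 200))     # network's predicted occupancy probabilities
truth_bev = (rng.uniform(size=(200, 200)) > 0.5) * 1.0   # ground-truth top-down occupancy (placeholder)

# Binary cross-entropy: the "error function" that gets fed back to improve the network.
bce = -np.mean(truth_bev * np.log(pred_bev) + (1 - truth_bev) * np.log(1 - pred_bev))
print(f"BEV loss: {bce:.3f}")
```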

the other thing that a bird's eye network does is

it's a domain that's easy to write software in

so imagine that you have to write the software to control the car

if I have a camera view forward

and you've got a pedestrian in front of the car

you have to guess how far the pedestrian is

just having a pedestrian in front of the car isn't sufficient to make a decision

about what you should do

you're driving on a curve and that pedestrian's on a sidewalk

when you ask the NN to create a bird's eye view

you're also generating an output that's an easy output

for a programmer to write rules on

because a programmer can look at the bird's eye output

and he can say okay tell me where the road is

so here's the road

and you can ask the question

is this pedestrian in the road or are they not on the road

it's not like “Are they in front of me or not”

once you've asked the NN to create this map of the environment

now your programmers have a map to work with

to make decisions about how they want to control the car

so you're getting two things out of the bird’s eye network

one of them is you're getting a straightforward framework

for fusing all these cameras together to get a kind of holistic view
which is a way of asking the network to reconcile what it sees on different cameras

and then you're also getting an output that's actually useful

because the people who are writing the planning code and the dynamic control code

they now have a representation that they can work with

it's easy framework for a human programmer to work with rules inside

so bird's eye networks are a very clever solution to that
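
And a sketch of the second point: once the environment is expressed as a top-down map, a programmer can write simple rules against it, e.g. "is this pedestrian on the road or not", instead of reasoning about a forward camera image. The grid size, road mask, and pedestrian positions below are invented for illustration.

```python
import numpy as np

# 20 m x 20 m BEV grid at 0.5 m resolution, car at the centre.
RES = 0.5
road_mask = np.zeros((40, 40), dtype=bool)
road_mask[:, 15:25] = True            # hypothetical road: a vertical band of cells

def cell(x_m, y_m):
    """Convert metres (relative to the car) to a grid cell."""
    return int(y_m / RES) + 20, int(x_m / RES) + 20

def pedestrian_on_road(x_m, y_m) -> bool:
    r, c = cell(x_m, y_m)
    return bool(road_mask[r, c])

print(pedestrian_on_road(1.0, 5.0))   # pedestrian slightly ahead, near the car's lane
print(pedestrian_on_road(-8.0, 3.0))  # pedestrian well off to the side
```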

you could imagine another thing

I imagined that they were trying to make a full three-dimensional sort of virtual model

the world is three-dimensional

a vehicle that's in front of you it has a height a width and a depth and

it occupies some position relative to you

and so as a human being when you're sitting in a driver's seat

you see another vehicle

you have this sense of that thing in space out ahead of you

that you're in a volume of space and

because that's a simple and accurate representation of the reality that we're in

it's a good framework to be able to understand everything

and to work in

but the thing is

it's really challenging if you're asking the simplest thing

the NN could give you this completely all-encompassing depth with everything

which is a complete description of the world

if you ask for that and if you challenge a NN to do that

eventually it'll be able to do it

but it's not the simplest thing you can ask that forces it to figure that out

it is just like asking the NN

well what would this scene look like from a different perspective than the one I'm at

that also requires a NN to understand that the world is three-dimensional

and other objects which occupy space separate from the vehicle

a NN it doesn't know any of this stuff

it's a complete blank slate

it doesn't even know what a child knows

when you start with it every single little aspect of what it learns about reality

is something it has to figure out from the data that you're giving it

so you have to ask it questions

that challenge it to come up with simple and powerful representations of the world

that you can also build on to write code to control the vehicle

getting a NN to understand that the world is three-dimensional is actually really challenging

we're giving NN a bunch of 2D images

though they're 2D projections onto a 3D world

but it's got to make decisions in 3D and somehow we have to stimulate it to understand that

it's looking at a three-dimensional world

it's not looking at four or five dimensions I guess it's four in a sense

it's a moving three-dimensional world and that's not at all obvious

in the way that we build the NNs

we have to challenge them to figure that out

and the bird's-eye view is a simple clever solution to stimulating the network to figure that out

because you only have a bird's eye view in a three dimensional universe

 

DAVE
let's say the planning module or the planning code in FSD

are they relying more and more on the bird's eye view for planning

because before the bird's eye view

you didn't have that so you're relying more on just the camera view

are you seeing a shift over to more planning on the bird's eye view at all

 

DOUMA
so I’m mostly looking at NN architectures and

I have to infer what they're doing

based on the outputs

the bird's eye view outputs are going to be a lot easier to work with

than the field of view outputs are

I’m sure that they're making very heavy use of that

one of the things that we see in the evolution of these things over time is

when they had a new piece of capability come out

that we were going to see this discontinuous change in how they were doing things

it's never really been that way

they add new networks

they gradually transform the old ones

they consolidate old ones

I haven't seen them really get rid of anything so they still have all the

outputs that they had before

my guess is that

when the people writing the planning code

suddenly had bird's eye top-down stuff

they didn't immediately abandon the way that they'd been doing stuff

they started integrating it into the way they were doing planning

and over time they'll probably rely more and more on it

as they know they can trust it

and they figure out how to use it effectively

and then gradually the ways they were using the field of view or POV representations

will just kind of gradually go by the wayside

but bird's eye view is super powerful

and with the temporal integration they get the 4D stuff

they were in 2D before where they had snapshots

and now they're challenging the system to understand that it's a 3D world

and bird's eye view is a important component of how they're doing that

and they're asking NN to understand that

things evolve over time

if you see multiple frames in a row

if a block is traveling through a scene

and it's labeled truck and it's 90 90 90 80 90 you see it five frames in a row

well you can be a lot more confident that the truck is actually there

if you see it in multiple frames

and so little variations in the probability

it doesn't affect your confidence that it's actually a truck

if you have to make decisions based on a single snapshot they do affect your confidence.
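
A toy example of that point about multiple frames: if per-frame confidences for the same tracked object are 0.9, 0.9, 0.9, 0.8, 0.9, the aggregated belief that the truck is really there is much higher than any single frame suggests. The simple independence-based combination below is just an illustration of the intuition, not Tesla's actual tracking math.

```python
import numpy as np

per_frame_p = [0.9, 0.9, 0.9, 0.8, 0.9]   # "truck" confidence over five consecutive frames

# Naive aggregation assuming independent per-frame errors:
# probability that the detection was a false positive in every single frame.
p_all_false = np.prod([1.0 - p for p in per_frame_p])
p_truck = 1.0 - p_all_false
print(f"single frame: {per_frame_p[0]:.2f}, after 5 frames: {p_truck:.5f}")
```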

 


DAVE
how do they test and train like for example

for more stationary objects

the neural net will output different boxes or identify different objects

a cat, a car, a truck etc

then you could train it

you could show them the correct things and

you could go back and correct the incorrect object etc

to make the neural net improve but

with the bird's eye view

do you think some type of training is going on

in terms of correcting incorrect type of things

and how is that training being done


DOUMA
it's not nearly as straightforward as deep neural net training

if you show me where the pedestrians are in this photo

if I have a bunch of photos and I draw boxes around all the pedestrians

I can challenge the NN like this

just give me this output like

here's a pedestrian draw box around it

so I have a bunch of pictures with pedestrians which is pretty straightforward

but if I ask the NN like show me a top-down view

and now put boxes around all the pedestrians

that's a lot harder

so you can do a certain amount of labeling

by pulling in other sources that are naturally top down

such as maps for instance

and this is where it starts getting really interesting

so you can try to synthesize a true top-down view of the environment

and this is when Elon was talking about

video labeling and using DOJO to train in video

and the way that it works is you have a car drive through a scene

all right you capture all the output

from all the cameras the accelerometers

all the other sensors and that kind of stuff

you take all the footage from all those cameras

you put it in DOJO

you put in a really big computer

and that computer walks that data back and forth

and figures out what the ground truth must be

and you don't have to use neural networks to do this

you can use geometric priors and other sort of more straightforward geometric analysis

to figure out what the three-dimensional scene must be in that situation

then you can have a human being look at that three-dimensional scene

on a computer in 3D and say

this is a pedestrian

this is a fire hydrant

these are the lane lines

once the computer's got those labels

it can go back to all the frames that were used to make that scene

and it can label all of those inputs and

it can tell you because it's got the whole 3D scene built inside

if I was looking straight down on the car from the top

this is what I would see it at each instant in time

and you can create this three-dimensional model of the thing

then you can automatically generate all the labels

that you need for training

not just the cameras but also the bird's eye view

for instance DOJO can do a bunch of geometric back-end work on a stream of data

where it knows exactly what happened from the beginning to the end

and it can go back and forth over it a bunch of times

and throw a lot of computation at it

and eventually figure out what the 3D scene is

and generate all the labels what the car has to do with it

we're training a NN to figure out what DOJO can do

with a great deal of computation in the back end

and then DOJO can go figure out all this stuff to create the labels

and then we challenge the NN to do this on the fly while it's driving

so that's one way of doing it
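
A minimal sketch of the auto-labeling step described above: once the offline system has a reconstructed 3D scene and a human has labeled an object once in 3D, the label can be projected back into every camera frame with a standard pinhole camera model. The intrinsics, camera poses, and labeled point below are all hypothetical.

```python
import numpy as np

# Hypothetical pinhole intrinsics (focal lengths and principal point, in pixels).
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])

def project(point_world, R, t):
    """Project one labeled 3D point into a camera frame (pixel coordinates)."""
    p_cam = R @ point_world + t          # world -> camera coordinates
    u, v, w = K @ p_cam
    return u / w, v / w

# One human label placed in the reconstructed 3D scene: a fire hydrant, say.
hydrant = np.array([2.0, 0.5, 10.0])

# Per-frame camera poses; in reality these come from the offline reconstruction,
# so every frame of the clip gets the label for free.
for frame_idx in range(3):
    R, t = np.eye(3), np.array([0.0, 0.0, -0.5 * frame_idx])  # car moving toward the hydrant
    print(frame_idx, project(hydrant, R, t))
```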

there's another technique which Karpathy also talked about in scaled ML

which is a self-supervised training

in self-supervised training

you do a thing where you challenge a NN to tell you

what a scene looks like from one camera

when it's seen from another camera

or you challenge a NN to tell you

that the car is driving down a road at 30 miles an hour

and you look out one of the side cameras at the side of the road

and you see a scene; the cameras take 36 frames a second

so 1/36 of a second forward in time the scene

will have shifted slightly

and I can ask the NN tell me what the next frame looks like

you can ask the NN to predict what a different camera would see

or what it will see at some point in the future

those techniques they're called self-supervised because nobody has to label the data

the system supervises itself

it generates its own inputs now

the inputs you're testing against aren't quite as meaningful to a programmer

because I’m not asking it tell me where the pedestrians are

that you're seeing out the side camera

I’m asking it tell me

what the camera will see in 30 milliseconds in the future

but the thing is

in order to be able to do that trick of predicting what it's going to see a moment in the future

or predicting what another camera will see at the same moment in time

it has to figure out a lot of stuff

about what the scene really looks like

and so that's a different trick that you can use

it produces kind of different outputs

it'll give you some of the geometric understanding of the scene

which the bird's eye view also requires

and which the bird's eye uses in a different way

so these things work together
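
A sketch of the self-supervised idea: ask the network to predict the next frame (or another camera's view) and use the difference from what the camera actually recorded as the error signal, with no human labels involved. The frames and the "prediction" here are placeholders; only the loss construction is the point.

```python
import numpy as np

rng = np.random.default_rng(1)

frame_t      = rng.uniform(size=(120, 160))   # what the side camera sees now (placeholder image)
frame_t_next = rng.uniform(size=(120, 160))   # what it actually sees 1/36 s later

def predict_next(frame):
    """Stand-in for the network's guess at the next frame."""
    return frame  # trivially predicts "no change"; a real model would do far better

# Photometric error between the prediction and what actually happened.
# The recorded future frame *is* the supervision, so nobody has to label anything.
loss = np.mean((predict_next(frame_t) - frame_t_next) ** 2)
print(f"self-supervised loss: {loss:.4f}")
```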

when Tesla started doing this

you could just do scene labeling and you could just ask it

where the pedestrians were in the frame

as the systems become more complicated

and they're looking for greater and greater levels of accuracy

more nuanced understanding of environment is needed

an important thing when you're driving down the road

is what's that pedestrian going to do

so you see a pedestrian standing at a curb

this is a big problem when you drive down the street in San Francisco

you're constantly driving about two feet from a bunch of pedestrians

and there's a lot of difference to how you behave with a pedestrian

who's walking along the sidewalk towards you

and one who's looking at their phone standing on a curb

and you're wondering

if they're going to step out in front of you

so eventually the NNs will have to understand all that stuff in real time too

they're going to have to be able to read the body language and pose and motions

of pedestrians

as well as other vehicles and cyclists and that kind of stuff

so that they can predict what that person's going to do

and take appropriate action in San Francisco

you have to predict what pedestrians are doing and that's a pretty hard challenge

so as the challenges that the network needs to meet grow

as the predictions become more and more challenging

we have to get more and more clever

and not just ask for the data in one way

but ask it a bunch of different ways

so eventually when they've got DOJO working

they can really throw a lot of computer power at this

they'll be able to do a lot of that drive-through scene video

three-dimensional labeling back propagate all that stuff

that's really computationally intensive

 

 

DAVE
correct me if i'm wrong

with video labeling

let's say my model 3 goes through a scene

and right now it's using the NN

in my car to identify objects giving it to planning and then control

but with DOJO let's say

we take that video of the scene

we give it to DOJO

DOJO takes it through a super computer

crazy amount of computation doesn't necessarily have to be neural net

or machine learning

but it could be geometric as well and comes up with pretty much a 3D picture of that scene

and then a human labeler can go through that 3D picture scene

label the key objects

the car moving through
the person crossing the street

the dog going

then we can take those labels and then train back the neural nets

so that neural nets are more accurate in how they perceive that scene

so we're using the 3D construction through DOJO

to create a more accurate picture of the environment

as we label it but then using that to train

the neural nets in the car

so in a sense you're giving these neural nets like super power meaning

they're on a different level now because they're using not only

what DOJO has constructed as the environment

but also you do this tens of millions over scenes

and you train the neural nets to identify the object as well as what let's say DOJO would do

you're giving NN an immense amount of increased accuracy

is that kind of the gist of kind of DOJO and video training

  


DOUMA
you get a lot more accuracy

because essentially you're gonna have a lot more data that's a lot more accurately labeled

so that's one thing that you get out of

but another thing that you get is

if I just label the pictures

and I’m just asking it okay tell me

a human says this is where all the pedestrians are

and I give the unmarked image to a NN

and okay you tell me where the pedestrians are

and then I compare that to reality

I compare that to what the human said

and I generate an error function

I propagate back to improve the network

but one of the things you can't do with that approach is

I can't ask the system to imagine something that's not in the frame

and require it to do a good job of doing that

when we start fusing the networks together

and we build the whole three-dimensional scene

I can start asking

I've got a pedestrian that walked behind an occlusion

a tree a bus or something

and I want the NN to understand that the pedestrian is still there

it's still walking along

I can ask it to imagine that the pedestrian is there

even though I don't have a picture to label

so DOJO can build the whole three-dimensional scene

including objects that are temporarily occluded

and I can ask the NN

tell me what's behind that bus right now

or tell me what's behind that car

because if it's a moving object a human will know this

if you see a car parked across the street from you

and another car transitions through the intersection

you understand that when the car's in the intersection in front of you

the other car is still at the stop light on the other side

humans know that

but neural networks that are trained in a simple way, they don't know that

but with the DOJO approach they can do that

because I’m asking the NN to tell me the whole three-dimensional scene

including the stuff it can't see right now

I was super excited the first time I saw the FSD videos

you could see that the network was labeling stuff it couldn't see on the other side of obstacles

because we were giving them imagination
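
A toy sketch of how an offline system that knows the whole clip could create labels for the occluded interval: the pedestrian is observed before going behind the bus and after emerging, so their position while hidden can be filled in by interpolation and used as a training target for "imagining" occluded objects. The timestamps and positions are invented.

```python
import numpy as np

# Observed pedestrian positions (t in seconds, x/y in metres) before and after the occlusion.
before = (2.0, np.array([10.0, 3.0]))
after  = (6.0, np.array([10.0, 9.0]))   # re-emerges from behind the bus, still walking

def occluded_position(t):
    """Linear interpolation over the hidden interval, consistent with steady walking."""
    (t0, p0), (t1, p1) = before, after
    a = (t - t0) / (t1 - t0)
    return (1 - a) * p0 + a * p1

# Generate training targets for the frames where no camera could see the pedestrian.
for t in [3.0, 4.0, 5.0]:
    print(t, occluded_position(t))
```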


DAVE
so with the whole 3D creation through DOJO

you're mapping these objects and where they're going to be in the future

where they're headed

in a sense predicting

where they're headed through that whole scene but then teaching the neural nets

that movement

you're teaching them how to see ahead


DOUMA
that's the next step

understanding the situation instantaneously

which might include that

where the car is right now and here's its velocity

that's also an instantaneous thing because the vehicle has a velocity

another thing is tell me where the car is going to be in 100 milliseconds or 200 milliseconds

where is that car going to be when I’m 50 feet forward right

that's another level

beyond what we're talking about right now

they are doing that already

you can already see in the FSD videos

the car behaves differently like when

if it approaches an intersection and there's a cyclist coming

it behaves very differently than if the cyclist is stopped at the side of the road

there's already evidence that AP is looking at a scene

and it's predicting what the various dynamic objects in the scene are going to do

and responding to that

in the long run it has to be really good at that we expect that of humans

if you're going to pull out in front of another car

you need to have a sense of where

that car is going to be

when you can get done accelerating up to speed

is my path going to cross that other car's path

at any point in the future

if I do this maneuver so you have to be able to project both your path forward

into the future and the other thing and

understand if there's going to be any undesirable interaction there
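
A simple sketch of that forward-prediction step: extrapolate the other car under a constant-velocity assumption and check whether its future positions come close to our own planned positions. Everything here (velocities, time horizon, the distance threshold) is made up to show the shape of the check.

```python
import numpy as np

dt, horizon = 0.1, 3.0                                           # predict 3 s ahead in 100 ms steps
ego_pos,   ego_vel   = np.array([0.0, 0.0]),  np.array([0.0, 8.0])    # our planned motion, simplified
other_pos, other_vel = np.array([-30.0, 12.0]), np.array([20.0, 0.0])  # car approaching from the left

def conflict(threshold_m=3.0):
    """Do the two projected paths ever come within threshold_m of each other?"""
    for k in range(int(horizon / dt)):
        t = k * dt
        ego   = ego_pos + ego_vel * t
        other = other_pos + other_vel * t
        if np.linalg.norm(ego - other) < threshold_m:
            return True, t
    return False, None

print(conflict())   # e.g. (True, 1.4): the paths cross about 1.4 s from now
```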

 

DAVE

Elon Musk was saying that video auto labeling is the Holy Grail what does he mean by that

 

DOUMA

where you have a car drive through a scene

you take all the data that comes off of the car

you stuff that into DOJO

so DOJO recreates the 3D scene

you can auto label probably almost all that stuff

and so DOJO also has access to the trained NNs from the previous version of the car

so it can run through that scene with those networks and it can do a first pass guess

at where all the stop signs are

and as the networks get better

it's going to be doing a really good job of that

it's going to be 99.999 percent right so

when they first build DOJO and if they're first building these 3D scenes

they'll have to label a lot of stuff

there'll be a lot of details that DOJO isn't getting

but as it gets better DOJO will be able to build this three-dimensional scene

and pre-label thousands of things in the scene and

so then the human labeler's job will be mostly just verifying that DOJO is right

and of course at the tail end of that you don't even need a human in that loop

DOJO can create vast volumes of labeling data

and then you feed that into the NNs

and you close the loop with fewer humans in the process

right now their labelers are limited

they've only got so many labelers

and it's really labor-intensive

when DOJO is labeling it also knows the future

because DOJO's got the whole clip

the whole 10 or 15 seconds

so DOJO knows what the pedestrian was doing in the future

after your car passed the bus and looked back and saw the pedestrian was there

so whereas the NN running on the fly

obviously doesn't know the future

but DOJO gets a whole complete thing

DOJO can run it backwards and forwards and can figure out

what three-dimensional dynamic model of that scene is most consistent with all the sensors saw all the way through the scene from beginning to end

and then we ask the NN at any point in time

to guess at the things the NN doesn't know

and eventually like people it'll get good at those guesses

say you have a pedestrian walk behind a bus

and you imagine the pedestrian keeps walking but maybe the pedestrian stopped

you can't know there's an inherent sort of uncertainty to that

DOJO can know what the pedestrian ultimately did

because it'll know after the car drove past the bus

that the pedestrian did in fact emerge from the back of the bus

and the pedestrian's movement was consistent with walking

then DOJO knows what the pedestrian must have been doing

when they were occluded by the bus

of course the real fleet car will never be able to do that on the fly

because it just can't know the future
all the car can do is make a good guess about that but then that's just a limitation of reality

the networks will eventually get really good at predicting the things which can be predicted

but there will always be things that can't be predicted

you don't know if the pedestrian tripped behind the bus and fell over

that's just hard to predict

 


DAVE
how much of this effort do you think is Tesla

is already doing right now at this moment

creating 3D scenes using that as training is that

something that they're

venturing into right now or is this

something they're kind of waiting for until they get DOJO really up and running

 

DOUMA
They might be building the infrastructure to do that

and it'll be a work in progress for a long time

NN technologies are new enough

anything you want to do is pretty complicated

there's a good chance nobody's done it before

NNs they're very empirical it's not a theory driven domain

we have theories about NNs and why they do what they do

and they're not very good

so you can't use a theory to predict

if I build the NN three times bigger

and I give it this data and I include other data at the same time
well now what will my accuracy be? we can't do that

what we do is we build a rough sketch

and we test the idea to see if it makes sense

you can build a prototype that might be crude

the prototype will help you understand the benefit of doing it

so I think they did that

and they've probably got fairly sophisticated prototypes

now and they're probably building their way up the stack

their tools are going to constantly get better

the tools aren't one or two key features

there are thousands of small features that make the labelers more productive

and that improve the quality and quantity of the output

that you have available for training the networks

FSD is not a product like a toaster where it's just done one day

it'll just keep getting better for a long time

all the way along that every single tool in their arsenal

they'll keep refining as they go

they've probably already prototyped tools that they won't be using in production

for five years or three years or something

and they have other tools that they've been using

for a long time that they're still refining

 

DAVE
how cutting edge is this

Are other people other companies doing stuff like this

Is there anyone else doing this at scale


DOUMA
I don't think there's anybody doing what Tesla's doing at Tesla scale

there are certainly other people who train NNs

and use lots and lots of labeled data

and there are companies that are in the business of just making labeling tools

you have 500 labelers

and here's a tool that they can sit down at their desk

and it'll make them productive and help them avoid errors

so there's a market for those tools there are plenty of companies that are doing that

I kind of doubt anybody else is doing it at the scale that Tesla's doing it right now

I think they probably are building most of the tools that they're using

Because probably none of the commercial tools that are out there

can handle the scale that they're working at

so yes and no

other people are doing it

But I don't think anybody's doing it at the same scale that Tesla are

 

DAVE
do you think by the end of this year

Tesla DOJO will be up and running in some form or fashion

that it will make a significant difference to FSD and how accurate it is

 


DOUMA
so where is DOJO right now

they probably could have the first cut of DOJO silicon done

but DOJO is more than just a silicon chip

they want to make a chip

that supports a particular computational architecture that they want

and Elon's already talked about the numerical format that they want to use

which is a numerical format that nobody else builds in silicon

so they're building their own silicon to do this

but to build a system at scale

that uses lots and lots of these chips requires a lot of power design

a lot of cooling design

communications for these kinds of things is very complicated

and it takes a lot of work to get the communication networks to tile these things together

to build a big machine

and that is a bigger effort than making the silicon

on the other hand you can start using the silicon

if they've got their first version of their chip

they can run off a thousand of those

put them on motherboards, go in the back room and pull a Google

and use regular computer racks and get that thing working and

they'll want to do that to start understanding how these things work together

and verify that the chip works and that kind of stuff

is that DOJO?

I think their aspirations are high enough

they want enough sophistication out of this thing

that there's a good chance that they haven't built a full up DOJO at this point

like a full rack of the final design

but they have early versions

my guess is that right now they can probably buy so many computation resources

that the hardware that they've built probably isn't moving the needle on it

will they do that this year?

maybe if they wanted to they could

I don't know if they'll have a final version of DOJO

when they get to where they start scaling DOJO then I think it'll matter

they'll very quickly get to a point

DOJO drops their cost of computation by an order of magnitude

like out of the gate

so as soon as they get it for the same amount of money they're spending

they get 10 times as much back-end processing

and that'll move the dial on it for them

when they get that

maybe that'll be this year