Tesla AI Day Supercut

www.youtube.com

This is a supercut of the neural network, algorithm, and simulation portions of Tesla's AI Day.

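Some of the techniques explained here are already used in industry, so the techniques themselves are not the moat. What matters is that Tesla has integrated all of them into a system that could be called a simulation of the human visual cortex, and has assembled the data and compute needed to train it at scale.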

Other companies simply cannot replicate this.

The only company taking an approach close to Tesla's is George Hotz's comma.ai, but it targets a different market and is nowhere near Tesla in data collection, so it trails Tesla by a wide margin in self-driving.

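Meanwhile, engineers at the countless other companies, especially those building 3D maps with LiDAR, are probably asking themselves: "We are putting out demos for now to fool executives who know nothing, but does this technology really lead to a future where self-driving works?"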

Executives at those companies are probably delighted by their own "demos for the sake of demos" while spending their days working out how to use them to dazzle customers.

The talk proceeds in this order:

① Andrej: overall architecture and generating the vector space

② Ashok: auto labeling

③ Andrej: data collection

④ Ashok: simulation

The topics are as follows.

〇 Elon takes the stage

f:id:stockbh:20211103154630p:plain

what we want to show today is that tesla is much more than an electric car company that we have deep AI activity in hardware on the inference level on the training level

we're arguably the leaders in real world AI as it applies to the real world those of you who have seen the full self-driving beta can appreciate

the rate at which the tesla neural net is learning to drive this is

a particular application of AI but there's more

there are applications down the road that will make sense

i want to encourage anyone who is interested in solving real-world AI problems at either the hardware or the software level to join tesla or consider joining tesla

〇 Andrej takes the stage

i lead the vision team here at tesla autopilot and i'm incredibly excited to be here to kick off this section

giving you a technical deep dive into the autopilot stack and showing you all the under the hood components that go into making the car drive by itself

in the vision component what we're trying to do is design a neural network that processes the raw information

which in our case is the eight cameras that are positioned around the vehicle and they send those images and we need to process that in real time into the vector space

and this is a three-dimensional representation of everything you need for driving

so this is the three-dimensional positions of lines edges curbs traffic signs traffic lights cars their positions orientations depth velocities and so on

〇 This is what we want to do

f:id:stockbh:20211103154953p:plain

here i'm showing the video of the raw inputs that come into the stack and then the neural net processes that into the vector space

and you are seeing parts of that vector space rendered in the instrument cluster on the car

what i find fascinating about this is that we are effectively building a synthetic animal from the ground up

the car can be thought of as an animal

it moves around it senses the environment and acts autonomously and intelligently

we are building all the components from scratch in-house

so we are building all the mechanical components of the body the nervous system which is all the electrical components

and for our purposes the brain of the autopilot and specifically for this section the synthetic visual cortex

now the biological visual cortex actually has quite an intricate structure and a number of areas that organize the information flow of the human brain

in your visual cortex light hits the retina

it goes through the LGN all the way to the back of your visual cortex

then goes through areas v1 v2 v4 the IT

the ventral and dorsal streams and

the information is organized in a certain layout

when we are designing the visual cortex of the car

we also want to design the neural network architecture of how the information flows in the system

〇 The visual cortex analogy

f:id:stockbh:20211103155334p:plain

the processing starts when light hits our artificial retina and we are going to process this information with neural networks

now i'm going to roughly organize this section chronologically

starting off with the neural networks

what they looked like four years ago when i joined the team and how they have developed over time

four years ago the car was mostly driving in a single lane going forward on the highway

and it had to keep lane and it had to keep distance away from the car in front of us

and at that time all of the processing was only on the individual image level

so a single image has to be analyzed by a neural net

and processed into little pieces of the vector space

so this processing took the following shape

we take 1280 by 960 input and this is 12 bit integers streaming in at roughly 36 hertz

now we're going to process that with the neural network

〇 Residual NNs → RegNets

f:id:stockbh:20211103155650p:plain

so we instantiate a feature extractor backbone

(a feature extraction backbone built from residual NNs)

in this case we use residual neural networks

we have a stem and a number of residual blocks connected in series

(the stem and the residual blocks)

now the specific class of resnets that we use are called regnets

regnets offer a very nice design space for neural networks

(the NN design space offered by RegNets)

because they allow you to nicely trade off latency and accuracy

(the trade-off between latency and accuracy)

〇REGNETS

f:id:stockbh:20211103160116p:plain

now these regnets give us a number of features as an output at different resolutions in different scales

so in particular on the very bottom of this feature hierarchy

we have very high resolution information with very low channel counts

Bottom (160×120×64) → captures detail

and all the way at the top we have low spatial resolution but high channel counts

Top (20×15×512) → captures the overview and context

so on the bottom we have a lot of neurons that are really scrutinizing the detail of the image

and on the top we have neurons that can see most of the image and have a lot of that scene context
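
As a rough illustration of the feature hierarchy described above, the sketch below derives the level shapes from the 1280 by 960 input. The bottom (160×120×64) and top (20×15×512) levels are the ones noted above; the strides and the two intermediate channel counts are assumptions made for the example.

```python
def pyramid_shapes(width=1280, height=960,
                   strides=(8, 16, 32, 64),          # assumed downsampling factors
                   channels=(64, 128, 256, 512)):    # middle two values are assumptions
    """Return (width, height, channels) for each level of the feature hierarchy."""
    return [(width // s, height // s, c) for s, c in zip(strides, channels)]


shapes = pyramid_shapes()
for (w, h, c), stride in zip(shapes, (8, 16, 32, 64)):
    # Higher levels have fewer positions, but each one summarises a larger patch.
    print(f"{w} x {h} x {c}: {w * h} positions, one per {stride}x{stride} input pixels")
```

The bottom level keeps 19,200 positions with few channels (detail), while the top level keeps only 300 positions with many channels (context).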

〇BiFPN

f:id:stockbh:20211103160206p:plain

then we like to process this with feature pyramid networks (FPN)

in our case we like to use BiFPNs

residual NN = RegNets

feature pyramid networks = BiFPNs

and they get the multiple scales to talk to each other effectively and share a lot of information

so for example if you're a neuron all the way down in the network

you're looking at a small patch and you're not sure this is a car or not

it helps to hear from the top layers

like hey you are actually in the vanishing point of this highway

and so you can disambiguate that this is probably a car

〇 Vehicle detection in the detection head

f:id:stockbh:20211103160415p:plain

after BiFPN and a feature fusion across scales

then we go into task specific heads

so for example if you are doing object detection

we have a one stage yolo like object detector here

where we initialize a raster and there's a binary bit per position telling you whether or not there's a car

and then in addition to that if there is a car

here's a bunch of other attributes you might be interested in

so the x y width height offset or any of the other attributes like what type of a car is this and so on

so this is for the detection by itself
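
To make the raster idea concrete, here is a minimal sketch of decoding such an output: one score per grid position, plus per-position attributes. The cell size, the threshold, and the attribute layout (dx, dy, width, height) are assumptions chosen for the example, not details given in the talk.

```python
def decode_detections(objectness, attributes, cell_size=8, threshold=0.5):
    """Turn a per-position score raster plus attributes into a list of boxes."""
    detections = []
    for row, (score_row, attr_row) in enumerate(zip(objectness, attributes)):
        for col, (score, attr) in enumerate(zip(score_row, attr_row)):
            if score < threshold:          # the "binary bit": no car at this position
                continue
            dx, dy, width, height = attr   # offsets are fractions of a cell (assumed)
            detections.append({
                "cx": (col + dx) * cell_size,
                "cy": (row + dy) * cell_size,
                "w": width,
                "h": height,
                "score": score,
            })
    return detections


empty = (0.0, 0.0, 0.0, 0.0)
objectness = [[0.1, 0.0, 0.2],
              [0.0, 0.1, 0.9]]
attributes = [[empty, empty, empty],
              [empty, empty, (0.5, 0.25, 40.0, 20.0)]]
detections = decode_detections(objectness, attributes)
```

Only the position with score 0.9 survives the threshold, and its offsets place the box center inside that cell.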

〇 The difference between a single head and a HydraNet

f:id:stockbh:20211103161111p:plain

now very quickly we discovered that we don't just want to detect cars

we want to do a large number of tasks

for example we want to do traffic light recognition and detection, lane prediction and so on

quickly we converged on this kind of architectural layout

where there's a common shared backbone and then branches off into a number of heads

so we call these therefore hydranets and these are the heads of the hydra

this architectural layout has a number of benefits

The benefits of the HydraNet are as follows.

number one because of the feature sharing we can amortize the forward pass inference in the car at test time and so this is very efficient to run

because if we had to have a backbone for every single task that would be a lot of backbones in the car

number two this decouples all of the tasks so we can individually work on every one task in isolation

and for example we can upgrade any of the data sets or change some of the architecture of the head and so on

and you are not impacting any of the other tasks and so we don't have to revalidate all the other tasks which can be expensive

and number three because there's this bottleneck here in features we can cache these features to disk and

when we are doing these fine tuning workflows, we only fine-tune from the cached features up and we only fine tune the heads

in terms of our training workflows

we will do an end-to-end training run

once in a while where we train everything jointly

then we cache the features at the multi-scale feature level

and then we fine-tune off of that for a while

and then end-to-end train once again and so on
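
The shared-backbone layout and the cached-feature workflow can be sketched in a few lines. This is a toy stand-in under stated assumptions: the "image" is a list of numbers, the backbone and heads are hypothetical lambdas, and the cache is just a variable rather than features written to disk.

```python
class HydraNet:
    """Toy shared-backbone model: one feature pass feeds many task heads."""

    def __init__(self, backbone, heads):
        self.backbone = backbone
        self.heads = dict(heads)
        self.backbone_calls = 0

    def extract(self, image):
        self.backbone_calls += 1
        return self.backbone(image)

    def run_heads(self, features):
        return {name: head(features) for name, head in self.heads.items()}


# Hypothetical stand-ins: features are two simple statistics of the input.
backbone = lambda image: {"mean": sum(image) / len(image), "peak": max(image)}
net = HydraNet(backbone, {
    "cars": lambda f: f["peak"] > 0.8,
    "traffic_lights": lambda f: f["mean"] > 0.4,
    "lanes": lambda f: round(f["mean"], 2),
})

cached = net.extract([0.25, 0.875, 0.375])   # one backbone pass for all tasks
outputs = net.run_heads(cached)

# Changing one head reuses the cached features; the other heads are untouched.
net.heads["cars"] = lambda f: f["peak"] > 0.95
retuned = net.run_heads(cached)
```

Three tasks cost one backbone pass, and swapping the `cars` head changes only that task's output, which mirrors the decoupling and fine-tuning benefits listed above.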

〇 Labeling on the image (2D)

f:id:stockbh:20211103161311p:plain

so here are predictions that we were obtaining several years ago from one of these hydranets

again we are processing individual images and we're making a large number of predictions about these images

here you can see predictions of the stop signs the stop lines the lines the edges the cars the traffic lights the curbs whether or not the car is parked all of the static objects like trash cans cones and so on and

everything here is coming out of the hydranet

so that was all fine and great

but as we worked towards FSD we quickly found that this is not enough

where this first started to break was when we started to work on smart summon

i am showing some of the predictions of only the curb detection task

and i'm showing it for every one of the cameras

so we'd like to wind our way around the parking lot to find the person who is summoning the car

now the problem is that you can't just directly drive on image space predictions (driving within the 2D picture shown by the camera images did not work)

you actually need to cast them out and form a vector space around you

〇 Occupancy tracker (no good)

f:id:stockbh:20211103161652p:plain

we attempted to do this using c++ and developed the occupancy tracker at the time

here we see that the curb detections from the images are being stitched up across camera seams, across camera boundaries, and over time

there were two major problems with this setup

The problems are as follows.

number one we very quickly discovered that tuning the occupancy tracker and all of its hyper parameters was extremely complicated

you don't want to do this explicitly by hand in c++

you want this to be inside the neural network and train that end-to-end

number two we very quickly discovered that the image space is not the correct output space

you don't want to make predictions in image space

you really want to make it directly in the vector space

〇 What happens when the 2D labels are projected into the vector space...

f:id:stockbh:20211103161830p:plain

here's a way of illustrating the issue

i'm showing on the first row the predictions of our curbs

and our lines in red and blue they look great in the image

but once you cast them out into the vector space

things start to look really terrible and we are not going to be able to drive on this

you can see how the predictions are quite bad and the reason for this is because you need to have an extremely accurate depth per pixel in order to actually do this projection

(this approach needs extremely accurate depth for every pixel → that is not feasible)

and so you can imagine just how high the bar is

to predict that depth so accurately in every single tiny pixel of the image

and also if there's any occluded area where you'd like to make predictions

you will not be able to predict it because it's not an image space concept
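
A back-of-the-envelope sketch shows why per-pixel depth has to be so accurate. It assumes a pinhole camera over a perfectly flat road; the camera height (1.4 m) and focal length (1000 px) are illustrative assumptions, not Tesla's values.

```python
def ground_distance(rows_below_horizon, camera_height_m=1.4, focal_px=1000.0):
    """Distance to the flat-ground point seen a given number of pixel rows below the horizon."""
    return camera_height_m * focal_px / rows_below_horizon


# Effect of being off by a single pixel row, near the car and far away.
near_error_m = ground_distance(99) - ground_distance(100)   # point about 14 m ahead
far_error_m = ground_distance(9) - ground_distance(10)      # point about 140 m ahead

print(f"near: {ground_distance(100):.1f} m, one-pixel error {near_error_m:.2f} m")
print(f"far:  {ground_distance(10):.1f} m, one-pixel error {far_error_m:.2f} m")
```

Under these assumptions a one-pixel mistake moves a nearby point by centimeters but a distant point by many meters, which is why image space predictions that look fine can fall apart once cast out into the vector space.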

〇 The problem with detecting per camera and then fusing

f:id:stockbh:20211103162034p:plain

the other problem with this shows up in object detection

if you are only making predictions per camera then sometimes you will encounter cases like this

where a single car actually spans five of the eight cameras

so if you are making individual predictions

then since no single camera sees all of the car you're not going to be able to do a very good job of predicting that whole car

and it's going to be incredibly difficult to fuse these measurements

〇 How the multi-camera images were integrated into the vector space

f:id:stockbh:20211103162342p:plain

so instead we'd like to take all of the images and simultaneously feed them into a single neural net and directly output in vector space

now this is very easily said much more difficult to achieve

but roughly we want to lay out a neural net in this way

where we process every single image with a backbone and then we want to fuse them

and we want to re-represent the features

from image space features to directly vector space features

(the difference between image space features and vector space features)

and then go into the decoding of the head

(each head then decodes them)

2-1 Transformer (a currently popular technique)

so there are two problems with this

number one how do you actually create the neural network components that do this transformation

you have to make it differentiable so that end-to-end training is possible

2-2 Extracting features for the vector space

number two if you want vector space predictions from your neural net

you need vector space based datasets

just labeling images and so on is not going to get you there

you need vector space labels

for now i want to focus on the neural network architectures

i'm going to deep dive into problem number one

〇 Which part of each camera image corresponds to a given part of the BEV

f:id:stockbh:20211103162736p:plain

we're trying to have this bird's eye view prediction instead of image space predictions

for example let's focus on a single pixel in the output space in yellow

and this pixel is trying to decide

am i part of a curb or not

where should the support for this kind of a prediction come from in the image space

we know how the cameras are positioned and

their extrinsics and intrinsics so we can roughly project this point into the camera images

and the evidence for whether or not this is a curb may come from somewhere here in the images

the problem is that this projection is really hard to actually get correct

because it is a function of the road surface and the road surface could be sloping up or sloping down

also there could be other data dependent issues for example there could be occlusion due to a car

so if there's a car occluding this part of the image

then actually you may want to pay attention to a different part of the image

not the part where it projects

and because this is data dependent it's really hard to have a fixed transformation for this component
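
The dependence on the road surface is easy to see with a pinhole projection. The sketch below projects the same bird's eye view position into a forward camera twice, once on a flat road and once on a road sloping up by 3%. All camera parameters are assumptions for illustration, and the camera is taken to look straight ahead with no rotation.

```python
def project_to_image(x_fwd, y_left, z_up,
                     cam_height=1.4, focal_px=1000.0, cx=640.0, cy=480.0):
    """Project a point in the vehicle frame (meters) into pixel coordinates (u, v)."""
    right = -y_left              # image u grows to the right
    down = cam_height - z_up     # image v grows downward
    u = cx + focal_px * right / x_fwd
    v = cy + focal_px * down / x_fwd
    return u, v


# The same output-space cell, 50 m ahead and 2 m to the left.
flat_u, flat_v = project_to_image(50.0, 2.0, z_up=0.0)
slope_u, slope_v = project_to_image(50.0, 2.0, z_up=0.03 * 50.0)   # 3% uphill

print(f"flat road:   u={flat_u:.0f}, v={flat_v:.0f}")
print(f"sloped road: u={slope_u:.0f}, v={slope_v:.0f}")
```

With these numbers the supporting evidence moves by about 30 pixel rows purely because of the slope, so a fixed projection would look in the wrong place.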

〇 Using a transformer

f:id:stockbh:20211103163023p:plain

so in order to solve this issue we use a transformer to represent this space

this transformer uses multi-headed self-attention and blocks thereof

in this case we can get away with even a single block doing a lot of this work effectively

what this does is you initialize a raster of the size of the output space

and you tile it with positional encodings

with sines and cosines in the output space

and then these get encoded with an MLP into a set of query vectors

and then all of the images and their features also emit their own keys and values

and then the queries keys and values feed into the multi-headed self-attention

what's happening is that every single image piece is broadcasting what it is a part of in its key

i'm part of a pillar in roughly this location

and i'm seeing this kind of stuff and that's in the key

then every query is along the lines of hey i'm a pixel in the output space at this position

and i'm looking for features of this type then the keys and the queries interact multiplicatively

and then the values get pulled accordingly

and so this re-represents the space

and we find this transformation to be very effective if you do all of the engineering correctly

this again is very easily said difficult to do

you need to do all of the engineering correctly
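
The query, key, and value interaction described above can be sketched with a single attention head in plain Python. This is a simplified illustration, not the production design: the MLP that encodes the queries is skipped, the keys are hand-made from the location each image piece claims to be at, and the frequencies are arbitrary assumptions.

```python
import math


def positional_encoding(x, y, freqs=(1.0, 2.0)):
    """Sine/cosine encoding of a 2D position (frequencies are assumed)."""
    enc = []
    for f in freqs:
        enc += [math.sin(f * x), math.cos(f * x), math.sin(f * y), math.cos(f * y)]
    return enc


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights


# Query: "i'm a pixel in the output space at position (1.0, 2.0)".
query = positional_encoding(1.0, 2.0)
# Keys: each image piece broadcasts roughly where it is; values carry its features.
keys = [positional_encoding(1.0, 2.0), positional_encoding(3.0, 0.5)]
values = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attend(query, keys, values)
```

The query and keys interact multiplicatively through the dot product, and the output is pulled mostly from the value whose key matches the queried position.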

〇 Camera calibration

f:id:stockbh:20211103163253p:plain

so one more thing you have to be careful with some of the details here when you are trying to get this to work

in particular all of our cars are slightly cockeyed in a slightly different way

and so if you're doing this transformation from image space to the output space

you really need to know what your camera calibration is

and you need to feed that into the neural net

and so you could definitely just concatenate the camera calibrations of all of the images

and somehow feed them in with an MLP

but we found that we can do much better by transforming all of the images into a synthetic virtual camera

using a special rectification transform

〇 Rectification → common virtual camera

f:id:stockbh:20211103163436p:plain

so this is what that would look like

we insert a new layer right above the image which is a rectification layer

it's a function of camera calibration and it translates all of the images into a virtual common camera

so if you were to average up a lot of repeater images for example, which face backward

without doing this you would get a kind of a blur

but after doing the rectification transformation

you see that the back mirror gets really crisp

this improves the performance quite a bit
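
One way to picture a rectification transform is as a rotation of each pixel's viewing ray from the real camera into a shared virtual camera. The sketch below handles only a yaw mounting error and assumes the same pinhole intrinsics for both cameras; the talk does not specify the actual transform, so every parameter here is an assumption.

```python
import math


def rectify_pixel(u, v, yaw_error_rad, focal_px=1000.0, cx=640.0, cy=480.0):
    """Map a pixel from a camera yawed to the right by yaw_error_rad into the virtual camera."""
    # Viewing ray in the real camera frame (x right, y down, z forward).
    x = (u - cx) / focal_px
    y = (v - cy) / focal_px
    z = 1.0
    # Express the same ray in the virtual (nominal) camera frame.
    c, s = math.cos(yaw_error_rad), math.sin(yaw_error_rad)
    x_virtual = c * x + s * z
    z_virtual = -s * x + c * z
    # Re-project with the virtual camera's intrinsics.
    return cx + focal_px * x_virtual / z_virtual, cy + focal_px * y / z_virtual


yaw = 0.02   # this particular car's camera is mounted about 1.1 degrees off
# A point straight ahead of the car lands left of center in the miscalibrated image.
u_seen = 640.0 - 1000.0 * math.tan(yaw)
u_fixed, v_fixed = rectify_pixel(u_seen, 480.0, yaw)
```

After the transform the point is back at the image center, so the same world direction maps to the same pixel on every car, which is what makes the averaged images crisp.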

〇 Generating a drivable vector space

f:id:stockbh:20211103163607p:plain

so here are some of the results on the left we are seeing what we had before

and on the right we're now seeing significantly improved predictions coming directly out of the neural net

this is a multi-camera network predicting directly in vector space

it's basically night and day you can actually drive on this

this took some time and some engineering and incredible work from the AI team to actually get this to work and deploy and make it efficient in the car

this also improved a lot of our object detection

〇 The performance gap between multi-cam and single-cam

f:id:stockbh:20211103163730p:plain

so for example here in this video i'm showing single camera predictions in orange

and multi-camera predictions in blue

you can't predict these cars well if you are only seeing a tiny sliver of a car

your detections are not going to be very good

and their positions are not going to be good

but a multi-camera network does not have an issue

here's another video from a more nominal sort of situation

and we see that as these cars in this tight space cross camera boundaries

there's a lot of jank that enters into the predictions

and the whole setup just doesn't make sense especially for very large vehicles like this one

and we can see that the multi-camera networks struggle significantly less with these kinds of predictions

so at this point we have multi-camera networks and they're giving predictions directly in vector space

but we are still operating at every single instant in time completely independently

〇 The problem with images alone → lack of memory

f:id:stockbh:20211103163938p:plain

so very quickly we discovered that there's a large number of predictions we want to make that actually require the video context and we need to figure out how to feed this into the net

in particular is this car parked or not

is it moving? how fast is it moving? is it still there even though it's temporarily occluded?

or for example if i'm trying to predict the road geometry ahead

it's very helpful to know the signs or the road markings that i saw 50 meters ago

〇 Video neural network architecture

f:id:stockbh:20211103164128p:plain

so we try to insert video modules into our neural network architecture and this is one of the solutions that we've converged on

we have the multi-scale features as we had them from before

and what we are going to now insert is a feature queue module

that is going to cache some of these features over time

and then a video module that is going to fuse this information temporally

and then we're going to continue into the heads that do the decoding

now i'm going to go into both of these blocks one by one

in addition notice here that we are also feeding in the kinematics

this is basically the velocity and the acceleration that's telling us about how the car is moving

so not only are we going to keep track of what we're seeing from all the cameras

but also how the car has traveled

〇 Feature queue module

f:id:stockbh:20211103164310p:plain

so here's the feature queue and the rough layout of it

we are basically concatenating these features over time

and the kinematics of how the car has moved

and the positional encodings and that's being concatenated encoded and stored in a feature queue

and that's going to be consumed by a video module

now there's a few details again to get right

in particular with respect to the pop and push mechanisms

and when do you push

〇 Memory of time and space

f:id:stockbh:20211103164558p:plain

here's a cartoon diagram illustrating some of the challenges

the ego car is coming from the bottom and coming up to this intersection here

and then traffic is going to start crossing in front of us

and it's going to temporarily start occluding some of the cars ahead

and then we're going to be stuck at this intersection for a while and just waiting our turn

this is something that happens all the time and it's a cartoon representation of the challenges

so number one with respect to the feature queue and when we want to push into a queue

obviously we'd like to have a time-based queue

where for example we enter the features into the queue say every 27 milliseconds

and so if a car gets temporarily occluded

then the neural network now has the power to be able to look and reference the memory in time

and learn the association that hey even though this thing looks occluded right now

there's a record of it in my previous features

and i can use this to still make a detection

for example suppose you're trying to make predictions about the road surface and the road geometry ahead

and you're trying to predict that i'm in a turning lane and the lane next to us is going straight

then it's really necessary to know about the line markings and the signs

and sometimes they occur a long time ago

and if you only have a time-based queue you may forget the features

while you're waiting at your red light

so in addition to a time-based queue we also have a space-based queue

we push every time the car travels a certain fixed distance

in this case we have a time-based queue and a space-based queue to cache our features

and that continues into the video module
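
The two push rules can be sketched with a pair of bounded queues. The 27 millisecond interval comes from the talk; the queue length, the 1 m distance step, and the string "features" are assumptions made to keep the example small.

```python
from collections import deque


class FeatureQueue:
    """Keeps recent features by elapsed time and, separately, by distance travelled."""

    def __init__(self, maxlen=3, time_step_ms=27, distance_step_m=1.0):
        self.time_queue = deque(maxlen=maxlen)
        self.space_queue = deque(maxlen=maxlen)
        self.time_step_ms = time_step_ms
        self.distance_step_m = distance_step_m
        self.last_push_ms = None
        self.last_push_m = None

    def update(self, t_ms, odometer_m, features):
        # Time-based push: every fixed interval, whether or not the car moves.
        if self.last_push_ms is None or t_ms - self.last_push_ms >= self.time_step_ms:
            self.time_queue.append(features)
            self.last_push_ms = t_ms
        # Space-based push: only after the car has covered a fixed distance.
        if self.last_push_m is None or odometer_m - self.last_push_m >= self.distance_step_m:
            self.space_queue.append(features)
            self.last_push_m = odometer_m


queue = FeatureQueue()
frames = ["lane_marking", "approach_1", "approach_2"] + ["waiting"] * 7
for i, features in enumerate(frames):
    odometer_m = float(min(i, 2))   # drive 2 m, then sit at the red light
    queue.update(t_ms=27 * i, odometer_m=odometer_m, features=features)
```

While the car waits, the time-based queue fills up with "waiting" frames and loses the lane marking, but the space-based queue still holds it because the car has not moved.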

〇 Video module

f:id:stockbh:20211103165048p:plain

now for the video module we looked at a number of possibilities of how to fuse this information temporally

so we looked at three-dimensional convolutions, transformers, axial transformers

(3D convolutions, transformers, and axial transformers)

in an effort to try to make them more efficient, and recurrent neural networks (RNN) of a large number of flavors

〇 Spatial RNN video module

f:id:stockbh:20211103165219p:plain

the one i want to spend some time on is a spatial recurrent neural network video module

because of the structure of the problem we're driving on two-dimensional surfaces

we can actually organize the hidden state into a two-dimensional lattice

and then as the car is driving around

we update only the parts that are near the car and where the car has visibility

so as the car is driving around

we are using the kinematics to integrate the position of the car in the hidden features grid

and we are only updating the RNN at the points that are nearby us and where we have visibility
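
A minimal sketch of the local update idea: the hidden state is a 2D grid, the car's position is integrated from its kinematics, and only the cells around the car are rewritten. The leaky blend used here is a hypothetical stand-in for a real RNN cell, and the grid size, radius, and gain are assumptions.

```python
def integrate_position(position, velocity, dt):
    """Dead-reckon the car's (row, col) position on the grid from its velocity."""
    return (position[0] + velocity[0] * dt, position[1] + velocity[1] * dt)


def update_hidden_grid(hidden, position, observation, radius=1, gain=0.5):
    """Blend a new observation into the cells near the car; leave the rest untouched."""
    rows, cols = len(hidden), len(hidden[0])
    r0, c0 = int(round(position[0])), int(round(position[1]))
    updated = [row[:] for row in hidden]
    for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1)):
            updated[r][c] = (1 - gain) * hidden[r][c] + gain * observation
    return updated


hidden = [[0.0] * 5 for _ in range(5)]
position = (2.0, 1.0)
hidden = update_hidden_grid(hidden, position, observation=1.0)
position = integrate_position(position, velocity=(0.0, 2.0), dt=1.0)
hidden = update_hidden_grid(hidden, position, observation=1.0)
```

Cells the car has left behind keep what was written earlier, cells seen twice accumulate evidence, and cells never near the car stay at zero.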

〇 A variety of spatial representations emerges

f:id:stockbh:20211103165411p:plain

here's an example of what that looks like

the car is driving around

and we're looking at the hidden state of this RNN

and these are different channels in the hidden state (which acts as short-term or working memory)

so after optimization and training this neural net

some of the channels are keeping track of different aspects of the road

for example the centers of the road the edges the lines the road surface and so on

f:id:stockbh:20211103165634p:plain

so this picture is looking at the mean of the first 10 channels for different traversals of different intersections in the hidden state

there's cool activity as the recurrent neural network is keeping track of what's happening at any point in time

and you can imagine that we've now given the power to the neural network

to actually selectively read this memory and write to this memory

so for example if there's a car right next to us and is occluding some parts of the road

then now the network has the ability not to write to those locations

but when the car goes away and we have a good view

then the recurrent neural net can say okay we have very clear visibility we definitely want to write information about what's in that part of space

〇 Spatial RNN

f:id:stockbh:20211103165845p:plain

here's a few predictions that show what this looks like

here we are making predictions about the road boundaries in red

intersection areas in blue, road centers and so on

we're only showing a few of the predictions here

just to keep the visualization clean

and yeah this is done by the spatial RNN

and this is only showing a single clip

a single traversal but you can imagine there could be multiple trips through here

and a number of cars, a number of clips, could be collaborating to build this map

which is an HD map

except it's not in a space of explicit items

the "HD map" is in a space of features of a recurrent neural network

the video networks also improved our object detection

〇 Fewer dropped detections with the video network

f:id:stockbh:20211103170151p:plain

(When the cars are occluded, the video module still detects both vehicles, while the single-frame network drops them.)

so in this example i want to show you a case where there are two cars over there and one car is going to drive by and occlude them briefly

so look at what's happening with the single frame predictions

and the video predictions as the cars pass in front of us

so a quick playthrough of what's happening: when both of them are in view

the predictions are roughly equivalent

and you are seeing multiple orange boxes

because they're coming from different cameras

when they are occluded the single frame networks drop the detection

but the video module remembers it and we can persist the cars

and then when they are only partially occluded

the single frame network is forced to make its best guess about what it's seeing and it's forced to make a prediction and it makes a terrible prediction

but the video module knows that this is only a partial view

that is not very easily visible right now and doesn't actually take it into account

〇 Improvements in depth and velocity

f:id:stockbh:20211103170448p:plain

we also saw significant improvements in our ability to estimate depth and especially velocity

so here i'm showing a clip from our remove the radar push

where we are seeing the radar depth and velocity in green

and we were trying to match or even surpass the signal just from video networks alone

and what you're seeing here is in orange

we are seeing a single frame performance

and in blue we are seeing again video modules and so you see that the quality of depth is much higher

and for velocity

the orange signal, you can't get velocity out of a single frame network

so we just differentiate depth to get that but the video module is right on top of the radar signal

and so we found that this worked extremely well for us
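
Why differentiating single-frame depth gives poor velocity can be shown numerically. In the sketch below a lead car closes at 2 m/s while each depth estimate is off by 10 cm, alternating in sign. The 36 Hz frame rate is the camera rate mentioned earlier; the noise pattern and the least-squares fit over one second (a simple stand-in for using temporal context, not the video module itself) are assumptions.

```python
def finite_difference_velocity(depths, dt):
    """Velocity from consecutive depth estimates, frame by frame."""
    return [(b - a) / dt for a, b in zip(depths, depths[1:])]


def windowed_slope(depths, dt):
    """Least-squares slope of depth over the whole window."""
    n = len(depths)
    times = [i * dt for i in range(n)]
    t_mean = sum(times) / n
    d_mean = sum(depths) / n
    num = sum((t - t_mean) * (d - d_mean) for t, d in zip(times, depths))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


dt = 1.0 / 36.0
true_velocity = -2.0   # m/s, the gap is closing
depths = [30.0 + true_velocity * i * dt + (0.1 if i % 2 == 0 else -0.1)
          for i in range(36)]

per_frame = finite_difference_velocity(depths, dt)
worst_per_frame_error = max(abs(v - true_velocity) for v in per_frame)
windowed_error = abs(windowed_slope(depths, dt) - true_velocity)
```

Dividing a 20 cm frame-to-frame jump by 1/36 s produces a velocity error of roughly 7 m/s, while the estimate that uses the whole window stays within a few centimeters per second of the truth.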

〇 The overall picture at this stage

f:id:stockbh:20211103170658p:plain

so here's putting everything together this is what our architecture roughly looks like today

we have raw images feeding in at the bottom

they go through a rectification layer to correct for camera calibration

and put everything into a common virtual camera

we pass them through regnets (residual networks) to process them into a number of features at different scales

we fuse the multi-scale information with BiFPN

this goes through a transformer module to re-represent it into the vector space, the output space

this feeds into a feature queue in time or space

that gets processed by a video module like the spatial RNN

and then continues into the branching structure of the hydra net

with trunks and heads for all the different tasks

so that's the architecture roughly what it looks like today

and on the right you are seeing some of its predictions visualized both in a top-down vector space and also in images

this architecture has definitely been complexified from just a very simple image-based single network about three or four years ago

and continues to evolve

now there's still opportunities for improvements that the team is actively working on

you'll notice that our fusion of time and space is fairly late in neural network terms

we can do earlier fusion of space or time

and do cost volumes or optical flow like networks on the bottom

or for example our outputs are dense rasters

and it's actually pretty expensive to post-process some of these dense rasters in the car

and we are under very strict latency requirements so this is not ideal

we actually are looking into all kinds of ways of predicting

just the sparse structure of the road

maybe point by point or in some other fashion that doesn't require expensive post processing

but this basically is how you achieve a very nice vector space

〇 Ashok, who is from India, takes the stage

Solving an optimization problem

f:id:stockbh:20211110172128p:plain

hi everyone my name is ashok i lead the planning and controls auto labeling and simulation teams

the visual networks take dense video data and then compress it down into a 3D vector space

the role of the planner is to consume this vector space and get the car to the destination while maximizing the safety comfort and the efficiency of the car

even back in 2019 our planner was a pretty capable driver

it was able to stay in the lanes make lane changes as necessary

and take exits off the highway

but city driving is much more complicated

there are fewer structured lane lines and vehicles do much more free-form driving

then the car has to respond to all of the cut-ins and crossing vehicles and pedestrians doing funny things

〇 The problems in planning

f:id:stockbh:20211110172334p:plain

what is the key problem in planning

1. number one the action space is very non-convex and

→ meaning there are many cases where the planner falls into a local optimum

2. number two it is high dimensional

what I mean by non-convex is there can be multiple possible solutions that can be independently good but getting a globally consistent solution is pretty tricky

so there can be pockets of local minima that the planning can get stuck in

and secondly the high dimensionality comes because the car needs to plan for the next 10 to 15 seconds

and needs to produce the positions velocities and accelerations over this entire window

there are many parameters to be produced at runtime

discrete search methods are really great at solving non-convex problems

because they are discrete they don't get stuck in local minima

whereas continuous function optimization can easily get stuck in local minima and

produce poor solutions

on the other hand for high dimensional problems

a discrete search sucks

because it does not use any gradient information so it literally has to go and explore each point to know how good it is

whereas continuous optimization uses gradient-based methods to very quickly go to a good solution

〇 Avoiding local optima with hybrid planning

f:id:stockbh:20211110172505p:plain

our solution to this problem is to break it down hierarchically

first use a coarse search method to crunch down the non-convexity and come up with a convex corridor

→ replace it with a convex problem

and then use continuous optimization techniques to make the final smooth trajectory

→ followed by continuous optimization
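
The coarse-then-continuous idea can be tried on a one-dimensional toy problem. The cost function below is an arbitrary non-convex example with two basins, chosen only for illustration; the grid, learning rate, and step count are likewise assumptions and have nothing to do with the real planner's cost.

```python
def cost(x):
    """Toy non-convex cost: global minimum near x = -1.04, local minimum near x = 0.96."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    return 4.0 * x * (x * x - 1.0) + 0.3


def gradient_descent(x, learning_rate=0.05, steps=200):
    """Continuous optimization: fast, but it only finds the bottom of the basin it starts in."""
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


def coarse_search(candidates):
    """Discrete search: evaluates every candidate, so it cannot be trapped by a local minimum."""
    return min(candidates, key=cost)


continuous_only = gradient_descent(1.5)            # starts in the wrong basin
seed = coarse_search([-2.0, -1.0, 0.0, 1.0, 2.0])  # picks the right basin
hybrid = gradient_descent(seed)                    # then refines smoothly inside it

print(f"continuous only: x={continuous_only:.3f}, cost={cost(continuous_only):.3f}")
print(f"hybrid:          x={hybrid:.3f}, cost={cost(hybrid):.3f}")
```

Gradient descent on its own settles in the shallower basin, while the coarse search only needs five evaluations to pick the basin in which the continuous step finds the better solution.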

〇 Running many plans

f:id:stockbh:20211110172752p:plain

let's see an example of how the search operates

so here we're trying to do a lane change

in this case the car needs to do two back-to-back lane changes to make the left turn up ahead

for this the car searches over different maneuvers

the first one is a lane change that's close by

but the car brakes pretty harshly so it's pretty uncomfortable

the next maneuver tried is the lane change

it speeds up　

goes in front of the other cars and do the lane change bit late

but now it risks missing the left turn

we do thousands of such searches in a very short time span

because these are all physics-based models these futures are very easy to simulate

and in the end we have a set of candidates and we finally choose one based on the optimality conditions of safety, comfort, and easily making the turn
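
Choosing among the simulated candidates can be sketched as a weighted cost comparison. The field names, penalty terms, thresholds and weights below are hypothetical, picked only to mirror the three maneuvers described above; the talk does not give the actual cost terms.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    min_gap_m: float        # closest approach to any other car
    peak_decel_mps2: float  # hardest braking along the maneuver
    turn_margin_m: float    # road left before the turn once the lane changes finish


def total_cost(c: Candidate) -> float:
    # Hypothetical penalty terms and weights, chosen only for illustration.
    safety = max(0.0, 5.0 - c.min_gap_m)            # penalize gaps under 5 m
    comfort = c.peak_decel_mps2                     # penalize harsh braking
    missed_turn = max(0.0, 30.0 - c.turn_margin_m)  # penalize finishing too late
    return 10.0 * safety + 1.0 * comfort + 1.0 * missed_turn


candidates = [
    Candidate("early lane change, harsh braking", 6.0, 4.5, 80.0),
    Candidate("speed up, late lane change", 6.0, 1.0, 5.0),
    Candidate("moderate braking, timely lane change", 5.5, 1.5, 50.0),
]
best = min(candidates, key=total_cost)
print(best.name)
```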

so now the car has chosen this path

and you can see that as the car executes this trajectory

it matches what we had planned

the cyan plot on the right side is the actual velocity of the car

and the white line underneath it was the plan

so we are able to plan for 10 seconds here and able to match that when you look back in hindsight

so this is a well-made plan

〇 Planning for objects other than ourselves

f:id:stockbh:20211110173010p:plain

when driving alongside other agents it's important to not just plan for ourselves but instead we have to plan for everyone jointly

and optimize for the overall scene's traffic flow

in order to do this what we do is we literally run the autopilot planner on every single relevant object in the scene

(Run the Autopilot planner on every relevant object)

〇 A case of yielding to one another

f:id:stockbh:20211110201451p:plain

here's an example of why that's necessary

this is a narrow corridor i'll let you watch the video for a second

that was autopilot driving in a narrow corridor, going around parked cars, cones and poles

here there's a 3D view of the same thing

the oncoming car arrives now and autopilot slows down a little bit

but then realizes that we cannot yield to them because we don't have any space to our side but the other car can yield to us instead

so instead of just blindly braking here

autopilot reasons that they can pull over and should yield to us because we cannot yield to them

and assertively makes progress

a second oncoming car arrives now this vehicle has higher velocity

we literally run the autopilot planner for the other object

so in this case we run the planner for them; that object's plan

now goes around their parked cars

and then after they pass the parked cars goes back to the right side of the road for them

since we don't know what's in the mind of the driver

we actually have multiple possible futures for this car

one future is shown in red the other one is shown in green

and the green one is a plan that yields to us

but since this object's velocity and acceleration are pretty high

we don't think that this person is going to yield to us

and they are actually going to go around these parked cars

so autopilot decides that okay i have space here

this person's definitely gonna come so i'm gonna pull over

so as autopilot is pulling over we notice that

the car has chosen to yield to us

based on their yaw rate and their acceleration

and autopilot immediately changes its mind

and continues to make progress

this is why we need to plan for everyone

because otherwise we wouldn't know that this person is going to go around the other parked cars

and come back to their side

if we didn't do this autopilot would be too timid

and would not be a practical self-driving car

〇 Minimizing the loss function

f:id:stockbh:20211110201801p:plain

f:id:stockbh:20211110202138p:plain

so now we saw how the search and planning for other people set up a convex valley

finally we do a continuous optimization to produce the final trajectory

that the planning needs to take

the gray area is the convex corridor

and we initialize the spline in heading and acceleration

parameterized over the arc length of the plan

and you can see that the continuous optimization continuously makes fine-grained changes to reduce all of its costs

some of the costs are distance from obstacles, traversal time, and comfort

for comfort you can see that the lateral acceleration plots on the right have nice trapezoidal shapes

on the right side the green plot that's a nice trapezoidal shape

and if you record a human trajectory

this is pretty much what it looks like

the lateral jerk (the rate of change of acceleration per unit time) is also minimized
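
Why a trapezoidal acceleration profile goes hand in hand with low jerk can be checked numerically. This is a minimal sketch with made-up numbers (the sample period, the peak acceleration and the ramp durations are assumptions): jerk is the finite-difference derivative of acceleration, so ramping up over one second bounds it, while an abrupt step does not.

```python
def finite_difference(values, dt):
    # Rate of change between consecutive samples.
    return [(b - a) / dt for a, b in zip(values, values[1:])]


dt = 0.1    # seconds between samples (assumed)
peak = 2.0  # peak lateral acceleration in m/s^2 (assumed)

# Trapezoid: ramp up over 1 s, hold for 1 s, ramp down over 1 s.
ramp = [peak * i / 10 for i in range(11)]
trapezoid = ramp + [peak] * 10 + ramp[-2::-1]

# Rectangle: jump to the peak within one sample, hold, jump back.
rectangle = [0.0] + [peak] * 29 + [0.0]

peak_jerk_trapezoid = max(abs(j) for j in finite_difference(trapezoid, dt))
peak_jerk_rectangle = max(abs(j) for j in finite_difference(rectangle, dt))
print(f"peak jerk, trapezoid: {peak_jerk_trapezoid:.1f} m/s^3")
print(f"peak jerk, rectangle: {peak_jerk_rectangle:.1f} m/s^3")
```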

so in summary we do a search for both us and everyone else in the scene

we set up a convex corridor and then optimize for a smooth path

together these can do some really neat things like shown above

〇 Complex cases

f:id:stockbh:20211110202609p:plain

but driving looks a bit different in other places like where i grew up

it's very much more unstructured: cars and pedestrians cutting each other off, harsh braking, honking; it's a crazy world

we can try to scale up these methods but it's going to be really difficult to efficiently solve this at runtime (that is, at every moment while driving)

instead what we want to do is use learning-based methods

and i want to show why this is true

so we're going to go from this complicated problem to a much simpler toy parking problem

that still illustrates the core of the issue

here this is a parking lot, the ego car is in blue and needs to park in the green parking spot here

so it needs to go around the curbs the parked cars and the cones shown in orange here

there's a simple baseline it's A-star

A-star is the standard algorithm that uses a lattice-space search

and in this case the heuristic here is the Euclidean distance to the goal

〇 Brute force

f:id:stockbh:20211110203213p:plain

you can see that it directly shoots towards the goal but very quickly gets trapped in a local minimum and it backtracks from there

and then searches a different path to try to go around this parked car

eventually it makes progress and gets to the goal but it ends up using 400,000 nodes for making this

obviously this is a terrible heuristic

we want to do better than this

〇 Brute force + guidance

f:id:stockbh:20211110204600p:plain

so if you add a navigation route to it and have the car follow the navigation route

while being close to the goal this is what happens

the navigation route helps immediately

but still when it encounters cones or other obstacles

it basically does the same thing as before

backtracks and then searches a whole new path

this poor search has no idea that these obstacles exist

it literally has to go there and has to check if it's in collision

and if it's in collision then back up

the navigation heuristic helped but still took 22,000 nodes

we can design more of these heuristics to help the search go faster

but it's really tedious and hard to design a globally optimal heuristic

even if you had a distance function from the cones that guided the search

this would only be effective for the single cone

〇 Monte Carlo tree search

f:id:stockbh:20211110205115p:plain

what we need is a global value function

so instead what we want to do is use neural networks to give this heuristic for us

the vision networks produce the vector space and we have cars moving around in the vector space

this looks like an Atari game and it's a multiplayer version

so we can use techniques such as AlphaZero etc. that were used to solve Go and Atari games to solve the same problem

so we're working on neural networks that can produce state and action distributions

that can then be plugged into Monte Carlo tree search with various cost functions

some of the cost functions can be explicit cost functions

like distance, collisions, comfort, traversal time etc

but they can also be interventions from the actual manual driving events

we train such a network for this simple parking problem

so here again same problem

〇 Order-of-magnitude improvement

f:id:stockbh:20211110205607p:plain

let's see how MCTS (Monte Carlo tree search) does

so here you notice that the plan is basically able to make progress towards the goal in one shot

note that this is not even using a navigation heuristic just given the scene

the plan is able to go directly towards the goal

all the other options you're seeing are possible options

it does not choose any of them just using the option that directly takes it towards the goal

the reason is that the neural network is able to absorb the global context of the scene

and then produce a value function that effectively guides it towards the global minimum

as opposed to getting stuck in any local minima

so this only takes 288 nodes

(400,000 → 20,000 → 300)

and several orders of magnitude less than what was done in the A-star with the Euclidean distance heuristic
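
The effect of the heuristic on the number of expanded nodes can be reproduced on a toy grid. The sketch below is not the talk's planner and it is not Monte Carlo tree search: it runs ordinary A-star twice on a small made-up map, once with the Euclidean distance heuristic and once with an exact cost-to-go table computed by breadth-first search. That table is only a stand-in for what a learned global value function would try to approximate; the map, the unit step cost and all function names are assumptions.

```python
import heapq
import math
from collections import deque

# Toy map (an assumption): '#' is an obstacle, S the start, G the goal.
GRID = [
    "..........",
    ".#######..",
    ".......#..",
    ".S.....#G.",
    ".......#..",
    ".......#..",
    ".#######..",
    "..........",
]


def parse(grid):
    free, start, goal = set(), None, None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch != "#":
                free.add((r, c))
            if ch == "S":
                start = (r, c)
            elif ch == "G":
                goal = (r, c)
    return free, start, goal


def neighbors(node, free):
    r, c = node
    for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
        if nxt in free:
            yield nxt


def cost_to_go(free, goal):
    # Exact remaining path length to the goal, via breadth-first search.
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        node = queue.popleft()
        for nxt in neighbors(node, free):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def a_star(free, start, goal, heuristic):
    # Returns (path length, number of expanded nodes).
    h0 = heuristic(start)
    open_heap = [(h0, h0, 0, start)]
    best_g = {start: 0}
    closed = set()
    while open_heap:
        _, _, g, node = heapq.heappop(open_heap)
        if node in closed:
            continue
        closed.add(node)
        if node == goal:
            return g, len(closed)
        for nxt in neighbors(node, free):
            if g + 1 < best_g.get(nxt, math.inf):
                best_g[nxt] = g + 1
                h = heuristic(nxt)
                heapq.heappush(open_heap, (g + 1 + h, h, g + 1, nxt))
    return None, len(closed)


free, start, goal = parse(GRID)
table = cost_to_go(free, goal)
len_euclid, nodes_euclid = a_star(free, start, goal, lambda n: math.dist(n, goal))
len_value, nodes_value = a_star(free, start, goal, lambda n: table[n])
print(f"Euclidean heuristic:  length {len_euclid}, expanded {nodes_euclid}")
print(f"cost-to-go heuristic: length {len_value}, expanded {nodes_value}")
```

Both runs return the same path length; only the amount of exploration differs, because the Euclidean heuristic keeps pulling the search into the dead end facing the goal.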

〇 Planner design

f:id:stockbh:20211110211418p:plain

this is what a final architecture is going to look like

the vision system is going to crush down the dense video data into a vector space

it's going to be consumed by both an explicit planner and a neural network planner

in addition to this

the neural network planner can also consume intermediate features of the network

(explicit planner, NN planner, intermediate features)

together this produces a trajectory distribution

and it can be optimized end to end both with explicit cost functions and human intervention and other data

this then goes into an explicit planning function

that does whatever is easy for that and produces the final steering and acceleration commands for the car

with that we need to now explain how we train these networks

and for training these networks we need large data sets

〇 Andrej takes the stage again

the story of data sets is critical

so far we've talked only about neural networks but neural networks only establish an upper bound on your performance

many of these neural networks have hundreds of millions of parameters and these hundreds of millions of parameters they have to be set correctly

if you have a bad setting of parameters it's not going to work

so neural networks are just an upper bound

you also need massive data sets to actually train the correct algorithms inside them

now in particular I mentioned we want data sets directly in the vector space

and so the question becomes how can you accumulate

because our networks have hundreds of millions of parameters

how do you accumulate millions and millions of vector space examples

that are clean and diverse to train these neural networks effectively

so there's a story of data sets and how they've evolved

on the side of all of the models and developments that we've achieved

when i joined roughly four years ago we were working with a third party to obtain a lot of our data sets

unfortunately we found quickly that working with a third party to get data sets for something this critical was just not going to cut it

the latency of working with a third party was extremely high and honestly the quality was not amazing and so in the spirit of full vertical integration at tesla

we brought all of the labeling in-house and

over time we've grown a data labeling org of more than one thousand people

that is full of professional labelers who are working very closely with the engineers

so actually they're here in the us and co-located with the engineers here in the area as well

and so we work very closely with them and we also build all of the infrastructure ourselves for them from scratch

so we have a team we are going to meet later today that develops and maintains all of this infrastructure for data labeling

for example i'm showing some of the screenshots of some of the latency throughput and quality statistics that we maintain about all of the labeling workflows

and the individual people involved and all the tasks and how the numbers of labels are growing over time

we found this to be quite critical and we're very proud of this

〇 Labeling a few years ago

f:id:stockbh:20211110212600p:plain

in the beginning roughly three or four years ago most of our labeling was in image space (2D labeling)

and it takes quite some time to annotate an image like this

and this is what it looked like where we are drawing polygons and polylines

on top of these single individual images

as we need millions of vector space labels

this method is not going to cut it

〇 4D labeling

f:id:stockbh:20211110212704p:plain

quickly we graduated to three-dimensional or four-dimensional labeling

where we directly label in vector space not in individual images

so here is a clip and you see a very small reconstruction of the ground plane on which the car drove

and a little bit of the point cloud here that was reconstructed

and what you're seeing here is that the labeler is changing the labels directly in vector space

and then we are reprojecting those changes into camera images

(The labels are only projected into the camera images; the labeling itself is done in vector space.)

so we're labeling directly in vector space and this gave us a massive increase in throughput

because you label once in 3D and then you get to reproject

but even this was actually not going to cut it

because people and computers have different pros and cons

so people are extremely good at things like semantics but computers are very good at geometry, reconstruction, triangulation, tracking

and for us it's much more becoming a story of how do humans and computers collaborate to actually create these vector space data sets

and so we're going to now talk about auto labeling which is the infrastructure we've developed for labeling these clips at scale

〇 Ashok, who is from India, takes the stage again

This is how we would like to label.

f:id:stockbh:20211110214156p:plain

even though we have lots of human labelers

the amount of training data needed for training the network significantly outnumbers them

we invested in a massive auto labeling pipeline

here's an example of how we label a single clip

a clip is an entity that has dense sensor data

like videos, IMU data, GPS, odometry etc

this can be 45 seconds to a minute long

these can be uploaded by our own engineering cars or from customer cars

we collect these clips and then send them to our servers

where we run a lot of neural networks offline to produce intermediate results

like segmentation masks, depth, point matching etc

this then goes to a lot of robotics and AI algorithms to produce a final set of labels

that can be used to train the networks

〇 Using NeRF

f:id:stockbh:20211110221238p:plain

one of the first tasks we want to label is the road surface

typically we can use splines or meshes to represent the road surface

but because of the topology restrictions

those are not differentiable and not amenable to producing this

so what we do instead from last year is in the style of neural radiance fields work (NeRF: a technique for representing 3D space with a neural network)

we use an implicit representation to represent the road surface

here we are querying xy points on the ground

and asking for the network to predict the height of the ground surface

along with various semantics (the meaning of each part) such as curbs, lane boundaries, road surface, drivable space etc

given a single xy we get a z; together these make a 3D point

and they can be re-projected into all the camera views

so we make millions of such queries and get lots of points

these points are re-projected into all the camera views

on the top right here, we are showing one such camera image with all these points re-projected

now we can compare this re-projected point with the image space prediction of the segmentations

and jointly optimize this over all the camera views across space and time

which produces an excellent reconstruction
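
The query-and-reproject step can be sketched as follows. The height function below is a hand-written stand-in for the implicit network, and the camera model (an ideal pinhole camera 1.5 m above the origin looking along +x, with made-up intrinsics) is likewise an assumption; the comparison against image-space segmentation and the optimization itself are omitted.

```python
def road_height(x: float, y: float) -> float:
    # Stand-in for the implicit network (assumption): a gentle uphill slope
    # with a crowned cross-section. A real model would be learned.
    return 0.02 * x - 0.001 * y * y


def project(point, cam_height=1.5, focal_px=1000.0, cx=640.0, cy=480.0):
    # Ideal pinhole camera at (0, 0, cam_height) looking along +x, with y to
    # the left and z up. Returns pixel (u, v), or None if the point is behind.
    x, y, z = point
    if x <= 0.0:
        return None
    u = cx - focal_px * y / x
    v = cy - focal_px * (z - cam_height) / x
    return u, v


# Query a grid of ground points, lift each one to 3D, and reproject it.
pixels = []
for x in range(5, 30, 5):
    for y in (-2, 0, 2):
        point = (float(x), float(y), road_height(x, y))
        pixels.append(project(point))

u, v = project((10.0, 0.0, road_height(10.0, 0.0)))
print(f"ground point 10 m ahead lands at pixel ({u:.1f}, {v:.1f})")
```

In the pipeline described above, each reprojected point would then be compared with the segmentation predicted at that pixel, and the mismatch would drive the optimization of the surface.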

〇 Projecting the reconstructed road in vector space

f:id:stockbh:20211110221622p:plain

here's an example of what that looks like

so here this is an optimized road surface that is reprojected into the eight cameras that the car has

and across all of time

and you can see how it's consistent across both space and time

〇 Generating vector space while driving

f:id:stockbh:20211110221800p:plain

so a single car driving through some location can sweep out some patch around the trajectory using this technique

but we don't have to stop there

so here we collected different clips from different cars at the same location

and each of them sweeps out some part of the road

〇 Combining the generated vector spaces with one another

f:id:stockbh:20211110221903p:plain

now we can bring them all together into a single giant optimization

so here these 16 different trips are aligned

using various features such as road edges, lane lines

all of them should agree with each other

and also agree with all of their image space observations

together this produces an effective way to label the road surface

not just where the car drove but also in other locations that it hasn't driven

the point of this is not to build HD-maps or anything like that

it's only to label the clips through these intersections

so we don't have to maintain them forever

as long as the labels are consistent with the videos that they were collected from

then humans can come on top of this

clean up any noise or add additional metadata to make it even richer

〇 Point cloud

f:id:stockbh:20211110222128p:plain

we don't have to stop at just the road surface

we can also arbitrarily reconstruct 3D static obstacles

here this is a reconstructed 3D point cloud from our cameras

the main innovation here is the density of the point cloud

typically these points require texture to form associations from one frame to the next frame

but here we are able to produce these points even on textureless surfaces

like the road surface or walls

and this is really useful to annotate arbitrary obstacles

that we can see on the scene in the world

〇 Advantage 1: hindsight

f:id:stockbh:20211110222355p:plain

one more cool advantage of doing all of this on the servers offline is that

we have the benefit of hindsight

this is a super useful hack

because, say, in the car, when the network needs to produce the velocity

it just has to use the historical information and guess what the velocity is

but here we can look at both the history and the future

we can cheat and get the correct answer of the kinematics like velocity acceleration etc
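
The benefit of seeing the future can be shown with a tiny numerical sketch. The motion model (constant acceleration from rest) and the sample period are assumptions: an online estimate can only difference past samples, while an offline estimate can use a centered difference that also looks one sample ahead.

```python
def position(t: float) -> float:
    # Assumed ground-truth motion: constant acceleration of 2 m/s^2 from rest.
    return t * t


dt = 0.5  # seconds between samples (assumed)
t = 2.0
true_velocity = 2.0 * t

# Online (in the car): only past samples are available.
causal = (position(t) - position(t - dt)) / dt

# Offline (on the servers): the future sample is available too.
hindsight = (position(t + dt) - position(t - dt)) / (2 * dt)

print(f"true {true_velocity:.2f}, causal {causal:.2f}, hindsight {hindsight:.2f}")
```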

〇 Advantage 2: persistence

f:id:stockbh:20211110222630p:plain

one more advantage is that we have different tracks

but we can stitch them together even through occlusions

because we know the future

we have future tracks

we can match them and then associate them

here you can see the pedestrians on the other side of the road are persisted

even through multiple occlusions by these cars

this is really important for the planner

because the planner needs to know if it saw someone it still needs to account for them even when they are occluded
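
One simple way to associate tracks across an occlusion, when the later track is already known, is to extrapolate the earlier one and match it to the nearest track that starts afterwards. This is a hypothetical sketch, not the pipeline's actual association logic: the constant-velocity model, the 1 m gate and the track names are assumptions, and the greedy matching does not stop two old tracks from claiming the same new one.

```python
import math


def stitch(ended, started, gate_m=1.0):
    # Match each track that ended (e.g. at an occlusion) to the track that
    # later starts closest to its constant-velocity prediction.
    matches = {}
    for end_id, pts in ended.items():
        (t0, x0, y0), (t1, x1, y1) = pts[-2], pts[-1]
        vx, vy = (x1 - x0) / (t1 - t0), (y1 - y0) / (t1 - t0)
        best = None
        for start_id, new_pts in started.items():
            ts, xs, ys = new_pts[0]
            if ts <= t1:
                continue  # the new track must begin after the old one ended
            px, py = x1 + vx * (ts - t1), y1 + vy * (ts - t1)
            d = math.hypot(xs - px, ys - py)
            if d <= gate_m and (best is None or d < best[0]):
                best = (d, start_id)
        if best is not None:
            matches[end_id] = best[1]
    return matches


# A pedestrian walks at 1 m/s, disappears behind a car after t = 1 s,
# and a "new" track appears at t = 4 s. Samples are (t, x, y).
ended = {"ped_a": [(0.0, 0.0, 5.0), (1.0, 1.0, 5.0)]}
started = {
    "track_7": [(4.0, 4.1, 5.0), (5.0, 5.1, 5.0)],
    "track_8": [(4.0, 10.0, 2.0), (5.0, 10.0, 2.5)],
}
matches = stitch(ended, started)
print(matches)
```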

so this is a massive advantage

〇 Successfully generating the vector space

f:id:stockbh:20211110222805p:plain

combining everything together

we can produce these amazing data sets

that annotate all of the road texture all the static objects and all the moving objects even through occlusions

producing excellent kinematic labels; you can see how the cars turn smoothly

producing really smooth labels; all the pedestrians are consistently tracked

the parked cars obviously have zero velocity so we can know that cars are parked

so this is huge for us

this is one more example of the same thing you can see how everything is consistent

we want to produce a million such labeled clips and train our multi-cam video networks with such a large data set

and want to crush this problem

we want to get the same view that's consistent that you're seeing in the car

〇 Dropped detections in rare cases

f:id:stockbh:20211110223320p:plain

we started our first exploration of this with the Remove The Radar project

we removed it in a short time span like within three months

in the early days of the network

we noticed for example in low-visibility conditions the network can suffer understandably

because obviously this truck just dumped a bunch of snow on us and it's really hard to see

but we should still remember that this car was in front of us

but our networks early on did not do this because of the lack of data in such conditions

〇 Collecting data from the fleet

f:id:stockbh:20211110223441p:plain

f:id:stockbh:20211110223552p:plain

so what we did was that we asked the fleet to produce lots of similar clips

and the fleet responded

it produced lots of video clips where shit's falling out of other vehicles

and we sent these through the auto labeling pipeline

that was able to label 10k clips within a week

this would have taken several months with humans labeling

so we did this for 200 different conditions

and we were able to very quickly create large data sets

and that's how we were able to remove radar

〇 No more drops. No radar needed either.

f:id:stockbh:20211110223911p:plain

so once we train the networks with this data

you can see that it's totally working and keeps the memory that this object was there

〇 From vector space to simulation

f:id:stockbh:20211110224132p:plain

f:id:stockbh:20211110224235p:plain

finally we wanted to get a Cybertruck into a data set for Remove The Radar

can you all guess where we got this clip from

it's rendered it's our simulation

it was hard for me to tell initially and it looks very pretty

in addition to auto labeling

we also invest heavily in using simulation for labeling our data

(The relationship between simulation and auto labeling)

so this is the same scene as seen before but from a different camera angle

so a few things that i wanted to point out

for example the ground surface: it's not a plain asphalt, there are lots of cracks and tar seams, there's some patchwork done on top of it

vehicles move realistically

the truck is articulated even goes over the curb and makes a wide turn

the other cars behave smartly they avoid collisions they go around cars

and also brake and accelerate smoothly

Autopilot is driving the car with the logo on the top and it's making an unprotected left turn

〇 In simulation everything is perfectly labeled

f:id:stockbh:20211110224416p:plain

since it's a simulation, it starts from the vector space so it has perfect labels

here we show a few of the labels that we produce

these are vehicle cuboids with kinematics

depth, surface normals, segmentation, but

Andrej Karpathy may name a new task that he wants next week

and we can very quickly produce it

because we already have the vector space and we can write the code to produce these labels quickly

〇 Cases where simulation helps

f:id:stockbh:20211110224548p:plain

so when does simulation help

When the data is hard to obtain

number one, it helps when the data is difficult to source; as large as our fleet is

(even with as many Autopilot-equipped vehicles as Tesla has)

it can be hard to get some crazy scenes like this couple

running with their dog on the highway while there are other high-speed cars around

this is a rare scene but still can happen

and autopilot still needs to handle it

When labeling requires an enormous amount of work

it helps when data is difficult to label

there are hundreds of pedestrians crossing the road

this could be a Manhattan downtown with people crossing the road

it's going to take several hours for humans to label this clip

and even for automatic labeling algorithms

this is really hard to get the association right

and it can produce bad velocities

but in simulation this is trivial

because you already have the objects

you just have to spit out the cuboids and the velocities

When we want to introduce correct behavior in closed loop

finally it helps when we introduce closed loop behavior

where the car needs to be

in a determining situation, or where the data depends on the actions

this is the only way to get it reliably

all this is great

f:id:stockbh:20211110224806p:plain

f:id:stockbh:20211110225034p:plain

what's needed to make this happen

number one, accurate sensor simulation

again, the point of the simulation is not to produce pretty pictures

it needs to produce what the camera in the car would see

and what other sensors would see

here we are stepping through different exposure settings of the real camera on the left side

and the simulation on the right side

we're able to match what the real cameras do

in order to do this we had to model a lot of the properties of the camera

in our sensor simulation, starting from sensor noise, motion blur, optical distortions, even headlight transmissions, even like diffraction patterns of the windshield etc

we don't use this just for the autopilot software

we also use it to make hardware decisions such as

lens design

camera design

sensor placement or even headlight transmission properties

f:id:stockbh:20211110225857p:plain

second we need to render the visuals in a realistic manner

you cannot have what the game industry calls jaggies

these are aliasing artifacts that are a dead giveaway that

this is simulation; we don't want them

so we go through a lot of pains to produce nice spatio-temporal anti-aliasing

we also are working on neural rendering techniques to make this even more realistic

in addition we also used ray tracing to produce realistic lighting and global illumination

f:id:stockbh:20211110230345p:plain

we obviously need more than four or five cars

because the network will easily overfit

because it knows the sizes

so we need to have realistic assets like the moose on the road

we have thousands of assets in our library

and they can wear different shirts and actually can move realistically

we also have a lot of different locations mapped and created environments

we actually have 2000 miles of road built and this is almost the length of the roadway from the east coast to the west coast of the united states

in addition we have built efficient tooling to build several miles more in a single day by a single artist

but this is just the tip of the iceberg

f:id:stockbh:20211110230543p:plain

actually as opposed to artists making these simulation scenarios

most of the data that we use to train is created procedurally using algorithms

these are all procedurally created roads with lots of parameters

such as curvature various trees cones poles cars with different velocities

and their interactions produce an endless stream of data for the network
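
Procedural generation of this kind boils down to sampling scenario parameters from a seeded random source. The sketch below is purely illustrative: the `Scenario` fields and every parameter range are assumptions, loosely following the parameters listed above (curvature, trees, cones, cars with different velocities).

```python
import random
from dataclasses import dataclass, field


@dataclass
class Scenario:
    curvature_per_m: float  # signed road curvature
    n_trees: int
    n_cones: int
    car_speeds_mps: list = field(default_factory=list)


def generate(seed: int) -> Scenario:
    # All ranges are hypothetical; a seed makes each scenario reproducible.
    rng = random.Random(seed)
    return Scenario(
        curvature_per_m=rng.uniform(-0.02, 0.02),
        n_trees=rng.randint(0, 50),
        n_cones=rng.randint(0, 20),
        car_speeds_mps=[rng.uniform(0.0, 35.0) for _ in range(rng.randint(1, 8))],
    )


# An endless stream is just a loop over seeds; here we take the first five.
scenarios = [generate(seed) for seed in range(5)]
for s in scenarios:
    print(s)
```

Seeding each scenario individually means any one of them can be regenerated later, for example when the network turns out to fail on it.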

but a lot of this data can be boring because the network may already get it correct

what we do is we also use ML-based techniques to probe

the network to see where it's failing and to create more data around the failure points of the network

we try to make the network performance better in closed loop

f:id:stockbh:20211110230923p:plain

so in simulation, we want to recreate any failures that happen to the autopilot

on the left side you're seeing a real clip that was collected from a car

it then goes through our auto labeling pipeline to produce a 3D reconstruction of the scene

along with all the moving objects combined with the original visual information

we recreate the same scene synthetically and create a simulation scenario entirely out of it

and then when we replay autopilot on it

autopilot can do entirely new things and

we can form new worlds new outcomes from the original failure

this is amazing because we don't want autopilot to fail in actual fleet

when it fails we want to capture it and keep it to that bar

〇 Improving rendering with machine learning

f:id:stockbh:20211110231225p:plain

we can also use neural rendering techniques to make it look even more realistic

we take the original video clip

we create a synthetic simulation from it and then apply neural rendering techniques on top of it

this one is very realistic and looks like it was captured by the actual cameras

i'm very excited for what simulation can achieve

but this is not all because the networks in the car were already trained with simulation data

we used 300 million images with almost half a billion labels

and we want to crush down all the tasks that are going to come up for the next several months

with that I invite Milan to explain how we scale these operations and really build a label factory and spit out millions of labels

f:id:stockbh:20211110231451p:plain