Weekly Report (SUN YUYA)

A long-term object tracker must be able to re-detect targets after they are lost. I therefore want to predict the locations where a target might reappear based on its historical motion trajectory. Trajectory prediction requires camera motion, but current object tracking methods cannot provide camera information such as camera pose or motion.

In order to obtain depth maps and camera poses, I am reading papers on SLAM with a monocular camera, including unsupervised learning approaches.

1. Future Person Localization in First-Person Videos

Purpose: Predicting future locations of people observed in first-person videos.

Key points: a) ego-motion; b) scale of the target person; c) KCF for tracking; d) feature concatenation (a minimal sketch follows below).

Evaluation: Excellent introductory work, but how can the ego-motion information be obtained?
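
Below is a minimal sketch of the feature-concatenation idea, assuming a toy input format (past (x, y) locations, person scale, and camera ego-motion per frame). It is only an illustration of concatenating the cues before a prediction head, not the paper's actual convolutional architecture:

```python
import torch
import torch.nn as nn

class FutureLocalizer(nn.Module):
    """Toy sketch: concatenate past locations, scales, and ego-motion,
    then regress the target's future (x, y) positions."""
    def __init__(self, t_in=10, t_out=10, hidden=64):
        super().__init__()
        self.t_out = t_out
        # each past frame contributes (x, y, scale, ego_dx, ego_dy) -> 5 values
        self.encoder = nn.Sequential(
            nn.Linear(t_in * 5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.decoder = nn.Linear(hidden, t_out * 2)  # future (x, y) per frame

    def forward(self, locations, scales, ego_motion):
        # locations: (B, T_in, 2), scales: (B, T_in, 1), ego_motion: (B, T_in, 2)
        feats = torch.cat([locations, scales, ego_motion], dim=-1)  # (B, T_in, 5)
        h = self.encoder(feats.flatten(1))                          # (B, hidden)
        return self.decoder(h).view(-1, self.t_out, 2)              # (B, T_out, 2)
```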

2. Unsupervised Learning of Depth and Ego-Motion from Video

Purpose: Presenting an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences.

Key points: a) view synthesis as the supervision signal; b) unsupervised learning (a minimal sketch of the photometric loss follows below).

Evaluation: Nice paper, but it still needs known camera intrinsics.
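
Below is a minimal sketch of the view-synthesis supervision: the source frame is inverse-warped into the target view using the predicted depth and relative pose (intrinsics K assumed known), and the photometric difference is the loss. This is my own simplified sketch, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def inverse_warp(source, depth, pose, K):
    """Synthesize the target view from `source` (B,3,H,W) using the target
    depth (B,1,H,W), relative pose (B,4,4), and intrinsics K (B,3,3)."""
    B, _, H, W = source.shape
    dev, dt = source.device, source.dtype
    ys, xs = torch.meshgrid(torch.arange(H, device=dev, dtype=dt),
                            torch.arange(W, device=dev, dtype=dt), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).expand(B, -1, -1)
    # back-project to the target camera frame, then transform into the source frame
    cam = torch.inverse(K) @ pix * depth.view(B, 1, -1)
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=dev, dtype=dt)], dim=1)
    src = (pose @ cam)[:, :3]
    # project into the source image plane and sample
    proj = K @ src
    px = proj[:, 0] / proj[:, 2].clamp(min=1e-6)
    py = proj[:, 1] / proj[:, 2].clamp(min=1e-6)
    grid = torch.stack([2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(source, grid, padding_mode="zeros", align_corners=True)

def photometric_loss(target, source, depth, pose, K):
    # L1 difference between the target frame and the synthesized view
    return (target - inverse_warp(source, depth, pose, K)).abs().mean()
```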

3. Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras

Purpose: Presenting a novel method for simultaneously learning depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as a supervision signal.

Key point: a) generating camera intrinsics from the video itself (a minimal sketch follows below).

Evaluation: Nice paper, and code is provided, but it may be too slow.
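
A minimal sketch of how intrinsics might be predicted by a small network head; the variable names and the softplus/sigmoid parameterization are my assumptions, not the paper's exact code:

```python
import torch
import torch.nn as nn

class IntrinsicsHead(nn.Module):
    """Predict a pinhole intrinsics matrix K from a global feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 4)  # fx, fy, cx, cy (normalized)

    def forward(self, feat, width, height):
        fx, fy, cx, cy = self.fc(feat).unbind(dim=-1)
        fx = nn.functional.softplus(fx) * width    # keep focal lengths positive
        fy = nn.functional.softplus(fy) * height
        cx = torch.sigmoid(cx) * width             # principal point inside the image
        cy = torch.sigmoid(cy) * height
        K = torch.zeros(feat.shape[0], 3, 3, device=feat.device, dtype=feat.dtype)
        K[:, 0, 0], K[:, 1, 1] = fx, fy
        K[:, 0, 2], K[:, 1, 2] = cx, cy
        K[:, 2, 2] = 1.0
        return K
```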

Weekly Report (QI ZIYANG)

SOT Summary

The work summary for last week and this week is as follows: I reviewed the progress and main contributions of single object tracking algorithms over the years (2010–2023), summarizing the respective advantages of several representative algorithms. In conjunction with single object tracking, I also explored the development history from 2D to 3D detection and tracking.

I learned how to use Inkscape to draw figures in .svg format and export results without distortion.

Weekly Report (QI Ziyang)

Center-based

The main contribution of CenterPoint lies in its adoption of a center-based detection head: a point representation has no intrinsic orientation, which significantly reduces the search space of the object detector. Traditional anchor-based methods fit well when vehicles drive along straight, axis-aligned roads, but they struggle to fit axis-aligned bounding boxes to rotated objects during maneuvers such as a left turn.

Method

CenterPoint represents, detects, and tracks 3D objects as points. It regresses 3D bounding boxes directly from features at the center point, without voting, using a single positive cell per object together with a keypoint estimation loss. The detector is two-stage: a LiDAR-based backbone and first-stage head identify object centers and their attributes, and a second stage refines all estimates.
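
As an illustration of the keypoint (heatmap) supervision, here is a minimal CenterNet-style sketch that splats a Gaussian peak at each object's BEV center on the target heatmap; it is a simplification, not the released CenterPoint code:

```python
import numpy as np

def draw_gaussian(heatmap, center, radius):
    """Splat a 2D Gaussian of the given radius at `center` (x, y) on `heatmap`,
    keeping the element-wise maximum so nearby objects do not erase each other."""
    diameter = 2 * radius + 1
    sigma = diameter / 6.0
    ys, xs = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    gaussian = np.exp(-(xs * xs + ys * ys) / (2 * sigma * sigma))

    x, y = int(center[0]), int(center[1])
    h, w = heatmap.shape
    left, right = min(x, radius), min(w - x, radius + 1)
    top, bottom = min(y, radius), min(h - y, radius + 1)
    region = heatmap[y - top:y + bottom, x - left:x + right]
    patch = gaussian[radius - top:radius + bottom, radius - left:radius + right]
    np.maximum(region, patch, out=region)  # in-place max onto the target heatmap
    return heatmap
```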

Input

3D object detection aims to predict three-dimensional rotated bounding boxes. CenterPoint regresses these boxes directly from features at the center point, employing a single positive cell per object and a keypoint estimation loss. The second stage then extracts sparse features at 5 surface center points of each predicted box from the intermediate feature map: the predicted object center plus the centers of the four outward-facing box faces.
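
A minimal sketch of where those 5 points come from, assuming a box parameterized as (x, y, z, w, l, h, yaw); the exact conventions in the released code may differ:

```python
import numpy as np

def box_surface_centers(x, y, z, w, l, h, yaw):
    """Return the 5 points whose BEV features the second stage samples:
    the object center and the centers of the 4 outward-facing side faces."""
    c, s = np.cos(yaw), np.sin(yaw)
    # half-extent offsets along the box's length and width axes, rotated by yaw
    front = np.array([ l / 2 * c,  l / 2 * s])
    left  = np.array([-w / 2 * s,  w / 2 * c])
    center = np.array([x, y])
    pts_bev = [center, center + front, center - front, center + left, center - left]
    return [np.array([px, py, z]) for px, py in pts_bev]
```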

Center-based Detection Head

Initially, six tasks are generated in the code, covering the 10 categories of the nuScenes dataset; four of the tasks contain two categories each. 'Car' and 'pedestrian' are regressed separately because their sizes differ considerably, which lets both reach higher accuracy more easily. 'Pedestrian' and 'traffic cone' are grouped in the same task because their sizes in the Bird's Eye View (BEV) are similar, which avoids generating targets of two different categories at the same position. Every task has its own head with the same structure, predicting offset, height, dimensions, rotation angle, velocity, and a center heatmap (a sample grouping is sketched below).
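
For reference, a task grouping consistent with the description above (six heads over the 10 nuScenes classes); the exact split should be checked against the released config:

```python
# Six detection tasks over the 10 nuScenes classes; similar-sized classes share a head.
tasks = [
    ["car"],
    ["truck", "construction_vehicle"],
    ["bus", "trailer"],
    ["barrier"],
    ["motorcycle", "bicycle"],
    ["pedestrian", "traffic_cone"],
]
```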

Loss

• Regression loss for dimensions, offset, height, and rotation: the loss is computed from the differences between the predicted boxes and the target boxes. The localization loss sums these component losses, and the total loss is a weighted sum of the heatmap loss and the localization loss (see the sketch below).
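
A minimal sketch of how the total loss could be assembled, using a CenterNet-style penalty-reduced focal loss for the heatmap and L1 losses for the regression targets; the regression weight is an assumption, not the released configuration:

```python
import torch
import torch.nn.functional as F

def gaussian_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss on the center heatmap (CenterNet-style)."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

def total_loss(pred, target, reg_weight=0.25):
    """pred/target: dicts holding the heatmap and per-attribute regression tensors."""
    heatmap_loss = gaussian_focal_loss(pred["heatmap"], target["heatmap"])
    loc_loss = sum(F.l1_loss(pred[k], target[k])
                   for k in ("offset", "height", "dim", "rot", "vel"))
    return heatmap_loss + reg_weight * loc_loss
```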

3D Object Tracking

Many 2D tracking algorithms can directly track 3D objects out of the box. However, dedicated 3D trackers based on 3D Kalman filters still have an advantage as they better exploit the three-dimensional motion in a scene.
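
For reference, a minimal constant-velocity Kalman predict/update sketch over 3D object centers; this is a generic illustration of such trackers, not CenterPoint's own tracking module:

```python
import numpy as np

class ConstVelKalman3D:
    """Constant-velocity Kalman filter over a 3D center: state = [x, y, z, vx, vy, vz]."""
    def __init__(self, center, dt=0.1):
        self.x = np.concatenate([center, np.zeros(3)])
        self.P = np.eye(6)
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)                      # position += velocity * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])    # only the center is observed
        self.Q = 0.01 * np.eye(6)                            # process noise (assumed)
        self.R = 0.1 * np.eye(3)                             # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]                                    # predicted center for matching

    def update(self, measured_center):
        y = measured_center - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```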

Weekly Report (SUN YUYA)

(1) I am still learning how to compute the trajectory of an object with a monocular camera. Traditional object tracking is essentially object detection in the image plane, ignoring depth and camera pose.

We can reformulate the task as monocular SLAM, but the camera intrinsics are unknown.

There are some unsupervised learning methods that address this, and I am reading these papers.

Stay Hungry, Stay Foolish!