
Weekly Report (SUN YUYA)

Continue reading papers about long-term tracking.

  1. Robust Long-Term Object Tracking via Improved Discriminative Model Prediction

The paper tries to modify SuperDiMP into a long-term tracker. It presents a global search method and the following components:

(1) Baseline tracker using random erasing.

Method: Erase random small rectangular areas of the image to confirm whether the prediction is reliable.

Evaluation: I hope it works.

(2) Global search using random searching.

Method: First, we create global searching templates with a predetermined interval. Next, we adaptively determine the number of searches according to the ratio of the image size to the target size. Then, an object is detected within a randomly selected searching area.

(3) Score penalty.

However, the probability of an object disappearing and suddenly reappearing at a distant location is very low. To prevent such sudden detections, the confidence score is penalized through spatio-temporal constraints; a rough sketch of (2) and (3) follows.
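This is only my reconstruction of the general idea; every name and the exact penalty form are assumptions, not the paper's equations. The sketch uses more random search windows when the target is small relative to the image, and discounts distant re-detections less the longer the target has been lost.

    import math
    import random

    def num_searches(img_w, img_h, tgt_w, tgt_h, scale=1.0):
        # Assumption: more random search windows when the target is small
        # relative to the image, following the described size ratio.
        return max(1, int(scale * (img_w * img_h) / (tgt_w * tgt_h)))

    def random_search_centers(img_w, img_h, n):
        # Randomly selected search areas over the whole image.
        return [(random.uniform(0, img_w), random.uniform(0, img_h))
                for _ in range(n)]

    def penalized_score(score, cand_xy, last_xy, frames_lost, lam=0.1):
        # Assumed spatio-temporal penalty: distant re-detections are
        # discounted, but less so the longer the target has been lost.
        dist = math.dist(cand_xy, last_xy)
        return score - lam * dist / (frames_lost + 1)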

Weekly Report (SUN YUYA)

The details of some long-term trackers.

1. SiamX: An Efficient Long-term Tracker Using Cross-level Feature Correlation and Adaptive Tracking Scheme.

The key is “ADAPTIVE TRACKING SCHEME”.

(1) Momentum Compensation.

Exploit the concept of “fast motion” to judge whether the target object is lost.

“If the target displacements between consecutive frames exceeds target sizes, it considers the target object is at a fast-moving state. To avoid targets leaving the search regions, the search center drifts in the direction of momentum:”
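As a rough reconstruction of this drift (my own sketch, not the authors' code; the gain k is a hypothetical parameter):

    def drifted_search_center(prev_center, displacement, target_size, k=1.0):
        # Fast-motion test from the quote: displacement exceeds target size.
        (dx, dy), (w, h) = displacement, target_size
        if abs(dx) > w or abs(dy) > h:
            # Shift the next search center along the motion direction.
            return (prev_center[0] + k * dx, prev_center[1] + k * dy)
        return prev_center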

Conclusion: Fake paper. Its code lacks the long-term tracker.

2. Combining complementary trackers for enhanced long-term visual object tracking.

Runs two trackers.

But we can use its scoring method for re-detection.

3. GUSOT: Green and Unsupervised Single Object Tracking for Long Video Sequences

    if s1(f∗, x1) > s1(f∗, x2) and s2(f∗, x1) ≤ s2(f∗, x2):
        re-detect
    else:
        continue

Key: motion residual, following “UHP-SOT”.

 

4. High-Performance Long-Term Tracking with Meta-Updater

(1) Appearance model (LSTM)

(2) Re-detection (the flag of DiMP?)

Conclusion: Another fake paper. The most important point is DiMP!

 

5. UHP-SOT: An Unsupervised High-Performance Single Object Tracker (2021)

Methods: It has three modules:

(1) Trajectory-based box prediction (principal component analysis; see the sketch after this list)

(2) Background motion modeling ( optical flow)

(3) Appearance model (normal tracker)
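For (1), a minimal sketch of what trajectory-based prediction with PCA could look like (my own NumPy reconstruction, not the UHP-SOT implementation): project the recent boxes onto their first principal component and extrapolate one step.

    import numpy as np

    def predict_next_box(boxes):
        # boxes: (N, 4) recent [x, y, w, h] rows, N >= 3 (hypothetical format).
        X = np.asarray(boxes, dtype=float)
        mean = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        coeff = (X - mean) @ Vt[0]                  # coordinates along PC1
        step = coeff[-1] + (coeff[-1] - coeff[-2])  # linear extrapolation
        return mean + step * Vt[0]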

 

6. Object Tracking Using Background Subtraction and Motion Estimation in MPEG Videos (2005)

Key: Using the four corners to compute the motion of the background (optical flow).
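The 2005 paper works on MPEG motion vectors; as a rough modern analogue (my assumption, using sparse Lucas-Kanade flow from OpenCV and assuming the four corners show background rather than the target):

    import cv2
    import numpy as np

    def background_motion(prev_gray, gray, margin=20):
        h, w = prev_gray.shape
        # Four corner points, assumed to belong to the background.
        pts = np.float32([[margin, margin], [w - margin, margin],
                          [margin, h - margin], [w - margin, h - margin]])
        pts = pts.reshape(-1, 1, 2)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        flows = (nxt - pts).reshape(-1, 2)[status.ravel() == 1]
        # Average corner displacement approximates the global camera motion.
        return flows.mean(axis=0) if len(flows) else np.zeros(2)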

7. Fast Object Tracking Using Adaptive Block Matching (2005)

Key: Exploiting a ‘mode filter’ to straighten up noisy motion vectors (optical flow) and thus suppress outliers.
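A minimal sketch of a mode filter over a block motion-vector field (my reading of the idea, not the paper's code): each vector component is replaced by the most frequent value in its neighbourhood, which suppresses isolated noisy vectors.

    import numpy as np
    from scipy import stats

    def mode_filter(vectors, k=3):
        # vectors: (H, W, 2) integer motion-vector field; k x k neighbourhood.
        H, W, _ = vectors.shape
        out = vectors.copy()
        r = k // 2
        for y in range(H):
            for x in range(W):
                patch = vectors[max(0, y - r):y + r + 1,
                                max(0, x - r):x + r + 1].reshape(-1, 2)
                out[y, x, 0] = stats.mode(patch[:, 0], keepdims=False).mode
                out[y, x, 1] = stats.mode(patch[:, 1], keepdims=False).mode
        return out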

 

 

Weekly Report (SUN YUYA)

Long-term object tracking requires the tracker to be able to retrieve lost targets. So I want to predict the possible locations where the target might appear based on the historical motion trajectory of the object. Trajectory prediction requires the camera motion, but current object tracking methods can't provide camera information, such as camera pose or motion.

In order to get depth maps and camera poses, I am reading papers about SLAM with a monocular camera, including unsupervised learning methods.

  1. Future Person Localization in First-Person Videos

Purpose: predicting future locations of people observed in first-person videos.

Key points: a) ego-motion, b) scales of the target person, c) KCF for tracking, d) feature concatenation.

Evaluation: Excellent introductory work. But how do we get ego-motion information?

2. Unsupervised Learning of Depth and Ego-Motion from Video

Purpose: Presenting an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences.

Key point: a) Visual synthesis, b) unsupervised learning.

Evaluation: Nice paper. But it still needs camera intrinsics.

3. Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras

Purpose: Presenting a novel method for simultaneously learning depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as a supervision signal.

Key point: a) Generating camera intrinsics.

Evaluation: Nice paper. Provides code. But it may be too slow.

Weekly Report (SUN YUYA)

(1) I am still learning how to compute the trajectory of an object with a monocular camera. The traditional object tracking task is just simple object detection, ignoring depth and camera pose.

We can translate the task into SLAM with a monocular camera, but we don't know the camera intrinsics.

There are some unsupervised learning methods, and I am reading these papers.

Weekly Report (SUN YUYA)

(1) Writing the paper for ICIAE2024.

(2) In the experiments on long-term tracking, I found that the similarity between templates and the current appearance is not reliable, because what we need is a function that can identify whether two images show the same object, rather than a similarity distance between the two images.

During tracking, the same object can have different appearances, so a similarity distance between 0 and 1 is not suitable for this judgment.

So we should look for another approach in the field of image classification.
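To make the distinction concrete, here is a minimal PyTorch sketch of the alternative I have in mind (all names are hypothetical): instead of thresholding a similarity distance, train a small verification head that answers "same object or not?" for a pair of embeddings.

    import torch
    import torch.nn as nn

    class PairVerifier(nn.Module):
        # Binary classifier over (template, current) embedding pairs,
        # instead of a fixed threshold on a 0-1 similarity distance.
        def __init__(self, dim=256):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(dim * 2, 128), nn.ReLU(),
                nn.Linear(128, 1))  # logit: same object?

        def forward(self, z_template, z_current):
            return self.head(torch.cat([z_template, z_current], dim=-1))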

Weekly Report (SUN YUYA)

Experiments on re-detection with confidence scores for long-term tracking.

The confidence score is the maximum score used to select the best location in each frame, and it can serve as the basis for judging whether the target is lost. We can use a single confidence score or a sequence of confidence scores to evaluate the tracking process.

MixFormer is an excellent short-term tracker and Unicorn is an excellent global tracker. Our purpose is to turn MixFormer into a long-term object tracker, so we add a re-detection mechanism: when it judges that tracking has failed, the running tracker switches to the global tracker to find the lost target object.

In these experiments, the re-detection mechanism exploits the confidence score to judge whether the target is lost.
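A minimal sketch of the two variants (my own, illustrative only; the window length and the 0.3 / 0.58 thresholds come from the experiments below):

    from collections import deque

    scores = deque(maxlen=20)  # sequence length; 20 worked best below

    def target_lost(conf, threshold=0.3):
        # Single score: one low confidence triggers re-detection.
        # Sequenced score: the mean over the last N frames must fall
        # below the threshold, which tolerates a single bad frame.
        scores.append(conf)
        return (len(scores) == scores.maxlen
                and sum(scores) / len(scores) < threshold)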

LaSOT is a benchmark for long-term object tracking, and the average confidence score of MixFormer on LaSOT is 0.58 when the IoU is zero.

The experiments are as follows:

| LaSOT | Success | Precision | Norm Precision |
| --- | --- | --- | --- |
| Mixformer-base | 0.711 | 0.757 | 0.740 |
| M_U_sc_30 | 0.713 | 0.761 | 0.743 |
| M_U_sc_58 | 0.711 | 0.758 | 0.742 |
| M_U_sc_sq_10_30 | 0.718 | 0.764 | 0.749 |
| M_U_sc_sq_20_30 | 0.719 | 0.766 | 0.749 |
| M_U_sc_sq_50_30 | 0.718 | 0.765 | 0.747 |
| M_U_sc_sq_100_30 | 0.715 | 0.761 | 0.745 |
| M_U_sc_sq_200_30 | 0.714 | 0.760 | 0.743 |
| M_U_sc_psq_10_30 | 0.718 | 0.764 | 0.748 |
| M_U_sc_psq_20_30 | 0.718 | 0.765 | 0.748 |
| M_U_sc_sq_10_58 | 0.715 | 0.762 | 0.745 |
| M_U_sc_sq_200_58 | 0.715 | 0.763 | 0.745 |

Here M_U_sc_30 is the tracker using a score threshold of 0.3, and M_U_sc_58 uses 0.58. M_U_sc_sq_10_30 means averaging ten confidence scores against a threshold of 0.3. psq uses a penalty.

The threshold 0.3 may be more effective than 0.58. The best sequence length may be 20. The penalty seems not effective. It's too cumbersome to conduct more experiments on hyperparameters.

In summary, exploiting the confidence score for re-detection works, but the gain is not obvious. So in the next step, we can conduct experiments on the similarity between templates and the tracked object.

 

Weekly Report (SUN YUYA)

We found a lot of re-detection methods for long-term object tracking.

1. ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-term Tracking

The re-detection method:

After obtaining the best candidate in each frame, our tracker treats the tracked object as present or absent based on its confidence score and then determines the search state (local search or global search) in the next frame.

Key: Training a deep network (a verifier) based on similarity.

2. SiamX: An Efficient Long-term Tracker Using Cross-level Feature Correlation and Adaptive Tracking Scheme

The re-detection method:

Exploiting a score threshold.

3. Combining complementary trackers for enhanced long-term visual object tracking

Running two trackers.

Training a deep network to decide when to re-detect.

 

 

4. GUSOT: Green and Unsupervised Single Object Tracking for Long Video Sequences

Background motion estimation to predict location.

Evaluating two bboxes.

 

5. High-Performance Long-Term Tracking with Meta-Updater

Input: score, response map, target object, template

Model: LSTM

Output: score

6. MFT: Long-Term Tracking of Every Pixel

Optical flow is very important.

 

7. Multi-Template Temporal Siamese Network for Long-Term Object Tracking

Predicting trajectories.

The re-detection method:

Input: distance error

Output: Reliability Score

8. Robust Long-Term Object Tracking via Improved Discriminative Model Prediction

We augment various backgrounds that are not included in the search area to train a model that is more robust to background clutter.

 

9. Target-Aware Tracking with Long-term Context Attention

The re-detection method:

P-mean score:
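As I read the snippet below, the windowed statistic it computes is

$$p_{\text{mean}} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=i}^{n}\frac{s_i}{j}, \qquad n = 10,$$

i.e., a mean of the last ten confidence scores $s_1,\dots,s_n$ (oldest first) in which earlier scores receive larger harmonic weights.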

    score_list.append(out["conf_score"])      # confidence score of this frame
    score_num += 1
    if score_num == 10:                       # evaluate a window of 10 frames
        n = score_num
        p_mean = 0.0
        for i in range(n):                    # position-weighted window mean
            for j in range(i, n):
                p_mean += score_list[i] / (j + 1)
        p_mean = p_mean / n
        score_num = 0
        score_list = []
        if p_mean <= 0.58:                    # and idx % 5 == 0:
            print("global start! seq_name:", seq_name, " ", frame)
            out = global_tracker.track(image, info=None)   # global re-detection
            boxes[frame, :] = out["target_bbox"]
            local_tracker.state = boxes[frame, :]          # reset local tracker
Threshold: the global tracker is triggered when p_mean ≤ 0.58.

10. Unifying Short and Long-Term Tracking with Graph Hierarchies

Difficult to understand.

 

Weekly Report (SUN YUYA)

This week I continued conducting experiments on long-term tracking. Apart from the experiments, I am preparing to read some important papers about long-term tracking.

 

1. CoTracker: It is Better to Track Together

The paper’s purpose:

(1) Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance.

Contributions:

(1) In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video.

(2) It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories.

Personal Evaluation:

Releases code. Very important. The ideas are very innovative. The paper is related to optical flow and trajectory. We may use this paper to find the motion of the environment.

 

2. Boosting UAV Tracking With Voxel-Based Trajectory-Aware Pre-Training

The paper’s purpose:

(1) To solve the problem that the Siamese tracker gets trapped when facing multiple views of the object in consecutive frames.

(2)The  general image-level pretrained backbone can overfit to holistic representations, causing the misalignment to learn object-level properties in UAV tracking.

Contributions:

(1) Fully exploit the stereoscopic representation for UAV tracking. Specifically, a novel pre-training paradigm method is proposed.

(2) Through trajectory-aware reconstruction training (TRT), the capability of the backbone to extract stereoscopic structure feature is strengthened without any parameter increment.

Personal Evaluation:

No code.  The paper is related to 3D tracking.

3. RSPT: Reconstruct Surroundings and Predict Trajectories for Generalizable Active Object Tracking

The paper’s purpose:

(1) However, building a generalizable active tracker that works robustly across different scenarios remains a challenge, especially in unstructured environments with cluttered obstacles and diverse layouts.

Contributions:

(1) To address this challenge, we present RSPT, a framework that forms a structure-aware motion representation by Reconstructing the Surroundings and Predicting the target Trajectory.

Personal Evaluation:

No  code. Don’t know how to exploit this paper.

 

4. ARTrack: Autoregressive Visual Tracking

The paper’s purpose:

(1) Trajectory prediction.

Contributions:

(1) ARTrack tackles tracking as a coordinate sequence interpretation task that estimates object trajectories progressively, where the current estimate is induced by previous states and in turn affects subsequences.

(2)This time-autoregressive approach models the sequential evolution of trajectories to keep tracing the object across frames, making it superior to existing template matching based trackers that only consider the per-frame localization accuracy.

Personal Evaluation:

Releases code. Worth reading.

5. Global Instance Tracking: Locating Target More Like Humans

The paper’s purpose:

(1) The massive gap indicates that research only measures tracking performance rather than intelligence.

(2) Occlusion and fast motion.

Contributions:

(1) In this article, we first propose the global instance tracking (GIT) task, which is supposed to search an arbitrary user-specified instance in a video without any assumptions about camera or motion consistency, to model the human visual tracking ability.

(2) Whereafter, we construct a high-quality and large-scale benchmark VideoCube to create a challenging environment.

(3) Finally, we design a scientific evaluation procedure using human capabilities as the baseline to judge tracking intelligence.

(4) Additionally, we provide an online platform with toolkit and an updated leaderboard.

Personal Evaluation:

A new benchmark dataset. Maybe it's very important.

6. Continuity-Aware Latent Interframe Information Mining for Reliable UAV Tracking

The paper’s purpose:

(1) Mainly focuses on explicit information to improve tracking performance, ignoring potential interframe connections.

Contributions:

(1) A network that can generate highly effective latent frames between two adjacent frames.

(2) Fully explore continuity-aware spatial-temporal information.

Personal Evaluation:

Releases code. The ideas are very innovative.

 

7. DropMAE: Masked Auto-encoders with Spatial-Attention Dropout for Tracking Tasks

The paper’s purpose:

(1) MAE (masked autoencoder) heavily relies on spatial cues while ignoring temporal relations for frame reconstruction.

Contributions:

(1) DropMAE  adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.

(2) Improving the pre-training process.

Personal Evaluation:

Releases code. The ideas are very innovative, but I can't fully understand the paper.

 

8. Learning Historical Status Prompt for Accurate and Robust Visual Tracking

The paper’s purpose:

(1) However, they struggle to make predictions when the target appearance changes, due to the limited historical information introduced by roughly cropping the current search region based on the previous frame's prediction.

(2) The incapacity to integrate abundant and effective historical information.

Contributions:

(1) HIP is a plug-and-play module that makes full use of search region features to introduce historical appearance information.

Personal Evaluation:

No code. How to produce the mask in tracking?

 

9. Lightweight Full-Convolutional Siamese Tracker

The paper’s purpose:

(1) Current tracking models are too big.

Contributions:

(1) LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to enhance the nonlinear expressiveness of the convolutional tracking pipeline.

(2) Additionally, it references successful factors of current lightweight trackers and introduces skip-connections and reuse of search area features.

Personal Evaluation:

Releases code. Another fast tracker. Worth reading.

 

10. LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking

The paper’s purpose:

(1) Too big and too slow.

Contributions:

The main innovations of LiteTrack encompass:

(1) Asynchronous feature extraction and interaction between the template and search region for better feature fusion and cutting redundant computation.

(2) Pruning encoder layers from a heavy tracker to refine the balance between performance and speed.

Personal Evaluation:

Releases code. Another fast tracker. Worth reading.

 

11. MixFormerV2: Efficient Fully Transformer Tracking

The paper’s purpose:

(1) Too slow.

Contributions:

(1) Our key design is to introduce four special prediction tokens and concatenate them with the tokens from target template and search areas.

(2) Then, we apply the unified transformer backbone on these mixed token sequences.

(3) Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads.

Personal Evaluation:

Releases code. Another fast tracker. Worth reading.

 

12. Mobile Vision Transformer-based Visual Object Tracking

The paper’s purpose:

(1) Too slow and too big.

Contributions:

The main innovations encompass:

(1) We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time.

(2) We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization.

Personal Evaluation:

Releases code. Another fast tracker. Worth reading.

13. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking

The paper’s purpose:

(1) It casts visual tracking as a sequence generation problem, which predicts object bounding boxes in an autoregressive fashion.

Contributions:

(1) The encoder extracts visual features with a bidirectional transformer.

(2) The decoder generates a sequence of bounding box values autoregressively with a causal transformer.

Personal Evaluation:

Releases code. Worth reading.

 

 

 

14. SOTVerse: A User-defined Task Space of Single Object Tracking

The paper’s purpose:

(1) The former means that existing datasets cannot be exploited comprehensively, while the latter neglects challenging factors in the evaluation process.

Contributions:

(1) We first propose a 3E Paradigm to describe tasks by three components (i.e., environment, evaluation, and executor).

(2) Then, we summarize task characteristics, clarify the organization standards, and construct SOTVerse with 12.56 million frames. Specifically, SOTVerse automatically labels challenging factors per frame, allowing users to generate user-defined spaces efficiently via construction rules.

(3) Besides, SOTVerse provides two mechanisms with new indicators and successfully evaluates trackers under various subtasks.

Personal Evaluation:

Releases code. Interesting innovation. Worth reading.

 

15. Towards Grand Unification of Object Tracking

The paper’s purpose:

(1) We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.

Contributions:

(1) Unicorn provides a unified solution, adopting the same input, backbone, embedding, and head across all tracking tasks.

Personal Evaluation:

Releases code. Nice paper. Worth reading.

 

16. Tracking through Containers and Occluders in the Wild

The paper’s purpose:

(1) Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems.

Contributions:

(1) We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists.

(2) To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment.

Personal Evaluation:

Releases code. Nice paper. Worth reading.