
Weekly Report (SUN YUYA)

(1) Writing the paper for ICIAE2024.

(2) In the long-term tracking experiments, I found that the similarity between the templates and the current appearance is not reliable, because what we need is a function that can identify whether two images show the same object, rather than a similarity distance between the two images.

During tracking, the same object can take on different appearances, so a similarity distance between 0 and 1 is not a suitable criterion.

So we should borrow another approach from the field of image classification.
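One way to make this concrete is to train a binary verifier that outputs the probability that two crops show the same object, instead of thresholding a distance. Below is a minimal sketch (PyTorch); the backbone, feature size, and pair sampling are my own placeholder assumptions, not any specific paper's design.

    import torch
    import torch.nn as nn

    class Verifier(nn.Module):
        """Binary classifier: do two crops show the same object?"""
        def __init__(self, backbone: nn.Module, feat_dim: int = 256):
            super().__init__()
            self.backbone = backbone  # shared feature extractor for both crops
            self.head = nn.Sequential(
                nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                nn.Linear(128, 1),    # logit for "same object"
            )

        def forward(self, template, candidate):
            f1 = self.backbone(template)
            f2 = self.backbone(candidate)
            return self.head(torch.cat([f1, f2], dim=-1)).squeeze(-1)

    # Stand-in backbone; trained with BCE on same/different pairs rather
    # than a metric-learning distance loss.
    backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))
    verifier = Verifier(backbone)
    logit = verifier(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
    loss = nn.functional.binary_cross_entropy_with_logits(logit, torch.ones(4))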

Weekly Report (SUN YUYA)

Experiments on re-detection with confidence scores for long-term tracking.

The confidence score is the maximum score used to select the best location in each frame, and it can serve as the basis for judging whether the target is lost. We can use either a single confidence score or a sequence of confidence scores to evaluate the tracking process.

MixFormer is an excellent short-term tracker and Unicorn is an excellent global tracker. Our goal is to turn MixFormer into a long-term tracker, so we want to add a re-detection mechanism: when the mechanism judges that tracking has failed, the running tracker switches to the global tracker to find the lost target object.

In these experiments, the re-detection mechanism exploits the confidence score to judge whether the target is lost.

LaSOT is a benchmark for long-term object tracking, and the average confidence score of MixFormer on LaSOT is 0.58 over frames where the IoU is zero.
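The 0.58 figure can be obtained by averaging the tracker's confidence over exactly those frames where the predicted box no longer overlaps the ground truth; a minimal sketch, assuming per-frame score and IoU arrays collected from an evaluation run:

    import numpy as np

    def mean_score_when_lost(scores: np.ndarray, ious: np.ndarray) -> float:
        """Average confidence over frames where the prediction misses (IoU == 0)."""
        lost = ious == 0.0
        return float(scores[lost].mean()) if lost.any() else float("nan")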

The experiments are as follows:

lasot | Success | Precision | Norm Precision |
Mixformer-base | 0.711 | 0.757 | 0.740 |
M_U_sc_30 | 0.713 | 0.761 | 0.743 |
M_U_sc_58 | 0.711 | 0.758 | 0.742 |
M_U_sc_sq_10_30 | 0.718 | 0.764 | 0.749 |
M_U_sc_sq_20_30 | 0.719 | 0.766 | 0.749 |
M_U_sc_sq_50_30 | 0.718 | 0.765 | 0.747 |
M_U_sc_sq_100_30 | 0.715 | 0.761 | 0.745 |
M_U_sc_sq_200_30 | 0.714 | 0.760 | 0.743 |
M_U_sc_psq_10_30 | 0.718 | 0.764 | 0.748 |
M_U_sc_psq_20_30 | 0.718 | 0.765 | 0.748 |
M_U_sc_sq_10_58 | 0.715 | 0.762 | 0.745 |
M_U_sc_sq_200_58 | 0.715 | 0.763 | 0.745 |

Here M_U_sc_30 is the tracker that triggers re-detection when the confidence score drops below 0.3, and M_U_sc_58 uses a 0.58 threshold. M_U_sc_sq_10_30 averages a sequence of ten confidence scores against the 0.3 threshold, and the psq variants apply a penalty to the averaged scores.
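For example, the M_U_sc_sq_20_30 variant corresponds to logic like the following sketch; the tracker interface (track(), conf_score, target_bbox, state) is assumed to match the P-mean snippet quoted further down this page.

    WINDOW = 20       # length of the score sequence (the "sq_20" part)
    THRESHOLD = 0.3   # re-detection threshold (the "30" part)
    scores = []

    def judge_and_redetect(image, frame, boxes, local_tracker, global_tracker):
        out = local_tracker.track(image)
        scores.append(out["conf_score"])
        if len(scores) == WINDOW:
            # If the windowed mean falls below the threshold, assume the
            # target is lost and let the global tracker re-detect it.
            if sum(scores) / WINDOW <= THRESHOLD:
                out = global_tracker.track(image, info=None)
            scores.clear()
        boxes[frame, :] = out["target_bbox"]
        local_tracker.state = boxes[frame, :]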

The 0.3 threshold may be more effective than 0.58, and the best sequence length may be 20. The penalty seems ineffective. It is too cumbersome to run more hyperparameter experiments.

In summary, exploiting the confidence score for re-detection works, but the gain is small. As the next step, we can run experiments on the similarity between the templates and the tracked object.

 

Weekly Report (SUN YUYA)

I found many re-detection methods for long-term object tracking.

1. ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-term Tracking

The  re-detection method:

After obtaining the best candidate in each frame, our tracker treats the tracked object as present or absent based on its confidence score and then determines the search state (local search or global search) in the next frame.

Key: training a deep verification network (a verifier) that scores the similarity between the template and the candidate.

2. SiamX: An Efficient Long-term Tracker Using Cross-level Feature Correlation and Adaptive Tracking Scheme

The  re-detection method:

Thresholding the confidence score.

3. Combining complementary trackers for enhanced long-term visual object tracking

Running two trackers.

Training a deep network to decide when to re-detect.


4. GUSOT: Green and Unsupervised Single Object Tracking for Long Video Sequences

Background motion estimation to predict location.

Evaluating two bboxes.

 

5. High-Performance Long-Term Tracking with Meta-Updater

Input: score, response map, target object, template

Model: LSTM

Output: score

6. MFT: Long-Term Tracking of Every Pixel

Optical flow is very important.

 

7. Multi-Template Temporal Siamese Network for Long-Term Object Tracking

Predicting trajectories.

The  re-detection method:

Input: distance error

Output: Reliability Score


 

8. Robust Long-Term Object Tracking via Improved Discriminative Model Prediction

We augment various backgrounds that are not included in the search area to train a more robust model in the background clutter.

 


9. Target-Aware Tracking with Long-term Context Attention

The  re-detection method:

P-mean score (a position-weighted mean over a 10-frame window):

        score_list.append(out["conf_score"])   # confidence of the current frame
        score_num += 1
        if score_num == 10:                    # evaluate every 10 frames
            n = score_num
            p_mean = 0.0
            # Position-weighted mean: earlier scores in the window appear in
            # more terms, so they receive a larger total weight.
            for i in range(n):
                for j in range(i, n):
                    p_mean += score_list[i] / (j + 1)
            p_mean = p_mean / n
            score_num = 0
            score_list = []
            # Below the threshold, assume the target is lost and let the
            # global tracker re-detect it, then re-seed the local tracker.
            if p_mean <= 0.58:  # and idx % 5 == 0:
                print("global start ! seq_name :", seq_name, " ", frame)
                out = global_tracker.track(image, info=None)
                boxes[frame, :] = out["target_bbox"]
                local_tracker.state = boxes[frame, :]
Threshold: the P-mean is compared against 0.58 (see the snippet above).

10. Unifying Short and Long-Term Tracking with Graph Hierarchies

Difficult to understand.

 

Weekly Report (SUN YUYA)

This week I continued the long-term tracking experiments. Besides the experiments, I also read some important papers about long-term tracking.

 

1. CoTracker: It is Better to Track Together

The paper’s purpose:

(1) Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance.

Contributions:

(1) In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video.

(2)It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories.

Personal Evaluation:

Code released. Very important. The ideas are very innovative. The paper is related to optical flow and trajectories. We may use it to estimate the motion of the environment.

 

2. Boosting UAV Tracking With Voxel-Based Trajectory-Aware Pre-Training

The paper’s purpose:

(1) To solve the problem that the Siamese tracker gets trapped when facing multiple views of the object in consecutive frames.

(2)The  general image-level pretrained backbone can overfit to holistic representations, causing the misalignment to learn object-level properties in UAV tracking.

Contributions:

(1) Fully exploit the stereoscopic representation for UAV tracking. Specifically, a novel pre-training paradigm method is proposed.

(2) Through trajectory-aware reconstruction training (TRT), the capability of the backbone to extract stereoscopic structure feature is strengthened without any parameter increment.

Personal Evaluation:

No code.  The paper is related to 3D tracking.

3. RSPT: Reconstruct Surroundings and Predict Trajectories for Generalizable Active Object Tracking

The paper’s purpose:

(1) However, building a generalizable active tracker that works robustly across different scenarios remains a challenge, especially in unstructured environments with cluttered obstacles and diverse layouts.

Contributions:

(1) To address this challenge, we present RSPT, a framework that forms a structure-aware motion representation by Reconstructing the Surroundings and Predicting the target Trajectory.

Personal Evaluation:

No code. I don't know how to exploit this paper.

 

4. Autoregressive Visual Tracking (ARTrack)

The paper’s purpose:

(1) Trajectory prediction.

Contributions:

(1)ARTrack tackles tracking as a coordinate sequence interpretation task that estimates object trajectories progressively, where the current estimate is induced by previous states and in turn affects subsequences.

(2)This time-autoregressive approach models the sequential evolution of trajectories to keep tracing the object across frames, making it superior to existing template matching based trackers that only consider the per-frame localization accuracy.

Personal Evaluation:

Release  code. Worth reading.

5. Global Instance Tracking: Locating Target More Like Humans

The paper’s purpose:

(1)The massive gap indicates that researches only measure tracking performance rather than intelligence.

(2) Occlusion and fast motion.

Contributions:

(1) In this article, we first propose the global instance tracking (GIT) task, which is supposed to search an arbitrary user-specified instance in a video without any assumptions about camera or motion consistency, to model the human visual tracking ability.

(2) Whereafter, we construct a high-quality and large-scale benchmark VideoCube to create a challenging environment.

(3) Finally, we design a scientific evaluation procedure using human capabilities as the baseline to judge tracking intelligence.

(4) Additionally, we provide an online platform with toolkit and an updated leaderboard.

Personal Evaluation:

A new dataset benchmark. Maybe it’s very important.

6. Continuity-Aware Latent Interframe Information Mining for Reliable UAV Tracking

The paper’s purpose:

(1) Mainly focuses on explicit information to improve tracking performance, ignoring potential interframe connections.

Contributions:

(1) A network that can generate a highly effective latent frame between two adjacent frames.

(2) Fully explore continuity-aware spatial-temporal information.

Personal Evaluation:

Code released. The ideas are very innovative.

 

7. DropMAE: Masked Auto-encoders with Spatial-Attention Dropout for Tracking Tasks

The paper’s purpose:

(1) MAE (masked autoencoder ) heavily relies on spatial cues while ignoring temporal relations for frame reconstruction.

Contributions:

(1) DropMAE  adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.

(2) Improving the pre-training process.

Personal Evaluation:

Code released. The ideas are very innovative, but I can't fully understand the paper.

 

8. Learning Historical Status Prompt for Accurate and Robust Visual Tracking

The paper’s purpose:

(1) However, they struggle to make prediction when the target appearance changes due to the limited historical information introduced by roughly cropping the current search region based on the predicted result of previous frame.

(2) The incapacity to integrate abundant and effective historical information.

Contributions:

(1) HIP is a plug-and-play module that makes full use of search region features to introduce historical appearance information.

Personal Evaluation:

No code. How can masks be produced in tracking?

 

9. Lightweight Full-Convolutional Siamese Tracker

The paper’s purpose:

(1) The current tracking model is too big.

Contributions:

(1) LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to enhance the nonlinear expressiveness of the convolutional tracking pipeline.

(2) Additionally, it references successful factors of current lightweight trackers and introduces skip-connections and reuse of search area features.

Personal Evaluation:

Code released. Another fast tracker. Worth reading.

 

10. LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking

The paper’s purpose:

(1) Too big  and too slow.

Contributions:

The main innovations of LiteTrack encompass:

(1) Asynchronous feature extraction and interaction between the template and search region for better feature fusion and less redundant computation.

(2) Pruning encoder layers from a heavy tracker to refine the balance between performance and speed.

Personal Evaluation:

Code released. Another fast tracker. Worth reading.

 

11. MixFormerV2: Efficient Fully Transformer Tracking

The paper’s purpose:

(1) Too slow.

Contributions:

(1) Our key design is to introduce four special prediction tokens and concatenate them with the tokens from target template and search areas.

(2)Then, we apply the unified transformer backbone on these mixed token sequence.

(3) Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads.

Personal Evaluation:

Code released. Another fast tracker. Worth reading.

 

12. Mobile Vision Transformer-based Visual Object Tracking

The paper’s purpose:

(1) Too slow and too big.

Contributions:

The main innovations encompass:

(1) We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time.

(2) We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization.

Personal Evaluation:

Code released. Another fast tracker. Worth reading.

13. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking

The paper’s purpose:

(1) It casts visual tracking as a sequence generation problem, which predicts object bounding boxes in an autoregressive fashion.

Contributions:

(1)The encoder extracts visual features with a bidirectional transformer.

(2)While the decoder generates a sequence of bounding box values autoregressively with a causal transformer.

Personal Evaluation:

Release  code. Worth reading.

 
14. SOTVerse: A User-defined Task Space of Single Object Tracking

The paper’s purpose:

(1) The former causes existing datasets can not be exploited comprehensively, while the latter neglects challenging factors in the evaluation process.

Contributions:

(1)We first propose a 3E Paradigm to describe tasks by three components (i.e., environment, evaluation, and executor).

(2) Then, we summarize task characteristics, clarify the organization standards, and construct SOTVerse with 12.56 million frames. Specifically, SOTVerse automatically labels challenging factors per frame, allowing users to generate user-defined spaces efficiently via construction rules.

(3) Besides, SOTVerse provides two mechanisms with new indicators and successfully evaluates trackers under various subtasks.

Personal Evaluation:

Release  code. Interesting Innovation. Worth reading.

 

15. Towards Grand Unification of Object Tracking

The paper’s purpose:

(1)We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.

Contributions:

(1) Unicorn provides a unified solution, adopting the same input, backbone, embedding, and head across all tracking tasks.

Personal Evaluation:

Release  code. Nice paper. Worth reading.

 

16. Tracking through Containers and Occluders in the Wild

The paper’s purpose:

(1)Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems

Contributions:

(1) We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists.

(2) To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment.

Personal Evaluation:

Release  code. Nice paper. Worth reading.

Weekly Report (SUN YUYA)

This week I continued the long-term tracking experiments. Besides the experiments, I kept reading papers about long-term object tracking.

Among these papers, two trackers stand out as practical, namely Unicorn and MixFormerV2. Unicorn is a tracker with global detection, which can locate the target object anywhere in the frame. MixFormerV2 is a fast and accurate tracker, which can serve as the local tracker. By combining the advantages of the two trackers, we can build an excellent long-term tracker.

The evaluations of Unicorn and MixFormerV2 on the open benchmarks (LaSOT, TrackingNet, and GOT-10k) are as follows:

Unicorn: 10 fps.

MixFormerV2: 100 fps.

trackingnet | AUC | OP50 | OP75 | Precision | Norm Precision |
Unicorn | 95.24 | 100.00 | 100.00 | 100.00 | 100.00 |

got10k | AUC | OP50 | OP75 | Precision | Norm Precision |
Unicorn | 82.17 | 92.17 | 82.61 | 80.01 | 91.05 |

lasot | AUC | OP50 | OP75 | Precision | Norm Precision |
Unicorn | 67.49 | 77.64 | 64.81 | 73.00 | 75.52 |
MixFormerDeit-small | 60.75 | 71.73 | 53.66 | 60.91 | 68.98 |
MixFormerDeit-big | 69.95 | 81.65 | 68.83 | 75.11 | 79.69 |

 

 

I changed my mind: I should read more new papers instead of conducting experiments in haste. The specific analyses of the new papers are as follows:

1. Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking

The paper’s purpose:

To solve the problem that traditional CNNs are too slow.

Contributions:

(1) The paper proposes a framework, where feature learning and template-search coupling are integrated into an efficient one-stream ViT to avoid an extra heavy relation modeling module.

(2)The proposed Aba-ViT exploits an adaptive and background-aware token computation method to reduce inference time.

(3)This approach adaptively discards tokens based on learned halting probabilities, which a priori are higher for background tokens than target ones.

(4) Very fast: 180 fps!

Personal Evaluation:

Very fast! And the paper provides code. We can let it act as the main tracker.

 

2. Boosting UAV Tracking With Voxel-Based Trajectory-Aware Pre-Training

The paper’s purpose:

(1) To solve the problem that the Siamese tracker gets trapped when facing multiple views of the object in consecutive frames.

(2)The  general image-level pretrained backbone can overfit to holistic representations, causing the misalignment to learn object-level properties in UAV tracking.

Contributions:

(1) Fully exploit the stereoscopic representation for UAV tracking. Specifically, a novel pre-training paradigm method is proposed.

(2) Through trajectory-aware reconstruction training (TRT), the capability of the backbone to extract stereoscopic structure feature is strengthened without any parameter increment.

Personal Evaluation:

No code.  The paper is related to 3D tracking.

3. Compact Transformer Tracker with Correlative Masked Modeling

The paper’s purpose:

(1) Proving that the traditional self-attention structure is sufficient for information aggregation, and structural adaptation is unnecessary.

Contributions:

(1) The paper attaches a lightweight correlative masked decoder which reconstructs the original template and search image from the corresponding masked tokens.

(2) The structure is very simple.

Personal Evaluation:

Code released. The benchmark results are very high. Nice paper, but the analysis of the network is beyond my ability.

 

4. Continuity-Aware Latent Interframe Information Mining for Reliable UAV Tracking

The paper’s purpose:

(1) Mainly focuses on explicit information to improve tracking performance, ignoring potential interframe connections.

Contributions:

(1) A network that can generate a highly effective latent frame between two adjacent frames.

(2) Fully explore continuity-aware spatial-temporal information.

Personal Evaluation:

Code released. The ideas are very innovative.

 

5. CiteTracker: Correlating Image and Text for Visual Tracking

The paper’s purpose:

(1) Tracking targets with drastic variations.

Contributions:

(1) Developing a text generation module to convert the target image patch into a descriptive text containing its class and attribute information, providing a comprehensive reference point for the target.

(2) A dynamic description module is designed to adapt to target variations for more effective target representation.

Personal Evaluation:

Partial code released. The ideas are very innovative, but I don't know if it works.

6. CoTracker: It is Better to Track Together

The paper’s purpose:

(1) Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance.

Contributions:

(1) In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video.

(2)It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories.

Personal Evaluation:

Code released. Very important. The ideas are very innovative. The paper is related to optical flow and trajectories. We may use it to estimate the motion of the environment.

7. DropMAE: Masked Auto-encoders with Spatial-Attention Dropout for Tracking Tasks

The paper’s purpose:

(1) MAE (masked autoencoder ) heavily relies on spatial cues while ignoring temporal relations for frame reconstruction.

Contributions:

(1) DropMAE  adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.

(2) Improving the pre-training process.

Personal Evaluation:

Code released. The ideas are very innovative, but I can't fully understand the paper.

8. Efficient Training for Visual Tracking with Deformable Transformer

The paper’s purpose:

(1) To address the problems of resource-intensive training (prolonged GPU hours) and high GFLOPs during inference, the paper aims to improve inefficient training methods and convolution-based target heads.

Contributions:

(1)Our framework utilizes an efficient encoder-decoder structure where the deformable transformer decoder acting as a target head, achieves higher sparsity than traditional convolution heads, resulting in decreased GFLOPs.

(2)For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique, significantly accelerating model’s convergence.

Personal Evaluation:

No code.

9. Efficient Visual Tracking with Exemplar Transformers

The paper’s purpose:

(1) However, in the pursuit of increased tracking performance, runtime is often hindered.
(2) Furthermore, efficient tracking architectures have received surprisingly little attention.

Contributions:

(1) E.T.Track, our visual tracker that incorporates Exemplar Transformer modules, runs at 47 FPS on a CPU.

(2)In this paper, we introduce the Exemplar Transformer, a transformer module utilizing a single instance level attention layer for realtime visual object tracking.

Personal Evaluation:

Code released. Very fast and the architecture is very simple, but the benchmark results seem not very good.

 

10. Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking

The paper’s purpose:

(1) Throw  the backbone away.

Contributions:

(1) Bridging the gap between single-branch networks and discriminative models.

(2) Bolstering tracking robustness in the presence of distractors via bidirectional cycle tracking verification.

Personal Evaluation:

No code. Old innovation point.

11. Exploiting Image-Related Inductive Biases in Single-Branch Visual Tracking

The paper’s purpose:

(1) However, existing trackers are hampered by low speed, limiting their applicability on devices with limited computational power.

Contributions:

(1) The central idea of HiT is the Bridge Module, which bridges the gap between modern lightweight transformers and the tracking framework.

(2)We also propose a novel dual-image position encoding technique that simultaneously encodes the position information of both the search region and template images.

Personal Evaluation:

Partial code released. Old innovation point.

61 fps is not sufficient.

 

12. Generalized Relation Modeling for Transformer Tracking

The paper’s purpose:

(1) Existing one-stream trackers always let the template interact with all parts inside the search region throughout all the encoder layers.

Contributions:

(1) Enabling more flexible relation modeling by selecting appropriate search tokens to interact with template token.

(2)An attention masking strategy and the Gumbel-Softmax technique are introduced to facilitate the parallel computation and end-to-end learning of the token division module.

Personal Evaluation:

Code released. The innovation concerns the network structure.

 

13. Global Instance Tracking: Locating Target More Like Humans

The paper’s purpose:

(1)The massive gap indicates that researches only measure tracking performance rather than intelligence.

(2) Occlusion and fast motion.

Contributions:

(1) In this article, we first propose the global instance tracking (GIT) task, which is supposed to search an arbitrary user-specified instance in a video without any assumptions about camera or motion consistency, to model the human visual tracking ability.

(2) Whereafter, we construct a high-quality and large-scale benchmark VideoCube to create a challenging environment.

(3) Finally, we design a scientific evaluation procedure using human capabilities as the baseline to judge tracking intelligence.

(4) Additionally, we provide an online platform with toolkit and an updated leaderboard.

Personal Evaluation:

A new dataset benchmark. Maybe it’s very important.

 

14. Learning Historical Status Prompt for Accurate and Robust Visual Tracking

The paper’s purpose:

(1) However, they struggle to make prediction when the target appearance changes due to the limited historical information introduced by roughly cropping the current search region based on the predicted result of previous frame.

(2) The incapacity to integrate abundant and effective historical information.

Contributions:

(1) HIP is a plug-and-play module that makes full use of search region features to introduce historical appearance information.

Personal Evaluation:

No code. How can masks be produced in tracking?

 

15. Leveraging the Power of Data Augmentation for Transformer-based Tracking

The paper’s purpose:

(1) Previous works focus on designing effective architectures suited for tracking, but ignore that data augmentation is equally crucial for training a well-performing model.

Contributions:

(1) We optimize existing random cropping via a dynamic search radius mechanism and simulation for boundary samples.

(2) We propose a token-level feature mixing augmentation strategy, which enables the model against challenges like background interference.

Personal Evaluation:

No code. It's difficult to add a new tracking toolkit.

 

16. Lightweight Full-Convolutional Siamese Tracker

The paper’s purpose:

(1) The current tracking model is too big.

Contributions:

(1) LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to enhance the nonlinear expressiveness of the convolutional tracking pipeline.

(2) Additionally, it references successful factors of current lightweight trackers and introduces skip-connections and reuse of search area features.

Personal Evaluation:

Code released. Another fast tracker. Worth reading.

 

 

17. LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking

The paper’s purpose:

(1) Too big  and too slow.

Contributions:

The main innovations of LiteTrack encompass:

(1) Asynchronous feature extraction and interaction between the template and search region for better feature fusion and less redundant computation.

(2) Pruning encoder layers from a heavy tracker to refine the balance between performance and speed.

Personal Evaluation:

Code released. Another fast tracker. Worth reading.

 

18. MixFormerV2: Efficient Fully Transformer Tracking

The paper’s purpose:

(1) Too slow.

Contributions:

(1) Our key design is to introduce four special prediction tokens and concatenate them with the tokens from target template and search areas.

(2)Then, we apply the unified transformer backbone on these mixed token sequence.

(3) Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads.

Personal Evaluation:

Code released. Another fast tracker. Worth reading.

 

19. Mobile Vision Transformer-based Visual Object Tracking

The paper’s purpose:

(1) Too slow and too big.

Contributions:

The main innovations encompass:

(1) We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time.

(2) We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization.

Personal Evaluation:

Code released. Another fast tracker. Worth reading.

 

20. OmniTracker: Unifying Object Tracking by Tracking-with-Detection

The paper’s purpose:

(1) Combining SOT and MOT.

Contributions:

(1) We propose a novel tracking-with-detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for association.

(2) Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline.

Personal Evaluation:

No  code. Useless.

 

21. Robust Object Modeling for Visual Tracking

The paper’s purpose:

(1) However, separate template learning lacks communication between the template and search regions, which brings difficulty in extracting discriminative target-oriented features.

(2)On the other hand, interactive template learning produces hybrid template features, which may introduce potential distractors to the template via the cluttered search regions

Contributions:

(1)To enjoy the merits of both methods, we propose a robust object modeling framework for visual tracking (ROMTrack), which simultaneously models the inherent template and the hybrid template features.

(2)To further enhance robustness, we present novel variation tokens to depict the ever-changing appearance of target objects.

Personal Evaluation:

Release  code. There is too much nonsense in this paper. Normal paper.

 

22. RSPT: Reconstruct Surroundings and Predict Trajectories for Generalizable Active Object Tracking

The paper’s purpose:

(1) However, building a generalizable active tracker that works robustly across different scenarios remains a challenge, especially in unstructured environments with cluttered obstacles and diverse layouts.

Contributions:

(1) To address this challenge, we present RSPT, a framework that forms a structure-aware motion representation by Reconstructing the Surroundings and Predicting the target Trajectory.

Personal Evaluation:

No code. I don't know how to exploit this paper.

 

23. Segment and Track Anything

The paper’s purpose:

(1) To precisely and effectively segment and track any object in a video.

Contributions:

(1)Additionally, SAM-Track employs multimodal interaction methods that enable users to select multiple objects in videos for tracking, corresponding to their specific requirements.

(2) As a result, SAM-Track can be used across an array of fields, ranging from drone technology, autonomous driving, medical imaging, augmented reality, to biological analysis.

Personal Evaluation:

Code released. I don't need a tracker with so many functions.

 

24. Separable Self and Mixed Attention Transformers for Efficient Object Tracking

The paper’s purpose:

(1) However, the transformer-based models are underutilized for Siamese lightweight tracking due to the computational complexity of their attention blocks.

Contributions:

(1) This paper proposes an efficient self- and mixed-attention transformer-based architecture for lightweight tracking.

(2) Our prediction head performs global contextual modeling of the encoded features by leveraging efficient self-attention blocks for robust target state estimation.

Personal Evaluation:

Release  code. Normal paper.

 

25. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking

The paper’s purpose:

(1) It casts visual tracking as a sequence generation problem, which predicts object bounding boxes in an autoregressive fashion.

Contributions:

(1)The encoder extracts visual features with a bidirectional transformer.

(2)While the decoder generates a sequence of bounding box values autoregressively with a causal transformer.

Personal Evaluation:

Release  code. Worth reading.

 

26. SiamTHN: Siamese Target Highlight Network for Visual Tracking

The paper’s purpose:

(1)The majority of siamese network based trackers now in use treat each channel in the feature maps generated by the backbone network equally, making the similarity response map sensitive to background influence and hence challenging to focus on the target region.

(2) Additionally, there are no structural links between the classification and regression branches in these trackers, and the two branches are optimized separately during training.

(3)Therefore, there is a misalignment between the classification and regression branches, which results in less accurate tracking results.

Contributions:

(1) In this paper, a Target Highlight Module is proposed to help the generated similarity response maps to be more focused on the target region.
(2) To reduce the misalignment and produce more precise tracking results, we propose a corrective loss to train the model.

Personal Evaluation:

No  code. Normal paper.

 

27. SOTVerse: A User-defined Task Space of Single Object Tracking

The paper’s purpose:

(1) The former causes existing datasets can not be exploited comprehensively, while the latter neglects challenging factors in the evaluation process.

Contributions:

(1)We first propose a 3E Paradigm to describe tasks by three components (i.e., environment, evaluation, and executor).

(2) Then, we summarize task characteristics, clarify the organization standards, and construct SOTVerse with 12.56 million frames. Specifically, SOTVerse automatically labels challenging factors per frame, allowing users to generate user-defined spaces efficiently via construction rules.

(3) Besides, SOTVerse provides two mechanisms with new indicators and successfully evaluates trackers under various subtasks.

Personal Evaluation:

Release  code. Interesting Innovation. Worth reading.

 

28. Towards Efficient Training with Negative Samples in Visual Tracking

The paper’s purpose:

(1)Current state-of-the-art (SOTA) methods in visual object tracking often require extensive computational resources and vast amounts of training data, leading to a risk of overfitting.

Contributions:

(1) To handle the negative samples effectively, we adopt a distribution-based head, which models the bounding box as a distribution of distances to express uncertainty about the target's location in the presence of negative samples, offering an efficient way to manage the mixed sample training.

(2) Our approach introduces a target-indicating token,encapsulating the target’s precise location within the template image.

Personal Evaluation:

No  code. Normal paper.

 

29. Towards Grand Unification of Object Tracking

The paper’s purpose:

(1)We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.

Contributions:

(1) Unicorn provides a unified solution, adopting the same input, backbone, embedding, and head across all tracking tasks.

Personal Evaluation:

Release  code. Nice paper. Worth reading.

 

30. Tracking through Containers and Occluders in the Wild

The paper’s purpose:

(1)Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems

Contributions:

(1) We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists.

(2) To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment.

Personal Evaluation:

Release  code. Nice paper. Worth reading.

 

31. SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking

The paper’s purpose:

(1)However, the dynamic changes in flight maneuver and viewpoint encountered in UAV tracking pose significant difficulties, e.g., aspect ratio change, and scale variation.

(2) The conventional cross-correlation operation, while commonly used, has limitations in effectively capturing perceptual similarity and incorporates extraneous background information.

Contributions:

(1) The proposed method designs a new task-specific object saliency mining network to refine the cross-correlation operation and effectively discriminate foreground and background information.

(2) Additionally, a saliency adaptation embedding operation dynamically generates tokens based on initial saliency, thereby reducing the computational complexity of the Transformer architecture.

(3) Finally, a lightweight saliency filtering Transformer further refines saliency information and increases the focus on appearance information.

Personal Evaluation:

Release  code. Normal paper.

 

32. Transparent Object Tracking with Enhanced Fusion Module

The paper’s purpose:

(1)Accurate tracking of transparent objects, such as glasses, plays a critical role in many robotic tasks such as robot-assisted living.

(2) Due to the adaptive and often reflective texture of such objects, traditional tracking algorithms that rely on general-purpose learned features suffer from reduced performance.

(3) For example, many of the current days’ transformer-based trackers are fully pre-trained and are sensitive to any latent space perturbations.

Contributions:

(1) Our proposed fusion module, composed of a transformer encoder and an MLP module, leverages key query-based transformations to embed the transparency information into the tracking pipeline.

(2) We also present a new two-step training strategy for our fusion module to effectively merge transparency features.

Personal Evaluation:

Partial code released. Normal paper.

 

33. Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving

The paper’s purpose:

(1) Despite remarkable progress in image and video recognition via representation learning, current research still focuses on designing specialized networks for singular, homogeneous, or simple combination of tasks.

Contributions:

(1) We design a Curriculum training, Pseudo-labeling, and Fine-tuning (CPF) scheme to successfully train VTDNet on all tasks and mitigate performance loss.

Personal Evaluation:

Partial code released. Normal paper.

 

34. Autoregressive Visual Tracking (ARTrack)

The paper’s purpose:

(1) Trajectory prediction.

Contributions:

(1)ARTrack tackles tracking as a coordinate sequence interpretation task that estimates object trajectories progressively, where the current estimate is induced by previous states and in turn affects subsequences.

(2)This time-autoregressive approach models the sequential evolution of trajectories to keep tracing the object across frames, making it superior to existing template matching based trackers that only consider the per-frame localization accuracy.

Personal Evaluation:

Release  code. Worth reading.

 

35. VideoTrack: Learning to Track Objects via Video Transformer

The paper’s purpose:

(1) Existing Siamese tracking methods, which are built on pair-wise matching between two single frames, heavily rely on additional sophisticated mechanism to exploit temporal information among successive video frames, hindering them from efficiency and industrial deployments.

Contributions:

(1)Specifically, we adapt the standard video transformer architecture to visual tracking by enabling spatiotemporal feature learning directly from frame-level patch sequences.

(2) To better adapt to the tracking task, we carefully blend the spatiotemporal information in the video clips through sequential multi-branch triplet blocks, which formulates a video transformer backbone.

(3) Our experimental study compares different model variants, such as tokenization strategies, hierarchical structures, and video attention schemes.

(4) Then, we propose a disentangled dual-template mechanism that decouples static and dynamic appearance clues over time, and reduces temporal redundancy in video frames.

Personal Evaluation:

No code. I could not understand the paper.

 

36. Foreground-Background Distribution Modeling Transformer for Visual Object Tracking

The paper’s purpose:

(1) However, the feature learning of these Transformer-based trackers is easily disturbed by complex backgrounds.

Contributions:

(1) To address the above limitations, we propose a novel foreground-background distribution modeling transformer for visual object tracking (F-BDMTrack), including a fore-background agent learning (FBAL) module and a distribution-aware attention (DA2) module in a unified transformer architecture.

(2)The proposed F-BDMTrack enjoys several merits. First, the proposed FBAL module can effectively mine fore-background information with designed fore-background agents.

(3) Second, the DA2 module can suppress the incorrect interaction between foreground and background by modeling fore-background distribution similarities.

(4) Finally, F-BDMTrack can extract discriminative features under ever-changing tracking scenarios for more accurate target state estimation.

Personal Evaluation:

No code. I could not understand the paper.

 

37. Representation Learning for Visual Object Tracking by Masked Appearance Transfer

The paper’s purpose:

(1) However, few works study tracking-specific representation learning; most trackers directly use ImageNet pre-trained representations.

Contributions:

(1) However, for the template, we make the decoder reconstruct the target appearance within the search region.

(2) We randomly mask out the inputs, thereby making the learned representations more discriminative.

Personal Evaluation:

Code released, but the code is complex.

 

38. ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking

The paper’s purpose:

In this paper, we demonstrate that it is possible to narrow or even close this gap while achieving high tracking speed based on the smaller input size.

Contributions:

(1) To this end, we non-uniformly resize the cropped image to have a smaller input size while the resolution of the area where the target is more likely to appear is higher and vice versa.

(2)This enables us to solve the dilemma of attending to a larger visual field while retaining more raw information for the target despite a smaller input size.

(3) Our formulation for the nonuniform resizing can be efficiently solved through quadratic programming (QP) and naturally integrated into most of the crop-based local trackers.

Personal Evaluation:

Partial code released. Normal paper.

Weekly Report (SUN YUYA)

This week I started conducting experiments on long-term tracking. Besides the experiments, I also read some papers about long-term object tracking.

The experiments are as follows: starting from a simple Siamese tracker (SiamFC++), I modified it into a long-term tracker, which needs a mechanism to judge whether the target object is absent. The most concise way is to exploit the confidence score, so I want to find a relationship between the score and the IoU. The specific results are shown in the figure.

From the figure, we can see that the fluctuation trend of the score is basically consistent with that of the IoU. But relying on the score alone is not enough, so we may conduct experiments on a sequence of scores.
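To quantify this beyond eyeballing the curves, the two per-frame signals can be correlated directly; a small sketch with synthetic stand-in data (in practice scores comes from the tracker and ious from the ground truth):

    import numpy as np

    # Synthetic stand-ins for one sequence's per-frame outputs.
    rng = np.random.default_rng(0)
    ious = np.clip(rng.normal(0.6, 0.2, 500), 0.0, 1.0)
    scores = np.clip(ious + rng.normal(0.0, 0.1, 500), 0.0, 1.0)

    # Pearson correlation between the confidence and IoU curves.
    r = np.corrcoef(scores, ious)[0, 1]
    print(f"score-IoU correlation: {r:.3f}")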

 

I changed my mind: I should read more new papers instead of conducting experiments in haste. The specific analyses of the new papers are as follows:

1. Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking

The paper’s purpose:

To solve the problem that traditional CNNs are too slow.

Contributions:

(1) The paper proposes a framework, where feature learning and template-search coupling are integrated into an efficient one-stream ViT to avoid an extra heavy relation modeling module.

(2)The proposed Aba-ViT exploits an adaptive and background-aware token computation method to reduce inference time.

(3)This approach adaptively discards tokens based on learned halting probabilities, which a priori are higher for background tokens than target ones.

(4) Very fast: 180 fps!

Personal Evaluation:

Very fast! And the paper provides code. We can let it act as the main tracker.

 

2. Boosting UAV Tracking With Voxel-Based Trajectory-Aware Pre-Training

The paper’s purpose:

(1) To solve the problem that the Siamese tracker gets trapped when facing multiple views of the object in consecutive frames.

(2)The  general image-level pretrained backbone can overfit to holistic representations, causing the misalignment to learn object-level properties in UAV tracking.

Contributions:

(1) Fully exploit the stereoscopic representation for UAV tracking. Specifically, a novel pre-training paradigm method is proposed.

(2) Through trajectory-aware reconstruction training (TRT), the capability of the backbone to extract stereoscopic structure feature is strengthened without any parameter increment.

Personal Evaluation:

No code.  The paper is related to 3D tracking.

3. Compact Transformer Tracker with Correlative Masked Modeling

The paper’s purpose:

(1) Proving that the traditional self-attention structure is sufficient for information aggregation, and structural adaptation is unnecessary.

Contributions:

(1) The paper attaches a lightweight correlative masked decoder which reconstructs the original template and search image from the corresponding masked tokens.

(2) The structure is very simple.

Personal Evaluation:

Code released. The benchmark results are very high. Nice paper, but the analysis of the network is beyond my ability.

 

4. Continuity-Aware Latent Interframe Information Mining for Reliable UAV Tracking

The paper’s purpose:

(1) Mainly focuses on explicit information to improve tracking performance, ignoring potential interframe connections.

Contributions:

(1) A network that can generate a highly effective latent frame between two adjacent frames.

(2) Fully explore continuity-aware spatial-temporal information.

Personal Evaluation:

Code released. The ideas are very innovative.

Weekly Report (SUN YUYA)

This week I am reading some papers about long-term object tracking and downloading the related tracking datasets. Compared to normal object tracking datasets, the long-term datasets lack download sources and are hard to obtain. Everything needs to start from scratch.

I want to find an accurate method to evaluate whether the tracked target is absent. Most papers exploit the confidence score to judge the state of tracking, but I think it is not reliable. A trajectory predictor is expected to correct the tracker, but only one paper adopts this approach, using a distribution to guess a location.

The metrics of long-term tracking are F1-score, Precision, and Recall, which differ from those of single object tracking. But these papers did not test on the same benchmark, which makes it difficult for me to compare the trackers. (The most common benchmark is VOT2020-LT.)
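For reference, a sketch of the VOT-LT definitions as I understand them, where G_t is the ground truth at frame t, A_t(\tau_\theta) the prediction reported at confidence threshold \theta, and \Omega their overlap:

    \mathrm{Pr}(\tau_\theta) = \frac{1}{N_p} \sum_{t:\, A_t(\tau_\theta) \neq \emptyset} \Omega\!\left(A_t(\tau_\theta), G_t\right), \qquad
    \mathrm{Re}(\tau_\theta) = \frac{1}{N_g} \sum_{t:\, G_t \neq \emptyset} \Omega\!\left(A_t(\tau_\theta), G_t\right)

    F(\tau_\theta) = \frac{2 \, \mathrm{Pr}(\tau_\theta) \, \mathrm{Re}(\tau_\theta)}{\mathrm{Pr}(\tau_\theta) + \mathrm{Re}(\tau_\theta)}

Here N_p is the number of frames where the tracker reports the target as present and N_g the number of frames where the target is actually visible; the reported F1 is the maximum of F over the threshold \theta.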

The specific analyses of the papers are as follows:

1. Combining complementary trackers for enhanced long-term visual object tracking

The paper’s purpose:

Combining the capabilities of baseline trackers in the context of long-term visual object tracking.

Contributions:

(1) Proposing a strategy that perceives, via an online-learned deep verification model, whether the two trackers are following the target object, selects the better-performing tracker, and corrects their performance when they fail.

Personal Evaluation:

A long-term tracker usually has the ability to switch between a local tracker and a global tracker. This paper may propose a novel way to do so.

Details:

There are two trackers to run, namely Stark and SuperDiMP.

Stark can predict a more accurate bounding box.

SuperDiMP outputs a more reliable score for judging whether the target is lost.

The two trackers run together, and the paper proposes a policy to select which tracker is better at frame t.

The judgement is made from the confidence score and the verifier.

The verifier is trained offline and updated online with image patches.

Deep evaluation: very simple design and easy to reproduce. The key technology is a simple deep binary classifier, and the training data can be collected during the training process. But the paper did not provide code, and I don't know if it really works.

The flow chart is as follows:

 

 

2. Target-Aware Tracking with Long-term Context Attention

The paper’s purpose:

(1) Unlike Siamese trackers, exploiting contextual information.

(2) Coping with large appearance changes, rapid target movement, and attraction from similar objects.

Contributions:

(1) Proposing a LCA (transformer ) which uses the target state from the previous frame to exclude the interference of similar objects and complex backgrounds.

(2) Proposing a concise and efficient online updating approach based on classification confidence to select high-quality templates with a very low computation burden.

Personal Evaluation:

A normal tracker with a new transformer.

 

Details:

The first contribution relies on the Swin transformer.

The template update mechanism is the key contribution. It exploits continuous confidence scores: a plain mean over T templates is not enough, so a threshold and a penalty are used, roughly as sketched below.
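A guess at the shape of such an update rule (not the paper's actual algorithm; the window size, threshold, and decay are invented for illustration):

    from collections import deque

    def reliable(scores, threshold=0.5, decay=0.9):
        """Penalized mean: older confidences are geometrically discounted."""
        weighted = [s * decay**i for i, s in enumerate(reversed(scores))]
        return sum(weighted) / len(weighted) >= threshold

    templates = deque(maxlen=5)        # bag of recent templates
    recent_scores = deque(maxlen=10)   # sliding window of confidences

    def maybe_update(frame_patch, conf_score):
        recent_scores.append(conf_score)
        # Only accept a new template while tracking looks reliable, so
        # low-quality (possibly drifted) patches never enter the bag.
        if reliable(list(recent_scores)):
            templates.append(frame_patch)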

Deep evaluation:

Its contributions are simple and easy to reproduce.

The paper provides code and I should read it.

 

3. Multi-Template Temporal Siamese Network for Long-Term Object Tracking

The paper’s purpose:

(1) Avoiding defects of the Siamese tracker.

(2) Coping with target appearance changes and similar objects.

Contributions:

(1) Learning the path history by a bag of templates

(2) Projecting a potential future target location in a next frame.

Personal Evaluation: Competitor! Consensus of ideas. Worth reading.

Details:

Using the confidence score of the RPN head to update templates. The first mechanism exploits confidence scores and weights to compute the similarity.

The second mechanism predicts a possible bbox, but it is only mentioned in passing, although it should be the most important part.

The second mechanism exploits the latent bbox and the similarity to compute a reliability score.

Deep evaluation: normal paper, but it provides code, so I can read the code to find the bbox-prediction part.

 

 

4. SiamX: An Efficient Long-term Tracker Using Cross-level Feature Correlation and Adaptive Tracking Scheme

The paper’s purpose:

(1) Improving the Siamese tracker.

(2) Large variance, presence of distractors, fast motion, target disappearance, and the like.

Contributions:

(1) Exploiting cross-level Siamese features to learn robust correlations between the target template and search regions.

(2) Proposing inference strategies to prevent tracking loss and realize fast target re-localization.

Personal Evaluation: Normal. I want to know how to re-locate. Worth reading.

Details: The key contribution is to compute the probability distribution of the target and maintain a heat-map. The heat-map is only used to judge whether the target is lost. I don't know how it searches the most probable region.

Deep evaluation: no code. The key contribution seems not reliable.

 

5. ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-term Tracking

The paper’s purpose:

(1) Traditional long-term trackers are limited.

(2) To find the missing target object and accelerate global search.

Contributions:

(1) Determining whether the tracked object is present or absent, and then choosing local search or global search accordingly in the next frame.

(2) To speed up the image-wide global search, a novel skimming module is designed to efficiently choose the most possible regions from a large number of sliding windows.

Personal Evaluation: The global search  is useful. Worth reading.

Details: The key contribution is the skimming global search.

Another contribution is judging whether the target is lost.

Its approach exploits the similarity between the template and candidate bboxes to judge whether the target is lost.

Deep evaluation: the judging mechanism is rough but efficient. It provides code. I want to check the skimming global search.

Weekly Report (SUN YUYA)

I am reading some novel papers about long-term object tracking from 2023.

Compared to normal object tracking, long-term object tracking must have the ability to re-find a missing target object. The papers are as follows:

1. MFT: Long-Term Tracking of Every Pixel

The paper’s purpose:

Solving challenges such as dense, pixel-level, long-term tracking.

Contributions:

(1) The approach exploits optical flows estimated not only between consecutive frames, but also for pairs of frames at logarithmically spaced intervals.

(2) It then selects the most reliable sequence of flows on the basis of estimates of its geometric accuracy and the probability of occlusion, both provided by a pre-trained CNN.

Personal Evaluation:

The paper explores a new approach, which is worth reading.

2. Combining complementary trackers for enhanced long-term visual object tracking

The paper’s purpose:

Combining the capabilities of baseline trackers in the context of long-term visual object tracking.

Contributions:

(1) Proposing a strategy that perceives, via an online-learned deep verification model, whether the two trackers are following the target object, selects the better-performing tracker, and corrects their performance when they fail.

Personal Evaluation:

A long-term tracker usually has the ability to switch between a local tracker and a global tracker. This paper may propose a novel way to do so.

3. Target-Aware Tracking with Long-term Context Attention

The paper’s purpose:

(1) Unlike Siamese trackers, exploiting contextual information.

(2) Coping with large appearance changes, rapid target movement, and attraction from similar objects.

Contributions:

(1) Proposing a LCA (transformer ) which uses the target state from the previous frame to exclude the interference of similar objects and complex backgrounds.

Personal Evaluation:

A normal tracker with new transformer.

4. Multi-Template Temporal Siamese Network for Long-Term Object Tracking

The paper’s purpose:

(1) Avoiding defects of the Siamese tracker.

(2) Coping with target appearance changes and similar objects.

Contributions:

(1) Learning the path history by a bag of templates

(2) Projecting a potential future target location in a next frame.

Personal Evaluation: Competitor! Consensus of ideas. Worth reading.

5. SiamX: An Efficient Long-term Tracker Using Cross-level Feature Correlation and Adaptive Tracking Scheme

The paper’s purpose:

(1) Improving the Siamese tracker.

(2) Large variance, presence of distractors, fast motion, target disappearance, and the like.

Contributions:

(1) Exploiting cross-level Siamese features to learn robust correlations between the target template and search regions.

(2) Proposing inference strategies to prevent tracking loss and realize fast target re-localization.

Personal Evaluation: Normal. I want to know how to re-locate. Worth reading.

6. ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-term Tracking

The paper’s purpose:

(1) Traditional long-term trackers are limited.

(2) To find the missing target object and accelerate global search.

Contributions:

(1) Determining whether the tracked object is present or absent, and then choosing local search or global search accordingly in the next frame.

(2) To speed up the image-wide global search, a novel skimming module is designed to efficiently choose the most possible regions from a large number of sliding windows.

Personal Evaluation: The global search  is useful. Worth reading.