孙愉亚のすべての投稿

週報(SUN YUYA)

2024年3月13日孙愉亚コメントする

The long term object tracker requires the tracker to be able to retrieve lost targets. So I want to predict the possible locations where the target might appear based on the historical motion trajectory of the object. Trajectory prediction requires the camera motion and current object tracking method can’t provide camera information,such as camera pose or motion.

In order to get depth map and camera poses, I am reading papers about slam with monocula camera, involving unsupervised learning.

Future Person Localization in First-Person Videos

Purpose: predicting future locations of people observed in first-person videos.

key point :a) ego-motion b) Scales of the target person. c) KCF for tracking d) feature concatenating.

evaluation: Excellent introductory work. But how to get ego-motion information?

2. Unsupervised Learning of Depth and Ego-Motion from Video

Purpose: Presenting an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences.

Key point: a) Visual synthesis b) unsupervised learning.

Evaluation: Nice paper. But it still need camera intrinsics.

3. Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras

Purpose: Presenting a novel method for simultaneously learning depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as a supervision signal.

Key opint: a) Generating camera intrinsics.

Evaluation: Nice paper. Rrovide code. But it may be too slow.

研究進捗

週報(SUN YUYA)

2024年2月22日孙愉亚コメントする

（1）I am still learning how to compute the trajectory of object in monocula camera. The traditional object tracking task is just a simple object detection, ignoring depth and camera pose.

We can translate the task to the slam with monocula camera but we don’t know the camera intrinsics.

There are some unsupervised learning methods and I am reading these papers.

研究進捗

週報(SUN YUYA)

2024年1月31日孙愉亚コメントする

（1）In the field of object tracking, I am making datasets and designing a deep network to evaluate the similarity between template and current object appearance.

（2）I am searching the paper about trajectory prediction and camera motion evaluation

研究進捗

週報(SUN YUYA)

2024年1月17日孙愉亚コメントする

I have finished the paper of ICIAE2024.

研究進捗

週報(SUN YUYA)

2024年1月10日孙愉亚コメントする

（1）Writing the paperof ICIAE2024.

( 2 ) In the experiments about long term tracking, I found that the similarity between templates and current appearance is not reliable. Because what we need is the function that can identify whether two images are the same object, rather than the similarity distance between two images.

During tracking, the same object can have different appearance. The similarity distance between 0 and 1 are not suitable for judge.

So we should find another way in the field of image classification.

研究進捗

週報(SUN YUYA)

2023年12月27日孙愉亚コメントする

The experiements of re-detection with confidence score about long term tracking.

The confidence score is the biggest score to select the best location in each frame, which can be the basis of judge whether the target is lost. We can use the single confidence score or the sequenced confidence score to evaluate the tracking process.

Mixformer is a excellent short term trackers and Unicorn is a excellent glocal tracker. Our purpose aims to modify Mixformer to a long object trackers. So we want to add a re-detection mechanism. When the mechanism judge the tracking is failed, the running tracker will exchange to the global tracker to find the lost target object.

In this experiments, the re-detection mechanism exploit the confidence score to judge whether the target is lost.

Lasot is the benchmark of long term object tracking and average confidence score of mixformer in the lasot is 0.58 when IOU is zero.

The experiments are as follows:

lasot | Success | Precision | Norm Precision | Mixformer-base | 0.711 | 0.757 |0.740 | M_U_sc_30 | 0.713 | 0.761 |0.743 | M_U_sc_58 | 0.711 | 0.758 |0.742 | M_U_sc_sq_10_30 | 0.718 | 0.764 |0.749 | M_U_sc_sq_20_30 | 0.719 | 0.766 |0.749 | M_U_sc_sq_50_30 | 0.718 | 0.765 |0.747 | M_U_sc_sq_100_30 | 0.715 | 0.761 |0.745 | M_U_sc_sq_200_30 | 0.714 | 0.760 |0.743 | M_U_sc_psq_10_30 | 0.718 | 0.764 |0.748 | M_U_sc_psq_20_30 | 0.718 | 0.765 |0.748 | M_U_sc_sq_10_58 | 0.715 | 0.762 |0.745 | M_U_sc_sq_200_58 | 0.715 | 0.763 |0.745 |

Where M_U_sc_30 is the tracker using score under 0.3 and M_U_sc_58 is the tracker using score under 0.58. M_U_sc_sq_10_30 means that average ten confidence scores under 0.3. And Psq uses penalty.

The 0.3 may be more effective than 0.58. The best length of sequence may be 20. The penaly seems not effective. It’s too cumbersome to conduct more experiments on hyperparameters

In summary, we can see that exploiting confidence score to re-detect works but is not obvious. So in the next step, we can conduct experiemts on the similarity of templates and tracking object.

研究進捗

週報(SUN YUYA)

2023年12月13日孙愉亚コメントする

We find a lot of methods of re-detection about Long term object tracking.

1 . ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-term Tracking

The re-detection method:

After obtaining the best candidate in each frame, our tracker treats the tracked object as present or absent based on its confidence score and then determines the search state (local search or global search) in the next frame.

Key: Training a deep network(Verifier).

similarity

2 . SiamX: An Efficient Long-term Tracker Using Cross-level Feature Correlation and Adaptive Tracking Scheme

The re-detection method:

Exploiting threshold score.

3 . Combining complementary trackers for enhanced long-term visual object tracking

Running two trackers.

Training a deep network to decide when to re-detect.

4 . GUSOT: Green and Unsupervised Single Object Tracking for Long Video Sequences

Background motion estimation to predict location.

Evaluating two bboxes.

4 . High-Performance Long-Term Tracking with Meta-Updater

input: socre, response map, target object, template

model: lstm

output: score

5. MFT: Long-Term Tracking of Every Pixel

Optical flow is very important.

6. Multi-Template Temporal Siamese Network for Long-Term Object Tracking

Predicting trajectoyies.

The re-detection method:

Input: distance error

Output: Reliability Score

7. Multi-Template Temporal Siamese Network for Long-Term Object Tracking

Predicting trajectoyies.

The re-detection method:

Input: distance error

Output: Reliability Score

8. Robust Long-Term Object Tracking via Improved Discriminative Model Prediction

We augment various backgrounds that are not included in the search area to train a more robust model in the background clutter.

9. Robust Long-Term Object Tracking via Improved Discriminative Model Prediction

We augment various backgrounds that are not included in the search area to train a more robust model in the background clutter.

10. Target-Aware Tracking with Long-term Context Attention

The re-detection method:

P-mean_score：

score_list.append(out[“conf_score”])

score_num+=1

if score_num==10:

n=score_num

p_mean=0.0

for i in range(n):

for j in range(i,n):

p_mean+=(score_list[i]/(j+1))

p_mean=(p_mean/n)

score_num=0

score_list=[]

if p_mean<=0.58 :#and idx%5==0:

print(“global start ! seq_name : “,seq_name, ” “,frame)

out= global_tracker.track(image, info=None)

boxes[frame, :] = out[“target_bbox”]

local_tracker.state=boxes[frame, :]

Threshold：

11. Unifying Short and Long-Term Tracking with Graph Hierarchies

Difficult to understand.

研究進捗

週報(SUN YUYA)

2023年12月6日孙愉亚コメントする

This week I continue conducting experiments about long term tracking. Except for the experiments, I am ready for reading some important papers about long term tracking.

1. CoTracker: It is Better to Track Together

The paper’s purpose:

(1) Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance.

Contributions:

(1) In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video.

(2)It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories.

Personal Evaluation:

Release code. Very important. The innovation points are very innovative.The paper is related to optical flow and tragectory. We may use this paper to find the of motion of environment.

2. Boosting UAV Tracking With Voxel-Based Trajectory-Aware Pre-Training

The paper’s purpose:

（1）To slove the problem that the siamese tracker was trapped when facing multiple views of object in consecutive frames.

（2）The general image-level pretrained backbone can overfit to holistic representations, causing the misalignment to learn object-level properties in UAV tracking.

Contributions:

(1) Fully exploit the stereoscopic representation for UAV tracking. Specifically, a novel pre-training paradigm method is proposed.

(2) Through trajectory-aware reconstruction training (TRT), the capability of the backbone to extract stereoscopic structure feature is strengthened without any parameter increment.

Personal Evaluation:

No code. The paper is related to 3D tracking.

3. RSPT: Reconstruct Surroundings and Predict Trajectories
for Generalizable Active Object Tracking

The paper’s purpose:

(1) However, building a generalizable active tracker that works robustly across different scenarios remains a challenge, especially in unstructured environments with cluttered obstacles and diverse layouts.

Contributions:

(1) To address this challenge, we present RSPT, a framework that forms a structure-aware motion representation by Reconstructing the Surroundings and Predicting the target Trajectory.

Personal Evaluation:

No code. Don’t know how to exploit this paper.

4. Visual Prompt Multi-Modal Tracking

The paper’s purpose:

(1) Trajectory prediction.

Contributions:

(1)ARTrack tackles tracking as a coordinate sequence interpretation task that estimates object trajectories progressively, where the current estimate is induced by previous states and in turn affects subsequences.

(2)This time-autoregressive approach models the sequential evolution of trajectories to keep tracing the object across frames, making it superior to existing template matching based trackers that only consider the per-frame localization accuracy.

Personal Evaluation:

Release code. Worth reading.

5. Global Instance Tracking: Locating Target More Like Humans

The paper’s purpose:

(1)The massive gap indicates that researches only measure tracking performance rather than intelligence.

(2) Occlusion and fast motion.

Contributions:

(1) In this article, we first propose the global instance tracking (GIT) task, which is supposed to search an arbitrary user-specified instance in a video without any assumptions about camera or motion consistency, to model the human visual tracking ability.

(2) Whereafter, we construct a high-quality and large-scale benchmark VideoCube to create a challenging environment.

(3) Finally, we design a scientific evaluation procedure using human capabilities as the baseline to judge tracking intelligence.

(4) Additionally, we provide an online platform with toolkit and an updated leaderboard.

Personal Evaluation:

A new dataset benchmark. Maybe it’s very important.

6. Continuity-Aware Latent Inter frame Information Mining for Reliable UAV Tracking

The paper’s purpose:

(1) Mainly focuses on explicit information to improve tracking performance, ignoring potential interframe connections.

Contributions:

(1) A network can generate highly-effective latent frame between two adjacent frames.

(2) Fully explore continuity-aware spatial-temporal information.

Personal Evaluation:

Release code. The innovation points are very innovative.

7. DropMAE: Masked Auto-encoders with Spatial-Attention Dropout for Tracking Tasks

The paper’s purpose:

(1) MAE (masked autoencoder ) heavily relies on spatial cues while ignoring temporal relations for frame reconstruction.

Contributions:

(1) DropMAE adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.

(2) Improving the pre-training process.

Personal Evaluation:

Release code. The innovation points are very innovative. I can’t understand the paper.

8. Learning Historical Status Prompt for Accurate and Robust Visual Tracking

The paper’s purpose:

(1) However, they struggle to make prediction when the target appearance changes due to the limited historical information introduced by roughly cropping the current search region based on the predicted result of previous frame.

(2) The incapacity to integrate abundant and effective historical information.

Contributions:

(1) HIP is a plug-and-play module that make full use of search region features to introduce historical appearance information.

Personal Evaluation:

No code. How to produce mask in tracking ?

9. Lightweight Full-Convolutional Siamese Tracker

The paper’s purpose:

(1) The current tracking model is too big.

Contributions:

(1) LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to enhance the nonlinear expressiveness of the convolutional tracking pipeline.

(2) Additionally, it references successful factors of current lightweight trackers and introduces skip-connections and reuse of search area features.

Personal Evaluation:

Release code. Another fast tracker. Worthing reading.

10. LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking

The paper’s purpose:

(1) Too big and too slow.

Contributions:

The main innovations of LiteTrack encompass:

(1) Asynchronous feature extraction and interaction between the template and search region for better feature fushion and cutting redundant computation.

(2) Pruning encoder layers from a heavy tracker to refine the balance between performance and speed.

Personal Evaluation:

Release code. Another fast tracker. Worthing reading.

11. MixFormerV2: Efficient Fully Transformer Tracking

The paper’s purpose:

(1) Too slow.

Contributions:

(1) Our key design is to introduce four special prediction tokens and concatenate them with the tokens from target template and search areas.

(2)Then, we apply the unified transformer backbone on these mixed token sequence.

(3) Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads.

Personal Evaluation:

Release code. Another fast tracker. Worthing reading.

12 .Mobile Vision Transformer-based Visual Object Tracking

The paper’s purpose:

(1) Too slow and too big.

Contributions:

The main innovations of LiteTrack encompass:

(1) We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time.

(2) We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization.

Personal Evaluation:

Release code. Another fast tracker. Worthing reading.

13. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking

The paper’s purpose:

(1) It casts visual tracking as a sequence generation problem, which predicts object bounding boxes in an autoregressive fashion.

Contributions:

(1)The encoder extracts visual features with a bidirectional transformer.

(2)While the decoder generates a sequence of bounding box values autoregressively with a causal transformer.

Personal Evaluation:

Release code. Worth reading.

14. SOTVerse: A User-defined Task Space of Single Object Tracking

The paper’s purpose:

(1)The former causes existing datasets can not be exploited comprehensively, while the latter neglects challenging factors
in the evaluation process.

Contributions:

(1)We first propose a 3E Paradigm to describe tasks by three components (i.e., environment, evaluation, and executor).

(2)Then, we summarize task characteristics, clarify the organization
standards, and construct SOTVerse with 12.56 million frames. Specifically, SOTVerse automatically labels challenging factors per frame, allowing users to generate user-defined spaces efficiently via construction rules.

(3) Besides, SOTVerse provides two mechanisms with new indicators and successfully evaluates trackers under various subtasks.

Personal Evaluation:

Release code. Interesting Innovation. Worth reading.

15. Towards Grand Unification of Object Tracking

The paper’s purpose:

(1)We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.

Contributions:

(1) Unicorn provides a unified solution, adopting the same input, backbone, embedding, and head across all tracking tasks.

Personal Evaluation:

Release code. Nice paper. Worth reading.

16. Tracking through Containers and Occluders in the Wild

The paper’s purpose:

(1)Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems

Contributions:

(1) We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists.

(2) To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment.

Personal Evaluation:

Release code. Nice paper. Worth reading.

研究進捗

週報(SUN YUYA)

2023年11月29日孙愉亚コメントする

This week I continue conducting experiments about long term tracking. Except for the experiments, I still read the papers about long term object tracking.

Among these papers, there are two trackers are the practical, namely Unicorn and Mixformer2. The Unicorn is a tracker with global detection, which can locate the target object on the whole frame. The Mixformer2 are a high-speed and accurate tracker, which can be taken as local tracker. If combining the advantage of two trackers, we can get a excellent long term tracker.

The evaluation of Unicorn and Mixformer2 on the open benchmark (LaSOT, TrackingNet and GOT-10k) are as follows:

Unicorn: 10 fps.

Mixformer2: 100fps

trackingnet | AUC | OP50 | OP75 | Precision | Norm Precision |
Unicorn | 95.24 | 100.00 | 100.00 | 100.00 | 100.00 |

got10k | AUC | OP50 | OP75 | Precision | Norm Precision |
Unicorn | 82.17 | 92.17 | 82.61 | 80.01 | 91.05 |

lasot | AUC | OP50 | OP75 | Precision | Norm Precision |
Unicorn | 67.49 | 77.64 | 64.81 | 73.00 | 75.52 |

lasot | AUC | OP50 | OP75 | Precision | Norm Precision |
MixFormerDeit-small | 60.75 | 71.73 | 53.66 | 60.91 | 68.98 |

lasot | AUC | OP50 | OP75 | Precision | Norm Precision |
MixFormerDeit-big | 69.95 | 81.65 | 68.83 | 75.11 | 79.69 |

The new papers are as follows:

I change my mind that I should read more new papers instead of conducting experiments in haste. The specific analysis of papers are as follows:

1. Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking

The paper’s purpose:

To slove the problem that traditional CNN is too slow.

Contributions:

(1) The paper proposes a framework, where feature learning and template-search coupling are integrated into an efficient one-stream ViT to avoid an extra heavy relation modeling module.

(2)The proposed Aba-ViT exploits an adaptive and background-aware token computation method to reduce inference time.

(3)This approach adaptively discards tokens based on learned halting probabilities, which a priori are higher for background tokens than target ones.

(4) Very Fast ! 180 fps!

Personal Evaluation:

Too fast! And the paper provide the code. We can let it act as main tracker.

2. Boosting UAV Tracking With Voxel-Based Trajectory-Aware Pre-Training

The paper’s purpose:

（1）To slove the problem that the siamese tracker was trapped when facing multiple views of object in consecutive frames.

（2）The general image-level pretrained backbone can overfit to holistic representations, causing the misalignment to learn object-level properties in UAV tracking.

Contributions:

(1) Fully exploit the stereoscopic representation for UAV tracking. Specifically, a novel pre-training paradigm method is proposed.

(2) Through trajectory-aware reconstruction training (TRT), the capability of the backbone to extract stereoscopic structure feature is strengthened without any parameter increment.

Personal Evaluation:

No code. The paper is related to 3D tracking.

3. Compact Transformer Tracker with Correlative Masked Modeling

The paper’s purpose:

(1) Proving that the traditional selfattention structure is sufficient for information aggregation, and structural adaption is unnecessary.

Contributions:

(1) The paper attaches a lightweight correlative masked decoder which reconstructs the original template and search image from the corresponding masked tokens.

(2) The structure is very simple.

Personal Evaluation:

Release code. The evaluation on benchmark is very high. Nice paper. But the analysis of network is beyond my ability.

4. Continuity-Aware Latent Inter frame Information Mining for Reliable UAV Tracking

The paper’s purpose:

(1) Mainly focuses on explicit information to improve tracking performance, ignoring potential interframe connections.

Contributions:

(1) A network can generate highly-effective latent frame between two adjacent frames.

(2) Fully explore continuity-aware spatial-temporal information.

Personal Evaluation:

Release code. The innovation points are very innovative.

5. CiteTracker: Correlating Image and Text for Visual Tracking

The paper’s purpose:

(1) track targets with drastic variations.

Contributions:

(1) Developing a text generation module to convert the target image patch into a descriptive text containing its class and attribute information, providing a comprehensive reference point for the target.

(2) A dynamic description module is designed to adapt to target variations for more effective target representation.

Personal Evaluation:

Release a half code. The innovation points are very innovative.But I don’t know if it works.

6. CoTracker: It is Better to Track Together

The paper’s purpose:

(1) Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance.

Contributions:

(1) In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video.

Personal Evaluation:

Release code. Very important. The innovation points are very innovative.The paper is related to optical flow and tragectory. We may use this paper to find the of motion of environment.

7. DropMAE: Masked Auto-encoders with Spatial-Attention Dropout for Tracking Tasks

The paper’s purpose:

(1) MAE (masked autoencoder ) heavily relies on spatial cues while ignoring temporal relations for frame reconstruction.

Contributions:

(1) DropMAE adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.

(2) Improving the pre-training process.

Personal Evaluation:

Release code. The innovation points are very innovative. I can’t understand the paper.

9. Efficient Training for Visual Tracking with Deformable Transformer

The paper’s purpose:

(1) In order to solve the problems that resource-intensive, requiring prolonged GPU training hours and incurring high GFLOPs during inference, the paper want to improve inefficient training methods and convolution based target heads.

Contributions:

(1)Our framework utilizes an efficient encoder-decoder structure where the deformable transformer decoder acting as a target head, achieves higher sparsity than traditional convolution heads, resulting in decreased GFLOPs.

(2)For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique, significantly accelerating model’s convergence.

Personal Evaluation:

No code.

9. Efficient Visual Tracking with Exemplar Transformers

The paper’s purpose:

(1) However, in the pursuit of increased tracking performance, runtime is often hindered.
(2) Furthermore, efficient tracking architectures have received
surprisingly little attention.

Contributions:

(1) E.T.Track, our visual tracker that incorporates Exemplar Transformer modules, runs at 47 FPS on a CPU.

(2)In this paper, we introduce the Exemplar Transformer, a transformer module utilizing a single instance level attention layer for realtime visual object tracking.

Personal Evaluation:

Release code. Very fast and the architecture is very simple. But the evaluation seems not very good.

11. Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking

The paper’s purpose:

(1) Throw the backbone away.

Contributions:

(1) bridge the gap between single-branch network and discriminative models.

(2) bolsters the tracking robustness in the presence of distractors via bidirectional cycle tracking verification.

Personal Evaluation:

No code. Old innovation point.

10. Exploiting Image-Related Inductive Biases in Single-Branch Visual Tracking

The paper’s purpose:

(1) However, existing trackers are hampered by low speed, limiting their applicability on devices with limited computational power

Contributions:

(1) The central idea of HiT is the Bridge Module, which bridges the gap between modern lightweight transformers and the tracking framework.

(2)We also propose a novel dual-image position encoding technique that simultaneously encodes the position information of both the search region and template images.

Personal Evaluation:

Release a half code. Old innovation point.

61 fps is not sufficient.

11. Generalized Relation Modeling for Transformer Tracking

The paper’s purpose:

(1) Existing onestream trackers always let the template interact with all parts inside the search region throughout all the encoder layers.

Contributions:

(1) Enabling more flexible relation modeling by selecting appropriate search tokens to interact with template token.

(2)An attention masking strategy and the Gumbel-Softmax technique are introduced to facilitate the parallel computation and end-to-end learning of the token division module.

Personal Evaluation:

Release code. The innovation point is about structure of network.

12. Global Instance Tracking: Locating Target More Like Humans

The paper’s purpose:

(1)The massive gap indicates that researches only measure tracking performance rather than intelligence.

(2) Occlusion and fast motion.

Contributions:

(2) Whereafter, we construct a high-quality and large-scale benchmark VideoCube to create a challenging environment.

(3) Finally, we design a scientific evaluation procedure using human capabilities as the baseline to judge tracking intelligence.

(4) Additionally, we provide an online platform with toolkit and an updated leaderboard.

Personal Evaluation:

A new dataset benchmark. Maybe it’s very important.

13. Learning Historical Status Prompt for Accurate and Robust Visual Tracking

The paper’s purpose:

(2) The incapacity to integrate abundant and effective historical information.

Contributions:

(1) HIP is a plug-and-play module that make full use of search region features to introduce historical appearance information.

Personal Evaluation:

No code. How to produce mask in tracking ?

14.Leveraging the Power of Data Augmentation for Transformer-based Tracking

The paper’s purpose:

(1) Previous works focus on designing effective architectures suited
for tracking, but ignore that data augmentation is equally crucial for training a well-performing model.

Contributions:

(1) We optimize existing random cropping via a dynamic search radius mechanism and simulation for boundary samples.

(2) We propose a token-level feature mixing augmentation strategy, which enables the model against challenges like background interference.

Personal Evaluation:

No code. It’s diffcult to add a new tracking toolkit.

15.Lightweight Full-Convolutional Siamese Tracker

The paper’s purpose:

(1) The current tracking model is too big.

Contributions:

(1) LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to enhance the nonlinear expressiveness of the convolutional tracking pipeline.

(2) Additionally, it references successful factors of current lightweight trackers and introduces skip-connections and reuse of search area features.

Personal Evaluation:

Release code. Another fast tracker. Worthing reading.

16.LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking

The paper’s purpose:

(1) Too big and too slow.

Contributions:

The main innovations of LiteTrack encompass:

(1) Asynchronous feature extraction and interaction between the template and search region for better feature fushion and cutting redundant computation.

(2) Pruning encoder layers from a heavy tracker to refine the balance between performance and speed.

Personal Evaluation:

Release code. Another fast tracker. Worthing reading.

17.MixFormerV2: Efficient Fully Transformer Tracking

The paper’s purpose:

(1) Too slow.

Contributions:

(1) Our key design is to introduce four special prediction tokens and concatenate them with the tokens from target template and search areas.

(2)Then, we apply the unified transformer backbone on these mixed token sequence.

(3) Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads.

Personal Evaluation:

Release code. Another fast tracker. Worthing reading.

18.Mobile Vision Transformer-based Visual Object Tracking

The paper’s purpose:

(1) Too slow and too big.

Contributions:

The main innovations of LiteTrack encompass:

(1) We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time.

(2) We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization.

Personal Evaluation:

Release code. Another fast tracker. Worthing reading.

19.OmniTracker: Unifying Object Tracking by Tracking-with-Detection

The paper’s purpose:

(1) Combing SOT and MOT.

Contributions:

(1) We propose a novel tracking-with detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for association.

(2)Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and
inference pipeline.

Personal Evaluation:

No code. Useless.

20.Robust Object Modeling for Visual Tracking

The paper’s purpose:

(1) However, separate template learning lacks communication between the template and search regions, which brings difficulty in extracting discriminative target-oriented features.

(2)On the other hand, interactive template learning produces hybrid template features, which may introduce potential distractors to the template via the cluttered search regions

Contributions:

（1）To enjoy the merits of both methods, we propose a robust object modeling framework for visual tracking (ROMTrack), which simultaneously models the inherent template and the hybrid template features.

（2）To further enhance robustness, we present novel variation tokens to depict the ever-changing appearance of target objects.

Personal Evaluation:

Release code. There is too much nonsense in this paper. Normal paper.

21.RSPT: Reconstruct Surroundings and Predict Trajectories
for Generalizable Active Object Tracking

The paper’s purpose:

Contributions:

(1) To address this challenge, we present RSPT, a framework that forms a structure-aware motion representation by Reconstructing the Surroundings and Predicting the target Trajectory.

Personal Evaluation:

No code. Don’t know how to exploit this paper.

22.Segment and Track Anything

The paper’s purpose:

(1) To precisely and effectively segment and track any object in a video.

Contributions:

(1)Additionally, SAM-Track employs multimodal interaction methods that enable users to select multiple objects in videos for tracking, corresponding to their specific requirements.

(2) As a result, SAM-Track can be used across an array of fields, ranging from drone technology, autonomous driving, medical imaging, augmented reality, to biological analysis.

Personal Evaluation:

Release code. I don’t need the tracker with too much function.

23.Separable Self and Mixed Attention Transformers for Efficient Object Tracking

The paper’s purpose:

(1) However, the transformer-based models are underutilized for Siamese lightweight tracking due to the computational complexity of their attention blocks.

Contributions:

(1)This paper proposes an efficient self and mixed attention transformerbased architecture for lightweight tracking.

(2)Our prediction head performs global contextual modeling of the encoded features by leveraging efficient self-attention blocks
for robust target state estimation.

Personal Evaluation:

Release code. Normal paper.

24. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking

The paper’s purpose:

(1) It casts visual tracking as a sequence generation problem, which predicts object bounding boxes in an autoregressive fashion.

Contributions:

(1)The encoder extracts visual features with a bidirectional transformer.

(2)While the decoder generates a sequence of bounding box values autoregressively with a causal transformer.

Personal Evaluation:

Release code. Worth reading.

25.SiamTHN: Siamese Target Highlight Network for Visual Tracking

The paper’s purpose:

(1)The majority of siamese network based trackers now in use treat each channel in the feature maps generated by the backbone network equally, making the similarity response map sensitive to background influence and hence challenging to focus on the target region.

(2) Additionally, there are no structural links between the classification and regression branches in these trackers, and the two branches are optimized separately during training.

(3)Therefore, there is a misalignment between the classification and regression branches, which results in less accurate tracking results.

Contributions:

(1) In this paper, a Target Highlight Module is proposed to help the generated similarity response maps to be more focused on the target region.
(2) To reduce the misalignment and produce more precise tracking
results, we propose a corrective loss to train the model.

Personal Evaluation:

No code. Normal paper.

26.SOTVerse: A User-defined Task Space of Single Object Tracking

The paper’s purpose:

(1)The former causes existing datasets can not be exploited comprehensively, while the latter neglects challenging factors
in the evaluation process.

Contributions:

(1)We first propose a 3E Paradigm to describe tasks by three components (i.e., environment, evaluation, and executor).

(3) Besides, SOTVerse provides two mechanisms with new indicators and successfully evaluates trackers under various subtasks.

Personal Evaluation:

Release code. Interesting Innovation. Worth reading.

27.Towards Efficient Training with Negative Samples in Visual Tracking

The paper’s purpose:

(1)Current state-of-the-art (SOTA) methods in visual object tracking often require extensive computational resources and vast amounts of training data, leading to a risk of overfitting.

Contributions:

(1) To handle the negative samples effectively, we adopt a distribution-based head, which modeling the bounding box as distribution of distances to express uncertainty about the target’s location in the presence of negative samples, offering an efficient way to manage the mixed sample training.

(2) Our approach introduces a target-indicating token，encapsulating the target’s precise location within the template image.

Personal Evaluation:

No code. Normal paper.

28. Towards Grand Unification of Object Tracking

The paper’s purpose:

(1)We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.

Contributions:

(1) Unicorn provides a unified solution, adopting the same input, backbone, embedding, and head across all tracking tasks.

Personal Evaluation:

Release code. Nice paper. Worth reading.

29. Tracking through Containers and Occluders in the Wild

The paper’s purpose:

(1)Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems

Contributions:

(1) We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists.

Personal Evaluation:

Release code. Nice paper. Worth reading.

30. SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking

The paper’s purpose:

(1)However, the dynamic changes in flight maneuver and viewpoint encountered in UAV tracking pose significant difficulties, e.g., aspect ratio change, and scale variation.

(2) The conventional cross-correlation operation, while commonly used, has limitations in effectively capturing perceptual similarity and incorporates extraneous background information.

Contributions:

(1) The proposed method designs a new task-specific object saliency
mining network to refine the cross-correlation operation and
effectively discriminate foreground and background information.

(2) Additionally, a saliency adaptation embedding operation dynamically generates tokens based on initial saliency, thereby
reducing the computational complexity of the Transformer architecture.

(3) Finally, a lightweight saliency filtering Transformer further refines saliency information and increases the focus on appearance information.

Personal Evaluation:

Release code. Normal paper.

31. Transparent Object Tracking with Enhanced Fusion Module

The paper’s purpose:

(1)Accurate tracking of transparent objects, such as glasses, plays a critical role in many robotic tasks such as robot-assisted living.

(2) Due to the adaptive and often reflective texture of such objects, traditional tracking algorithms that rely on general-purpose learned features suffer from reduced performance.

(3) For example, many of the current days’ transformer-based trackers are fully pre-trained and are sensitive to any latent space perturbations.

Contributions:

(1)Our proposed fusion module, composed of a transformer encoder and an MLP module, leverages key query-based transformations to
embed the transparency information into the tracking pipeline.

(2)We also present a new two-step training strategy for our fusion
module to effectively merge transparency features.

Personal Evaluation:

Release a half code. Normal paper.

32. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

The paper’s purpose:

(1) Despite remarkable progress in image and video recognition via representation learning, current research still focuses on designing specialized networks for singular, homogeneous, or simple combination of tasks.

Contributions:

(1)We design a Curriculum training, Pseudo-labeling, and Fine-tuning (CPF) scheme to successfully train VTDNet on all tasks and
mitigate performance loss.

Personal Evaluation:

Release a half code. Normal paper.

33. Visual Prompt Multi-Modal Tracking

The paper’s purpose:

(1) Trajectory prediction.

Contributions:

Personal Evaluation:

Release code. Worth reading.

34. VideoTrack: Learning to Track Objects via Video Transformer

The paper’s purpose:

(1) Existing Siamese tracking methods, which are built on pair-wise matching between two single frames, heavily rely on additional sophisticated mechanism to exploit temporal information among successive video frames, hindering them from efficiency and industrial deployments.

Contributions:

(1)Specifically, we adapt the standard video transformer architecture to visual tracking by enabling spatiotemporal feature learning directly from frame-level patch sequences.

(2) To better adapt to the tracking task, we carefully blend the spatiotemporal information in the video clips through sequential multi-branch triplet blocks, which formulates a video transformer backbone.

(3) Our experimental study compares different model variants, such as tokenization strategies, hierarchical structures, and video attention schemes.

(4) Then, we propose a disentangled dual-template mechanism that decouples static and dynamic appearance clues over time, and reduces temporal redundancy in video frames.

Personal Evaluation:

No code. Could not understand the paper

35. Foreground-Background Distribution Modeling Transformer for Visual Object Tracking

The paper’s purpose:

(1) However, the feature learning of these Transformer-based trackers is easily disturbed by complex backgrounds.

Contributions:

(1)To address the above limitations, we propose a novel foreground-background distribution modeling transformer for visual object tracking (FBDMTrack), including a fore-background agent learning
(FBAL) module and a distribution-aware attention (DA2) module in a unified transformer architecture.

(2)The proposed F-BDMTrack enjoys several merits. First, the proposed FBAL module can effectively mine fore-background information with designed fore-background agents.

(3) Second, the DA2 module can suppress the incorrect interaction between foreground and background by modeling fore-background
distribution similarities.

(4) Finally, F-BDMTrack can extract discriminative features under ever-changing tracking scenarios for more accurate target state estimation.

Personal Evaluation:

No code. Could not understand the paper.

36. Representation Learning for Visual Object Tracking by Masked Appearance Transfer

The paper’s purpose:

(1) However, few works study the trackingspecified representation learning method. Most trackers directly use ImageNet pre-trained representations.

Contributions:

(1) However, for the template, we make the decoder reconstruct the target appearance within the search region.

(2) We randomly mask out the inputs, thereby making the learned representations more discriminative.

Personal Evaluation:

Release code. Complex code.

37. ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking

The paper’s purpose:

In this paper, we demonstrate that it is possible to narrow or even close this gap while achieving high tracking speed based on the smaller input size.

Contributions:

(1) To this end, we non-uniformly resize the cropped image to have a smaller input size while the resolution of the area where the target is more likely to appear is higher and vice versa.

(2)This enables us to solve the dilemma of attending to a larger visual field while retaining more raw information for the target despite a smaller input size.

(3) Our formulation for the nonuniform resizing can be efficiently solved through quadratic programming (QP) and naturally integrated into most of the crop-based local trackers.

Personal Evaluation:

Release a half code. Normal paper.

研究進捗

週報(SUN YUYA)

2023年11月15日孙愉亚コメントする

This week I start conducting experiments about long term tracking. Except for the experiments, I still read some papers about long term object tracking.

The experiments are as follows: On the basis of a simple siamese tracker(SiamFC++), I modified it to a long term tracker，which needs a mechanism to judge whether the target object is absent. The most concise way is to exploit the confidence score. So I want to find a relationship between the score and IOU. The specific results are shown in the figure.

From the picture, we can see that the fluctuation trend of score is basically consistent with that of iou. But only relying on the score is not enough. So we may conduct experiments on the sequence of scores.

The new papers are as follows:

I change my mind that I should read more new papers instead of conducting experiments in haste. The specific analysis of papers are as follows:

1. Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking

The paper’s purpose:

To slove the problem that traditional CNN is too slow.

Contributions:

(1) The paper proposes a framework, where feature learning and template-search coupling are integrated into an efficient one-stream ViT to avoid an extra heavy relation modeling module.

(2)The proposed Aba-ViT exploits an adaptive and background-aware token computation method to reduce inference time.

(3)This approach adaptively discards tokens based on learned halting probabilities, which a priori are higher for background tokens than target ones.

(4) Very Fast ! 180 fps!

Personal Evaluation:

Too fast! And the paper provide the code. We can let it act as main tracker.

2. Boosting UAV Tracking With Voxel-Based Trajectory-Aware Pre-Training

The paper’s purpose:

（1）To slove the problem that the siamese tracker was trapped when facing multiple views of object in consecutive frames.

（2）The general image-level pretrained backbone can overfit to holistic representations, causing the misalignment to learn object-level properties in UAV tracking.

Contributions:

(1) Fully exploit the stereoscopic representation for UAV tracking. Specifically, a novel pre-training paradigm method is proposed.

(2) Through trajectory-aware reconstruction training (TRT), the capability of the backbone to extract stereoscopic structure feature is strengthened without any parameter increment.

Personal Evaluation:

No code. The paper is related to 3D tracking.

3. Compact Transformer Tracker with Correlative Masked Modeling

The paper’s purpose:

(1) Proving that the traditional selfattention structure is sufficient for information aggregation, and structural adaption is unnecessary.

Contributions:

(1) The paper attaches a lightweight correlative masked decoder which reconstructs the original template and search image from the corresponding masked tokens.

(2) The structure is very simple.

Personal Evaluation:

Release code. The evaluation on benchmark is very high. Nice paper. But the analysis of network is beyond my ability.

4. Continuity-Aware Latent Interframe Information Mining for Reliable UAV Tracking

The paper’s purpose:

(1) Mainly focuses on explicit information to improve tracking performance, ignoring potential interframe connections.

Contributions:

(1) A network can generate highly-effective latent frame between two adjacent frames.

(2) Fully explore continuity-aware spatial-temporal information.

Personal Evaluation:

Release code. The innovation points are very innovative.

九州工業大学　芹川・張・山脇・楊研究室

孙愉亚のすべての投稿

週報(SUN YUYA)

週報(SUN YUYA)

週報(SUN YUYA)

週報(SUN YUYA)

週報(SUN YUYA)

週報(SUN YUYA)

週報(SUN YUYA)

週報(SUN YUYA)

週報(SUN YUYA)

週報(SUN YUYA)

Stay Hungry, Stay Foolish!

2024年7月
月	火	水	木	金	土	日
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31