Abstract
Background: Understanding visual context in images is essential for improving Point-of-Interest (POI) recommender systems. Traditional models often rely on global image features and overlook object-level information, which can limit contextual accuracy.

Methods: This study introduces micro-level contextual tagging, a method for extracting metadata from individual objects in images, including object type, frequency, and color. This enriched information is used to train WORLDO, a Vision Transformer model designed for multi-task learning. The model performs scene classification, contextual tag prediction, and object presence detection. It is then integrated into a content-based recommender system in which the tag, object, and scene features can be enabled or disabled.

Results: The model was evaluated on its ability to classify scenes, predict tags, and detect objects within images. Ablation analysis confirmed the complementary role of tag, object, and scene features in representation learning, while benchmarking against CNN architectures showed the superior performance of the transformer-based model. Additionally, its integration with a POI recommender system demonstrated consistent performance across different feature settings. The recommender system produced relevant suggestions and remained robust even when specific components were disabled.

Conclusions: Micro-level contextual tagging enhances the representation of scene context and supports more informative recommendations. WORLDO provides a practical framework for incorporating object-level semantics into POI applications through efficient visual modeling.
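
To make the idea of object-level tags feeding a content-based recommender more concrete, the following is a minimal sketch and not the WORLDO implementation. It assumes that object detections are already available as (object type, color) pairs, builds a bag-of-objects style tag count from them, and ranks candidate POIs by cosine similarity. The function names, the tag format, the `use_objects` and `use_colors` switches, and the toy catalog are all hypothetical and chosen only for illustration; the Vision Transformer and the scene features are not reproduced here.

```python
from collections import Counter
from math import sqrt


def micro_tags(detections):
    """Build tag counts from (object_type, color) detections.

    Hypothetical tag format: 'obj:<type>' and 'color:<color>:<type>'.
    Repeated detections raise the count, which encodes frequency.
    """
    tags = Counter()
    for obj_type, color in detections:
        tags[f"obj:{obj_type}"] += 1
        tags[f"color:{color}:{obj_type}"] += 1
    return tags


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(query, catalog, use_objects=True, use_colors=True, top_k=3):
    """Rank catalog entries by similarity to the query image's tags.

    The two flags are an assumed stand-in for enabling or disabling
    feature groups in the recommender.
    """
    def select(tags):
        return Counter({
            k: v for k, v in tags.items()
            if (use_objects and k.startswith("obj:"))
            or (use_colors and k.startswith("color:"))
        })

    q = select(micro_tags(query))
    scored = [(name, cosine(q, select(micro_tags(dets))))
              for name, dets in catalog.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data, invented for this example only.
catalog = {
    "cafe": [("cup", "white"), ("cup", "white"), ("table", "brown")],
    "beach": [("umbrella", "red"), ("boat", "white")],
}
query = [("cup", "white"), ("table", "brown")]
ranking = recommend(query, catalog)
```

Calling `recommend(query, catalog, use_colors=False)` ranks the same catalog using object tags alone, which mirrors, in a very simplified way, running the recommender with one feature group switched off.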
| Original language | English |
|---|---|
| Article number | 293 |
| Journal | Big Data and Cognitive Computing |
| Volume | 9 |
| Issue number | 11 |
| Publication status | Published - Nov 2025 |
Keywords
- Artificial Intelligence
- Bag-of-Objects
- Data Science
- deep learning
- Points-of-Interest
- recommender systems
- Transformers