Intellectual Technologies on Transport

Интеллектуальные технологии на транспорте

2413-2527

108903

10.20295/2413-2527-2026-145-16-22

pfzlam

ИСКУССТВЕННЫЙ ИНТЕЛЛЕКТ И ТРАНСПОРТНЫЕ СИСТЕМЫ

ARTIFICIAL INTELLIGENCE AND TRANSPORT SYSTEMS

Modern Multi-Agent Systems for Data Scraping

Современные многоагентные системы для скрапинга данных

Блюм

Владислав Станиславович

Blyum

Vladislav Stanislavovich

vladblum7@gmail.com

кандидат технических наук;

candidate of technical sciences;

Лапшин

Андрей Евгеньевич

Lapshin

Andrey Evgenyevich

andreyka.lapshin.2002@mail.ru

Институт технологий предпринимательства и права, Санкт-Петербургский государственный университет аэрокосмического приборостроения, Санкт-Петербург, Россия

Institute of Entrepreneurship Technologies and Law, Saint Petersburg State University of Aerospace Instrumentation, Saint Petersburg, Russian Federation

25.03.2026

Issue 1, pp. 16–22; 29.11.2025; 09.02.2026

https://pgups.editorum.ru/en/nauka/article/108903/view

Рассматриваются архитектура многоагентных систем (МАС), свойства агентов, особенности коммуникации и применимость данного подхода к задачам веб-скрапинга. Актуальность исследования определяется стремительным ростом объемов данных в сети Интернет и ограниченностью классических централизованных систем веб-скрапинга, сталкивающихся с проблемами масштабирования, блокировок и недостаточной устойчивости. В этих условиях возрастает потребность в использовании децентрализованных архитектур, способных адаптироваться к динамичной среде и эффективно собирать большие объемы информации. Одним из наиболее перспективных подходов являются многоагентные системы, обеспечивающие распределенный сбор, обработку и хранение данных. Цель: разработка и структурирование подхода к использованию многоагентных систем для веб-скрапинга, а также описание обобщенного алгоритма, обеспечивающего масштабируемый, отказоустойчивый и адаптивный сбор данных. Методы: теоретический анализ свойств многоагентных систем, архитектурных моделей и коммуникационных механизмов между агентами; изучение существующих практических решений распределенного краулинга; синтез обобщенного алгоритма на основе выделения типовых ролей агентов (планировщик, сборщик, парсер, обработчик данных, агент обхода защиты). Результаты: описана трехуровневая архитектура МАС, включающая уровни сбора, обработки/координации и хранения данных. Выделены ключевые свойства агентов и показаны их роли в задаче скрапинга. Представлены функции пяти типов агентов, применяемых в распределенном веб-скрапинге, и предложена схема взаимодействия между ними. На основе анализа существующих решений сформирован обобщенный алгоритм распределенного скрапинга, отражающий взаимодействие специализированных агентов, который включает этапы инициализации, распределения задач, загрузки страниц, обработки ошибок блокировки, парсинга контента и сохранения данных. 
Показано, что многоагентный подход обеспечивает параллелизм, масштабируемость, отказоустойчивость и гибкость при работе с веб-ресурсами. Практическая значимость: результаты исследования могут быть использованы при проектировании систем массового сбора данных, построении распределенных веб-краулеров и создании платформ анализа информации на основе МАС. Обобщенный алгоритм может служить основой для реализации гибких и масштабируемых систем, способных эффективно функционировать в условиях больших объемов данных, динамических изменений веб-страниц и наличия защитных механизмов. Обсуждение: в статье описывается интеграция свойств и принципов многоагентных систем в контекст веб-скрапинга с формированием единой обобщенной модели взаимодействия агентов. Представленный алгоритм отражает практическую структуру функционирования распределенного краулера и демонстрирует, как различные типы агентов могут обеспечивать координацию, сбор, анализ и фильтрацию данных при работе с динамичными и защищенными веб-ресурсами. Подчеркнута значимость децентрализации и адаптивности для современного веб-скрапинга, включая работу в условиях ограничений, связанных с антибот-защитами.

This paper examines the architecture of multi-agent systems (MAS), the properties of agents, inter-agent communication, and the applicability of the approach to web scraping. The study is motivated by the rapid growth of data volumes on the Internet and by the limitations of traditional centralized web scraping systems, which face problems of scalability, blocking, and insufficient robustness. This creates a demand for decentralized architectures that can adapt to a dynamic environment and collect large volumes of information efficiently. Multi-agent systems, which provide distributed collection, processing, and storage of data, are among the most promising approaches. Purpose: to develop and structure an approach to using multi-agent systems for web scraping, and to describe a generalized algorithm that provides scalable, fault-tolerant, and adaptive data collection. Methods: theoretical analysis of MAS properties, architectural models, and inter-agent communication mechanisms; a review of existing practical solutions for distributed crawling; and synthesis of a generalized algorithm based on typical agent roles (scheduler, collector, parser, data processor, and protection bypass agent). Results: a three-tier MAS architecture is described, comprising data collection, processing/coordination, and storage levels. The key properties of agents are identified and their roles in the scraping task are shown. The functions of five agent types used in distributed web scraping are presented, and a scheme of interaction between them is proposed. Based on an analysis of existing solutions, a generalized algorithm for distributed scraping is formulated that reflects the interaction of the specialized agents and includes the stages of initialization, task distribution, page loading, handling of blocking errors, content parsing, and data storage.
The multi-agent approach is shown to provide parallelism, scalability, fault tolerance, and flexibility when working with web resources. Practical significance: the results can be used in the design of mass data collection systems, the construction of distributed web crawlers, and the creation of MAS-based information analysis platforms. The generalized algorithm can serve as a basis for flexible and scalable systems that operate effectively with large data volumes, dynamically changing web pages, and protective mechanisms. Discussion: the article integrates the properties and principles of multi-agent systems into the context of web scraping and forms a single generalized model of agent interaction. The presented algorithm reflects the practical structure of a distributed crawler and demonstrates how different types of agents can coordinate, collect, analyze, and filter data when working with dynamic and protected web resources. The importance of decentralization and adaptability for modern web scraping is emphasized, including operation under the constraints imposed by anti-bot protection.
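
The stages listed in the abstract (initialization, task distribution, page loading, handling of blocking errors, content parsing, and data storage) can be illustrated with a minimal single-process sketch in Python. This is not the authors' implementation: the names `FAKE_WEB`, `PROTECTED`, `fetch`, `bypass`, `parse`, `process`, and `run` are hypothetical, the "web" is an in-memory dictionary instead of real HTTP requests, and the agents run sequentially from one queue, whereas a real MAS would run them in parallel and exchange messages.

```python
from collections import deque
from html.parser import HTMLParser


class BlockedError(Exception):
    """Raised by the hypothetical collector when a site refuses the request."""


# Hypothetical in-memory "web" (URL -> HTML); stands in for real HTTP requests.
FAKE_WEB = {
    "http://example.test/a": "<html><title>Page A</title></html>",
    "http://example.test/b": "<html><title>Page B</title></html>",
}
# Assumption for illustration: these pages block the "default" identity.
PROTECTED = {"http://example.test/b"}


def fetch(url, identity):
    """Collector agent: load a page, or fail if the site blocks this identity."""
    if url in PROTECTED and identity == "default":
        raise BlockedError(url)
    return FAKE_WEB[url]


def bypass(url):
    """Protection bypass agent: here it only selects another identity label."""
    return "alternate"


class _TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse(html):
    """Parser agent: extract structured fields from raw HTML."""
    parser = _TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip()}


def process(record):
    """Data processor agent: normalize the extracted fields."""
    return {**record, "title": record["title"].lower()}


def run(seed_urls, max_retries=2):
    """Scheduler agent: distribute tasks and route results between agents."""
    # Initialization: every task is (url, identity, attempts so far).
    queue = deque((url, "default", 0) for url in seed_urls)
    storage, failed = {}, []
    while queue:
        url, identity, attempts = queue.popleft()      # task distribution
        try:
            html = fetch(url, identity)                # page loading
        except BlockedError:                           # blocking error handling
            if attempts < max_retries:
                queue.append((url, bypass(url), attempts + 1))
            else:
                failed.append(url)
            continue
        storage[url] = {"url": url, **process(parse(html))}  # parsing, storage
    return storage, failed
```

Calling `run(list(FAKE_WEB))` processes both pages: the protected one is blocked on the first attempt, requeued with the identity chosen by `bypass`, and stored on the second attempt. With `max_retries=0` the blocked page is reported in the `failed` list instead.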

многоагентные системы скрапинг масштабирование проактивность автономность

multi-agent systems scraping scaling proactivity autonomy
