Open Education

Development of an Intelligent System for Processing Semistructured Data: Industry Structuring and Advanced Analysis of Information Extracted from Comments to Video Clips in Social Networks

https://doi.org/10.21686/1818-4243-2025-2-55-70

Abstract

Scientific relevance of the study. In an era of rapidly increasing volumes of data generated by social media users, analyzing textual data such as comments is becoming one of the key challenges of modern science. Comments are a valuable source of information: they make it possible to identify public sentiment, analyze users' opinions, and track social trends. However, because these data are semistructured or completely unstructured, processing them requires innovative approaches.

Purpose of research. The aim of this research is to develop an intelligent system for processing semistructured data from comments on social media videos using structuring algorithms that target different industries. The research aims to create an efficient method to analyze tone, cluster comments, and extract key themes in order to evaluate the impact of video content on the audience. It proposes an approach to automatically extract and structure data by industry, which allows for a more accurate and in-depth analysis of how content is perceived and how it affects different social and professional domains.

Methods. Developing an intelligent system for analyzing semistructured data requires methods that combine natural language processing (NLP), machine learning algorithms, and big data analytics techniques. These methods include: automatic data extraction via API; preprocessing adapted for three languages (French, English, and Russian); deep sentiment analysis using the BERT model together with a probabilistic algorithm for statistical calculations; and clustering using the K-Means, DBSCAN, and Agglomerative algorithms. The materials are comments from social networks (TikTok, Instagram, Twitter, Facebook, YouTube, Reddit, VKontakte) in Russian, English, and French. The spaCy and NLTK libraries were used for preprocessing, and the Hugging Face Transformers library was used to apply pre-trained models for sentiment analysis. Data were structured using topic modeling and language models implemented with Python libraries.

Results of the study. The intelligent system for processing semistructured data improved the analysis of comments on social network videos by combining several machine learning models and algorithms. The study produced a prototype of a comment analysis tool that collects and structures data from various social networks. Structuring the data led to better organization and accessibility of the information, making it easier to use. Using NLP methods, we identified key themes and emotions in the comments, and sentiment analysis highlighted the major emotional trends. Clustering methods such as K-Means grouped the comments by similar themes. We also created visualizations of the sentiment distribution that allow users to interpret the data quickly. These visualizations turn complex analytical results into intuitive graphs, making it easier to understand how users interact with the content. The system thus provides insights that can be used to optimize audience interaction strategies.

Conclusion. The results of the study showed that the proposed approach significantly improves the accuracy of classification and structuring of semistructured data, especially for comments extracted from social media videos. The developed system uses natural language processing algorithms to analyze the data by industry, which allows comments to be structured automatically according to their content and their tone to be analyzed in detail. The effectiveness of this approach was validated by analyzing comments from various social platforms, which demonstrated its ability to extract and structure relevant information and to assess the impact of videos through user reactions.
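
As an illustration of the clustering step named under Methods, the following is a minimal sketch of K-Means grouping of comments by shared vocabulary. It is not the authors' implementation. It assumes plain term-frequency vectors instead of BERT-based features, uses only the Python standard library rather than spaCy, NLTK, or Transformers, and covers K-Means only (not DBSCAN or Agglomerative clustering). The function names (`tokenize`, `vectorize`, `kmeans`) and the sample comments are hypothetical and chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (\\w is Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text.lower())


def vectorize(comments):
    """Map each comment to an L2-normalised term-frequency vector."""
    docs = [Counter(tokenize(c)) for c in comments]
    vocab = sorted({word for doc in docs for word in doc})
    vectors = []
    for doc in docs:
        vec = [float(doc[word]) for word in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented sample comments: two about visuals, two about audio problems.
comments = [
    "great video great editing",
    "great editing love the video",
    "terrible sound bad quality",
    "bad sound terrible quality audio",
]
labels = kmeans(vectorize(comments), k=2)
print(labels)
```

In this toy input the first two comments share no words with the last two, so they fall into separate clusters. On real comment data, stop-word removal, lemmatization, and richer embeddings would likely be needed before the clusters correspond to meaningful themes.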

About the Authors

A. A. Poguda
National Research Tomsk State University
Russian Federation

Alexey A. Poguda - Scientific Supervisor, Candidate of Technical Sciences, Associate Professor, Faculty of Innovative Technologies



H. Tape
National Research Tomsk State University
Russian Federation

Habib Jean Max Tape - Postgraduate student, Faculty of Innovative Technologies




For citations:


Poguda A.A., Tape H. Development of an Intelligent System for Processing Semistructured Data: Industry Structuring and Advanced Analysis of Information Extracted from Comments to Video Clips in Social Networks. Open Education. 2025;29(2):55-70. (In Russ.) https://doi.org/10.21686/1818-4243-2025-2-55-70


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-4243 (Print)
ISSN 2079-5939 (Online)