Automatic content classification of texts in the Karakalpak language: addressing the low-resource challenge through cross-lingual transfer learning and data augmentation techniques

Authors

  • Oteniyazov Rashid Idrisovich
  • Qonarbaev David Xalbaevich

Keywords

low-resource languages, text classification, cross-lingual transfer learning, data augmentation, Karakalpak language, multilingual NLP, Turkic languages, machine translation, BERT, morphological analysis

Abstract

Automatic text classification is a fundamental task in natural language processing, yet its application to low-resource languages such as Karakalpak remains substantially underdeveloped. This study addresses a critical gap in multilingual natural language processing by investigating methodologies for effective content classification of Karakalpak texts despite severe data scarcity. We present an integrated approach combining cross-lingual transfer learning with data augmentation techniques tailored to morphologically rich, low-resource language contexts. Our methodology leverages pre-trained multilingual BERT models, applies targeted fine-tuning strategies, and generates synthetic training data through machine translation and back-translation. Empirical evaluation on Karakalpak news classification and sentiment analysis datasets demonstrates significant improvements over monolingual baselines, achieving F1-scores of 0.87 on news classification and 0.79 on sentiment analysis. We show that the strategic combination of transfer learning and data augmentation mitigates resource scarcity more effectively than either technique in isolation. Analysis reveals that the morphological characteristics of Karakalpak, shared across the Turkic language family, enable effective knowledge transfer from related languages. Our findings establish that low-resource status need not prevent the development of practical text classification systems when theoretically informed methodologies address the relevant linguistic and computational constraints.
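The back-translation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the `translate` function here is a toy word-by-word stand-in for a real machine translation system (e.g., an OPUS-MT model routed through a pivot language), and the language codes and lexicon are assumptions for demonstration only.

```python
def translate(text, src, tgt, lexicon):
    """Toy word-by-word 'translation' standing in for a real MT model.
    Words without a lexicon entry pass through unchanged."""
    return " ".join(lexicon.get((w, src, tgt), w) for w in text.split())


def back_translate(samples, pivot, lexicon, src="kaa"):
    """Round-trip each labelled sample through a pivot language and keep
    only paraphrases that differ from the original (simple dedup).
    Labels are preserved, since back-translation is label-invariant."""
    augmented = []
    for text, label in samples:
        pivot_text = translate(text, src, pivot, lexicon)   # kaa -> pivot
        round_trip = translate(pivot_text, pivot, src, lexicon)  # pivot -> kaa
        if round_trip != text:  # discard exact duplicates of the source
            augmented.append((round_trip, label))
    return augmented


# Toy usage: one word round-trips into a synonym, producing a paraphrase.
samples = [("good news", "pos"), ("bad news", "neg")]
lexicon = {
    ("good", "kaa", "ru"): "khoroshiy",
    ("khoroshiy", "ru", "kaa"): "jaqsi",
}
print(back_translate(samples, "ru", lexicon))
```

In a real setting the lexicon-based `translate` would be replaced by a neural MT system, and the dedup check would typically be supplemented with a semantic-similarity or length-ratio filter to discard degenerate round-trips.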

References

Agić, Ž., & Vulić, I. (2019). JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Aue, A., & Gamon, M. (2005). Customizing sentiment classifiers to new domains: A case study. In RANLP.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Hasan, M. K., Rahman, W., & others. (2019). UR-FUNNY: A multimodal language dataset for understanding humor. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

Husain, S., Samih, Y., Zellers, R., Schalley, A. C., & Bhatia, P. (2014). Computational linguistics for less-resourced languages. In Proceedings of the LREC 2014 Workshop on Less-Resourced Languages.

Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based machine translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics.

Lewis, M., Liu, Y., Goyal, N., & others. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Sennrich, R., Haddow, B., & Birch, A. (2016). Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Srivastava, A., Singhal, K., & Kumar, A. (2020). Text classification for the Dravidian languages. In Proceedings of the 1st Workshop on Language Technology for Equality in the Classroom.

Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In LREC.

Tiedemann, J., & Thottingal, S. (2020). OPUS-MT: Building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems.

Published

2026-01-15

How to Cite

Oteniyazov Rashid Idrisovich, & Qonarbaev David Xalbaevich. (2026). Automatic content classification of texts in the Karakalpak language: addressing the low-resource challenge through cross-lingual transfer learning and data augmentation techniques. SAMARALI TA’LIM VA BARQAROR INNOVATSIYALAR JURNALI, 4(1), 160–170. Retrieved from https://innovativepublication.uz/index.php/jelsi/article/view/5049