Journal of Computational and Applied Linguistics

Text Data Augmentation Using Generative Adversarial Networks – A Systematic Review

Kanishka Silva — 2023-07-18

Insufficient data is one of the main obstacles in natural language processing tasks, and the most prevalent solution is to collect enough data to optimise the model. However, recent research is moving towards increasing the number of training examples, owing to the data-hungry nature of neural models. Data augmentation is an emerging area that aims to increase the diversity of data, and thereby boost a model’s performance, without collecting new data. Limitations in data augmentation, especially for textual data, are mainly due to the nature of language data, which is inherently discrete. Generative Adversarial Networks (GANs) were initially introduced for computer vision applications, aiming to generate highly realistic images by learning image representations. Recent research has focused on using GANs for text generation and augmentation. This systematic review presents the theoretical background of GANs and their use for text augmentation, alongside a review of recent textual data augmentation applications such as sentiment analysis, low-resource language generation, hate speech detection and fraud review analysis. Further, challenges in current research and future directions of GAN-based text augmentation are discussed to pave the way for researchers, especially those working with low-resource text.

Extracting Algorithmic Complexity in Scientific Literature for Advance Searching

Abu Bakar — 2023-07-18

Non-textual document elements such as charts, diagrams, algorithms and tables play an important role in presenting key information in scientific documents. Recent advances in information retrieval systems tap this information to answer more complex user queries by mining text pertaining to non-textual document elements from the full text. Algorithms are critically important in computer science. Researchers are working to improve existing algorithms for critical applications, and new algorithms for unsolved and newly encountered problems are under development. These enhanced and new algorithms are mostly published in scholarly documents, where the authors also discuss their complexity. The complexity of an algorithm is also an important factor for information retrieval (IR) systems. In this paper, we mine the relevant complexities of algorithms from full-text documents by comparing the metadata of the algorithm, such as its caption and function name, with the context of the paragraph in which the authors discuss complexity. Using a dataset of 256 documents downloaded from the CiteSeerX repository, we manually annotate 417 links between algorithms and their complexities. Further, we apply our novel rule-based approach, which identifies the desired links with 81% precision, 75% recall, 78% F1-score and 65% accuracy. Overall, our method of identifying the links has the potential to improve information retrieval systems that tap the advancements of full text and, more specifically, non-textual document elements.
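
As a quick consistency check on the figures above, the F1-score can be recomputed as the harmonic mean of the reported precision and recall. This is only an arithmetic sketch: the helper name f1_score is hypothetical and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both values are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (81% and 75%).
f1 = f1_score(0.81, 0.75)
print(f"{f1:.2f}")  # prints 0.78
```

The result agrees with the 78% F1-score stated in the abstract.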

Computer-assisted Transcription and Analysis of Bulgarian Child Speech Data using CHILDES and CLAN

Velka Popova — 2023-07-18

The present paper focuses on the possibilities offered by corpus linguistics in the study of child speech, with its specificities as a linguistic phenomenon. An attempt is made to highlight the advantages of the CHILDES system for studying spontaneous speech interaction in the Bulgarian corpus of child language (Bulgarian LabLing Corpus), in which the data are transcribed and annotated within this paradigm.

Translation of Metaphors in Official and Automatic Subtitling and MT Evaluation

Maral Shintemirova — 2023-07-18

One of the main aims of this work is to compare and analyse the translation of metaphors in subtitles as performed by human translators and by machine translation, and to conduct an MT evaluation. The work considers two YouTube videos of a Cyberpunk 2077 (2020) videogame walkthrough. The first video is in the original language (English) with English subtitles, and the second is an officially translated video in Russian with Russian subtitles. Both videos have the same content, but in different languages. Metaphors were extracted manually from the selected audiovisual material in English using MIPVU (Metaphor Identification Procedure Vrije Universiteit). To achieve our aims, the translation of these metaphors in the official Russian subtitles was analysed first; secondly, their automatic translation into Russian by Google Translate, as it appears on YouTube, was analysed as well; the results were then compared to find the similarities and differences between the automatically translated metaphors on YouTube and the translated metaphors in the official subtitles. Another aim is to perform Machine Translation (MT) evaluation using the BLEU (Bilingual Evaluation Understudy) algorithm and to determine the errors made by MT while translating metaphors in the analysed subtitles. Three examples taken from the videos are presented as cases. The cases show different metaphors and the situations in which they were used, and analyse why these metaphors were used in that particular situation, how they were identified, how they were translated and why they were translated in exactly this way. Furthermore, the machine translation of the same metaphors is analysed and compared with the official translation. The speech recognition process and the metaphor identification procedure are also touched upon.
The results demonstrate that although machine translation is able to translate frequently used, popular metaphors, or metaphors whose literal translation retains the meaning, it is still difficult for the machine to recognise an author’s original metaphors or to translate using the context of the situation. The results could encourage training the machine to recognise metaphors and creating a larger database of metaphors to help identify them.
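
To illustrate the kind of score BLEU produces, the following is a minimal sentence-level sketch: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It assumes a single reference, whitespace tokenisation and no smoothing; the function names and the example sentences are hypothetical, and this is not the evaluation setup used in the paper.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU with uniform n-gram weights."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example: a candidate that differs from the reference in one word.
score = sentence_bleu("time is a flat circle here", "time is a flat circle now")
print(round(score, 3))
```

Because the score rests on surface n-gram overlap, a literal machine translation of a metaphor can score well while missing the intended meaning, which is why a qualitative error analysis is a useful complement.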

Gender-neutral Language Use in the Context of Gender Bias in Machine Translation (A Review Literature)

Aida Kostikova — 2023-07-18

Gender bias has become one of the central issues analysed within natural language processing (NLP) research. A main concern in this field relates to the fact that many NLP tools and automatic machine learning systems not only reflect, but also reinforce social disparities, including those related to gender, and language technology is one of the areas in which this issue is pronounced. This paper analyses the problem of gender-neutral language use from the standpoint of gender bias in machine translation (MT). We determine which types of harm can be caused by the failure to reflect gender-neutral language in translation, provide a general definition of gender bias in MT, describe its sources and provide an overview of existing mitigation strategies. One of the main contributions of this work is that it focuses not only on women, but also on non-binary people, whose linguistic visibility has received only limited attention from academia. This literature review provides a firm foundation for further research aimed at addressing the problem of gender bias in machine translation, especially bias linked to representational harms.

How Much Linguistics in Corpus Linguistics? Review of Doing Linguistics with a Corpus by Egbert, J., Larsson, T. and Biber, D. (2020). Cambridge University Press

Maria Stambolieva — 2023-07-18