Browsing by Issue Date
Now showing items 1-2 of 2
-
(2023) The vast majority of the world's languages are low-resource, lacking the data required by advanced natural language processing (NLP) based on data-intensive deep learning. Furthermore, annotated training data can be insufficient in some domains even within resource-rich languages. Low-resource NLP is crucial both for the inclusion of language communities in the NLP sphere and for the extension of applications over a wider range of domains. The objective of this thesis is to contribute to this long-term goal, especially with regard to truly low-resource languages and domains. We address truly low-resource NLP in the context of two tasks. First, we consider the low-level task of cognate identification, since cognates are useful for the cross-lingual transfer of many lower-level tasks into new languages. Second, we examine the high-level task of document planning, a fundamental task in data-to-text natural language generation (NLG), where many domains are low-resource. Thus, domain-independent document planning supports the transfer of NLG across domains. Following recent encouraging results, we propose neural network models for these tasks, using transfer learning methods in three low-resource scenarios. We divide our high-level objective into three research tasks characterised by different resource conditions. In our first research task, we address cognate identification in endangered Sami languages of the Uralic family, given scarce labelled training data. We propose a Siamese convolutional neural network (S-CNN) and a support vector machine (SVM), which we pre-train on unrelated Indo-European data because the Sami languages lack high-resource close relatives. We find that S-CNN performs best at direct transfer to Sami, and adapts fast when fine-tuned on a small amount of Sami data. In our second research task, we address a scenario in which only unlabelled data is available for adapting S-CNN from Indo-European to Uralic data.
We propose both discriminative adversarial networks and pre-trained symbol embeddings, finding that adversarial adaptation outperforms an unadapted model, while symbol embeddings are beneficial when languages have disparate orthographies. In our third research task, we address document planning in data-to-text generation of news, in a domain with no annotated training data whatsoever. We propose distant supervision, automatically constructing labelled data from a news corpus, and train a neural model for sentence ordering, a task related to document planning. We examine Siamese, positional, and pointer networks, and find that a variant of S-CNN results in generation with higher human-perceived quality than heuristic baselines. The contributions of this thesis include addressing novel low-resource scenarios in two NLP tasks for which the potential of deep learning has not been fully explored. We propose novel approaches to these tasks using neural models in combination with transfer learning, and our experiments indicate their performance in comparison with baselines. Finally, although we acknowledge that rule-based methods and heuristics might still be superior to deep learning in truly low-resource scenarios, our approaches are more language- and domain-independent, supporting a wider coverage of NLP across languages and domains.
-
(2023) A wide variety of nitrogen-containing compounds present in the air can contribute to air pollution, which in turn affects both human health and the climate. In this thesis, the applicability of two miniaturized air sampling techniques, solid-phase microextraction (SPME) Arrow and in-tube extraction (ITEX), was studied for the selective collection of nitrogen-containing compounds in air samples. Several sorbent materials, including Mobil Composition of Matter No. 41 (MCM-41), titanium hydrogen phosphate-modified MCM-41 (MCM-41-TP), and zinc oxide-modified mesoporous silica microspheres, were used in the ITEX sampling system. The adsorption and desorption behavior of gaseous nitrogen-containing compounds in passive SPME Arrow and active ITEX sampling systems, coated and packed, respectively, with different sorbent materials, was investigated. In addition, saturation vapor pressures of atmospheric trace gases were estimated both experimentally and theoretically. The sampling systems with selected sorbent materials were applied to the determination of nitrogen-containing compounds in air at the boreal forest SMEAR II station, in indoor air, and in cigarette smoke. Adsorbent and adsorbate properties, such as hydrophobicity and basicity, were the major factors that affected sorbent selectivity towards nitrogen-containing compounds. Moreover, the pore volume and pore sizes of the sorbents were essential parameters for the adsorption performance, especially in the SPME Arrow system. The ITEX packing and the SPME Arrow coatings were reproducible and reusable. Owing to its active sampling principle, the ITEX sampler, with its higher adsorption and desorption rates, provided better analytical results, especially when quick injection was needed in gas chromatography. The selectivity of the ITEX sampling system was increased with the trap accessory, but further study is needed to prevent the loss of the targeted compounds.
The ITEX filter accessory, in contrast, was successfully employed to remove particles, enabling ITEX to collect only gas-phase samples. Vapor pressure results were obtained from laboratory experiments (using the retention index approach) and from the COSMO-RS model. An aerial drone was successfully employed as a platform carrying the miniaturized SPME Arrow and ITEX air sampling systems, along with portable devices for the real-time measurement of black carbon (BC) and total particle numbers, to study vertical profiles of volatile organic compounds (VOCs) at high altitudes, from 50 to 400 m. The nitrogen-containing compounds collected at different altitudes at the SMEAR II station, Finland, showed a clear distribution that depended on their sources. Other VOCs demonstrated the same trend.