dc.date.accessioned |
2015-12-18T14:40:44Z |
und |
dc.date.accessioned |
2017-10-24T12:24:06Z |
|
dc.date.available |
2015-12-18T14:40:44Z |
und |
dc.date.available |
2017-10-24T12:24:06Z |
|
dc.date.issued |
2015-12-18T14:40:44Z |
|
dc.identifier.uri |
http://radr.hulib.helsinki.fi/handle/10138.1/5248 |
und |
dc.identifier.uri |
http://hdl.handle.net/10138.1/5248 |
|
dc.title |
Large-scale Multi-Label Text Classification for an Online News Monitoring System |
en |
ethesis.department.URI |
http://data.hulib.helsinki.fi/id/225405e8-3362-4197-a7fd-6e7b79e52d14 |
|
ethesis.department |
Institutionen för datavetenskap |
sv |
ethesis.department |
Department of Computer Science |
en |
ethesis.department |
Tietojenkäsittelytieteen laitos |
fi |
ethesis.faculty |
Matematisk-naturvetenskapliga fakulteten |
sv |
ethesis.faculty |
Matemaattis-luonnontieteellinen tiedekunta |
fi |
ethesis.faculty |
Faculty of Science |
en |
ethesis.faculty.URI |
http://data.hulib.helsinki.fi/id/8d59209f-6614-4edd-9744-1ebdaf1d13ca |
|
ethesis.university.URI |
http://data.hulib.helsinki.fi/id/50ae46d8-7ba9-4821-877c-c994c78b0d97 |
|
ethesis.university |
Helsingfors universitet |
sv |
ethesis.university |
University of Helsinki |
en |
ethesis.university |
Helsingin yliopisto |
fi |
dct.creator |
Pierce, Matthew |
|
dct.issued |
2015 |
|
dct.language.ISO639-2 |
eng |
|
dct.abstract |
This thesis provides a detailed exploration of numerous methods, some established and some novel, considered in the construction of a text-categorization system for use in a large-scale, online news-monitoring system known as PULS. PULS is an information extraction (IE) system, consisting of a number of tools for automatically collecting named entities from text. The system also has access to large training corpora in the business domain, where documents are annotated with associated industry sectors. These assets are leveraged in the construction of a multi-label industry-sector classifier, the output of which is displayed for new articles on the web-based front-end of PULS.
Through review of background literature and direct experimentation with each stage of development, we illuminate many major challenges of multi-label classification. These challenges include: working effectively in a real-world scenario that poses time and memory restrictions; organizing and processing semi-structured, pre-annotated text corpora; handling large-scale data sets and label sets with significant class imbalances; weighing the trade-offs of different learning algorithms and feature-selection methods with respect to end-user performance; and finding meaningful evaluations for each system component.
In addition to presenting the challenges associated with large-scale multi-label learning, this thesis presents a number of experiments and evaluations to determine which methods enhance overall performance. The major outcome of these experiments is a multi-stage, multi-label classifier that combines IE-based rote classification, using features extracted by the PULS system, with an array of balanced statistical classifiers. Evaluation of this multi-stage system shows improvement over a baseline classifier and, for certain evaluations, over state-of-the-art performance from the literature, when tested on a commonly used corpus. Aspects of the classification method and their associated experimental results have also been published in international conference proceedings. |
en |
dct.language |
en |
|
ethesis.language.URI |
http://data.hulib.helsinki.fi/id/languages/eng |
|
ethesis.language |
English |
en |
ethesis.language |
englanti |
fi |
ethesis.language |
engelska |
sv |
ethesis.thesistype |
pro gradu-avhandlingar |
sv |
ethesis.thesistype |
pro gradu -tutkielmat |
fi |
ethesis.thesistype |
master's thesis |
en |
ethesis.thesistype.URI |
http://data.hulib.helsinki.fi/id/thesistypes/mastersthesis |
|
ethesis.degreeprogram |
Algorithms and Machine Learning |
en |
dct.identifier.urn |
URN:NBN:fi-fe2017112251298 |
|
dc.type.dcmitype |
Text |
|