A Machine Learning Taxonomic Classifier for Science Publications

Abstract:

The evolution in scientific production, associated with the growing interdomain collaboration of knowledge and the increasing co-authorship of scientific works remains supported by processes of manual, highly subjective classification, subject to misinterpretation. The taxonomy on which this same classification process is based is not consensual, with governmental organizations resorting to taxonomies that do not keep up with changes in scientific areas, and indexers / repositories that seek to keep up with those changes. We find a reality distinct from what is expected and that the domains where scientific work is recorded can easily be misrepresentative of the work itself. The taxonomy applied today by governmental bodies, such as the one that regulates scientific production in Portugal, is not enough, is limiting, and promotes classification in areas close to the desired, therefore with great potential for error. An automatic classification process based on machine learning algorithms presents itself as a possible solution to the subjectivity problem in classification, and while it does not solve the issue of taxonomy mismatch this work shows this possibility with proved results. This paper proposes the use of Machine Learning with Natural Language Processing techniques and suggests some changes on the current taxonomies to improve classification and access to science and concludes with some future work proposals for an increasingly representative classification of evolution in science, which is not intended as airtight, but flexible and perhaps increasingly based on phenomena and not just disciplines.