Learning from Imbalanced Data Sets
Authors: Alberto Fernández and Salvador García and Mikel Galar and Ronaldo C. Prati and Bartosz Krawczyk and Francisco Herrera
This book provides a general and comprehensible overview of imbalanced learning. It contains a formal description of the problem and focuses on its main features and the most relevant proposed solutions. Additionally, it considers the different scenarios in Data Science in which imbalanced classification can create a real challenge.
This book stresses the gap with standard classification tasks by reviewing the case studies and ad-hoc performance metrics that are applied in this area. It also covers the different approaches that have been traditionally applied to address the binary skewed class distribution. Specifically, it reviews cost-sensitive learning, data-level preprocessing methods and algorithm-level solutions, also taking into account those ensemble-learning solutions that embed any of the former alternatives. Furthermore, it focuses on the extension of the problem to multi-class settings, where the former classical methods can no longer be applied in a straightforward way.
This book also focuses on the intrinsic data characteristics that, added to the uneven class distribution, are the main causes that truly hinder the performance of classification algorithms in this scenario. Then, some notes on data reduction are provided in order to understand the advantages of this type of approach.
Finally, this book introduces some novel areas of study that are attracting growing attention in connection with the imbalanced data issue. Specifically, it considers the classification of data streams, non-classical classification problems, and the scalability related to Big Data. Examples of software libraries and modules to address imbalanced classification are provided.
This book is highly suitable for technical professionals, senior undergraduate and graduate students in the areas of data science, computer science and engineering. It will also be useful for scientists and researchers to gain insight into the current developments in this area of study, as well as future research directions.
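As an illustration of two ideas named in the description above (data-level preprocessing and performance metrics suited to skewed classes), the following is a minimal sketch and is not taken from the book. It assumes random oversampling as the preprocessing step and the geometric mean of sensitivity and specificity as the metric; the function names `random_oversample` and `g_mean` are hypothetical.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the true positive rate and true negative rate
    for a binary problem (assumed metric, chosen for illustration)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)


# Toy data: four majority samples (class 0) and one minority sample (class 1).
X = [[0], [1], [2], [3], [4]]
y = [0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 4 samples

# A classifier that always predicts the majority class is 80% accurate
# on the original data, yet its G-mean is 0.0.
print(g_mean(y, [0] * len(y)))
```

The toy example shows why plain accuracy is misleading here: the always-majority predictor looks good by accuracy while never detecting the minority class.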
A Second-Order Statistics Method for Blind Source Separation in Post-Nonlinear Mixtures
In the context of nonlinear Blind Source Separation (BSS), the Post-Nonlinear (PNL) model is of great importance due to its suitability for practical nonlinear problems. Under certain mild constraints on the model, Independent Component Analysis (ICA) methods are valid for performing source separation, but they require the use of Higher-Order Statistics (HOS). Conversely, the study of separation based solely on Second-Order Statistics (SOS) is still at an early stage. In that sense, in this work, the conditions and the constraints on the PNL model for SOS-based separation are investigated. The study encompasses a time-extended formulation of the PNL problem with the objective of extracting the temporal structure of the data more extensively, considering SOS-based methods for separation, including the proposition of a new one. Based on this, it is shown that, under some constraints on the nonlinearities and if a given number of time delays is considered, source separation can be successfully achieved, at least for polynomial nonlinearities. With the aid of the Differential Evolution and Clonal Selection Algorithm metaheuristics for optimization, the performances of the SOS-based methods are compared in a set of simulation scenarios, in which the proposed method proves to be a promising approach.
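To make the setting above concrete, here is a minimal sketch of how a PNL mixture can be generated and how a delayed second-order statistic can be computed from it. It is not the method of the paper. It assumes the usual PNL structure (a linear mixing stage followed by a component-wise nonlinearity), first-order autoregressive sources as an example of temporally structured signals, and a cubic polynomial nonlinearity; the mixing matrix, coefficients and function names are all hypothetical.

```python
import random


def ar1_source(n, coeff, rng):
    """Temporally correlated source: s[t] = coeff * s[t-1] + white Gaussian noise."""
    samples, prev = [], 0.0
    for _ in range(n):
        prev = coeff * prev + rng.gauss(0.0, 1.0)
        samples.append(prev)
    return samples


def pnl_mix(sources, A, nonlinearities):
    """Apply a linear mixture A, then one nonlinearity per mixture channel."""
    n = len(sources[0])
    mixed = []
    for row, f in zip(A, nonlinearities):
        linear = [sum(a * s[t] for a, s in zip(row, sources)) for t in range(n)]
        mixed.append([f(v) for v in linear])
    return mixed


def lagged_cov(x, y, lag):
    """Sample covariance between x[t] and y[t + lag] (a second-order statistic)."""
    n = len(x) - lag
    mean_x = sum(x[:n]) / n
    mean_y = sum(y[lag:lag + n]) / n
    return sum((x[t] - mean_x) * (y[t + lag] - mean_y) for t in range(n)) / n


rng = random.Random(0)
sources = [ar1_source(1000, 0.9, rng), ar1_source(1000, 0.3, rng)]
A = [[1.0, 0.5], [0.4, 1.0]]                 # hypothetical mixing matrix
f = [lambda v: v + 0.1 * v ** 3] * 2         # hypothetical polynomial nonlinearity
x = pnl_mix(sources, A, f)

# Delayed covariances of the observed mixtures for a few time delays.
for lag in (0, 1, 2):
    print(lag, lagged_cov(x[0], x[1], lag))
```

SOS-based criteria operate on statistics of this kind, computed at several time delays.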
Emerging topics and challenges of learning from noisy data in nonstandard classification: a survey beyond binary class noise
The problem of instances with class noise is omnipresent in different classification problems. However, most research focuses on noise handling in binary classification problems and adaptations to multiclass learning. This paper aims to contextualize label noise in non-binary classification problems, including multiclass, multilabel, multitask, multi-instance, ordinal and data stream classification. Practical considerations for analyzing noise under these classification problems, as well as trends, open-ended problems and future research directions, are analyzed. We believe this paper could help expand research on class noise handling and help practitioners to better identify the particular aspects of noise in challenging classification scenarios.
With the increased availability of online services, enhanced authentication mechanisms, including biometric systems, are necessary. However, recent studies show that biometric features can change. Consequently, recognition performance can be affected over time …
Using Taylor Series Expansions and Second-Order Statistics for Blind Source Separation in Post-Nonlinear Mixtures
In the context of Post-Nonlinear (PNL) mixtures, source separation based on Second-Order Statistics (SOS) is a challenging topic due to the inherent difficulties when dealing with nonlinear transformations. Under the assumption that sources are temporally colored, the existing SOS-inspired methods require the use of Higher-Order Statistics (HOS) as a complement in certain stages of PNL demixing. However, a recent study has shown that the sole use of SOS is sufficient for separation if certain constraints on the separation system are …
A greedy search tree heuristic for symbolic regression
Symbolic Regression tries to find a mathematical expression that describes the relationship of a set of explanatory variables to a measured variable. The main objective is to find a model that minimizes the error and, optionally, that also minimizes the expression size. A smaller expression can be seen as an interpretable model and thus as a reliable decision model. This is often performed with Genetic Programming, which represents its solutions as expression trees. The shortcoming of this algorithm lies in this representation, which defines a rugged search space and contains expressions of any size and difficulty. These pose a challenge to finding the optimal solution under computational constraints. This paper introduces a new data structure, called Interaction-Transformation (IT), that constrains the search space in order to exclude a region of larger and more complicated expressions. In order to test this data structure, a heuristic called SymTree was also introduced. The obtained results show evidence that SymTree is capable of obtaining the optimal solution whenever the target function is within the search space of the IT data structure, and competitive results when it is not. Overall, the algorithm found a good compromise between accuracy and simplicity for all the generated models.
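The abstract above does not define the IT data structure, so the following is only a sketch under an assumed reading of the name: each term is an interaction (a product of the input variables raised to integer exponents) passed through a single transformation function, and an expression is a weighted sum of such terms. The function names `it_term` and `it_eval` and the example expression are hypothetical, and the SymTree search itself is not shown.

```python
import math


def it_term(x, exponents, transform):
    """Evaluate one term: transform(prod_i x[i] ** exponents[i])."""
    interaction = 1.0
    for value, power in zip(x, exponents):
        interaction *= value ** power
    return transform(interaction)


def it_eval(x, terms, weights):
    """Evaluate a weighted sum of (exponents, transform) terms at point x."""
    return sum(w * it_term(x, e, f) for (e, f), w in zip(terms, weights))


def identity(v):
    return v


# Assumed example expression: 2 * x1 * x2 + 3 * sin(x1 ** 2)
terms = [((1, 1), identity), ((2, 0), math.sin)]
weights = [2.0, 3.0]

print(it_eval((1.0, 2.0), terms, weights))  # 2*1*2 + 3*sin(1)
```

Under this reading, nested compositions such as sin(cos(x)) cannot be expressed, which is one way a representation can exclude larger and more complicated expressions from the search space.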
Analysis of the Twitter Interactions during the Impeachment of Brazilian President
The impeachment process that took place in Brazil in April 2016 generated a large number of posts on Internet Social Networks. These posts came from ordinary people, journalists, traditional and independent media, politicians and supporters. Interactions among users, by sharing news or opinions, can show the dynamics of communication between and within groups. This paper proposes a method for analyzing social network interactions by using motifs, which are frequent interaction patterns in a network. This method is then applied to analyze data extracted from Twitter during the voting for the impeachment of the Brazilian president. Results of this analysis highlight the behavior of some users who retweet each other to increase the importance of their opinion or to gain visibility. In addition, interaction patterns reveal that messages from one group (against/in favor of impeachment) rarely propagate to the opposing group. As such, this brings evidence that Social Networks may not stimulate a debate, but reaffirm users’ beliefs.
PolyWaTT: A polynomial water travel time estimator based on Derivative Dynamic Time Warping and Perceptually Important Points
Traditional methods for estimating timing parameters in hydrological science require a rigorous study of the relations of flow resistance, slope, flow regime, watershed size, water velocity, and other local variables. These studies are mostly based on empirical observations, where the timing parameter is estimated using empirically derived formulas. The application of these studies to other locations is not always direct. The locations in which equations are used should have comparable characteristics to the locations from …
Blind channel equalization of encoded data over galois fields
In communication systems, the study of elements and structures defined over Galois fields is generally limited to data coding. However, in this work, a novel perspective that combines data coding and channel equalization is considered to compose a simplified communication system over the field. Besides the coding advantages, this framework is able to restore distortions or malfunctioning processes, and can be potentially applied in network coding models. Interestingly, the operation of the equalizer is possible from a blind …
User profiling of the Twitter Social Network during the impeachment of Brazilian President
The impeachment process that took place in Brazil in April 2016 generated a large number of posts on Social Networks. These posts came from ordinary people, journalists, traditional and independent media, politicians and supporters. Identifying the impact of this subject on each group of users can be an important analysis to verify the real interest of common Brazilian citizens in this matter. As such, we propose a way to segment the users into popular, activists and observers in order to filter out information and help us give a more detailed analysis of the event. The proposed segmentation may also help other studies related to the usage of Twitter during important events.
Conceptual and Practical Aspects of the aiNet Family of Algorithms
In this paper, a review of the conceptual and practical aspects of the aiNet (Artificial Immune Network) family of algorithms will be provided. This family of algorithms started with the aiNet algorithm, proposed in 2002 for data clustering and, since then, several variations have been developed for data clustering, biclustering and optimization in general. Although the algorithms will be positioned with respect to other pertinent approaches from the literature, the emphasis of this paper will be on the formalization and critical analysis of the set of …
Analysis of a Novel Density Matching Criterion Within the ITL Framework for Blind Channel Equalization
In blind channel equalization, the use of criteria from the field of information theoretic learning (ITL) has already proved to be a promising alternative, since the use of higher-order statistics is mandatory in this task. In view of the several existing ITL proposals, we present in this work a detailed comparison of the main ITL criteria employed for blind channel equalization and also introduce a new ITL criterion based on the notion of distribution matching. The analyses of the ITL framework are carried out by means of comparison …