Application of variable selection and dimension reduction on predictors of MSE’s development
Journal of Big Data volume 6, Article number: 17 (2019)
Abstract
Nature creates variables from component characteristics, and variables share those characteristics to a degree ranging from very small to relatively large. As a result, variables range from very different to very similar in character, and this gives rise to relationships among them. The literature suggests different measures of relationship depending on the nature of the variables and the type of relationship that exists. Today, because large and highly varied data are produced frequently, the variable filtering and selection methods currently suggested fall short of what is needed. This research aims to fill that gap by comparing the methods suggested in the literature in order to identify better variable selection and dimension reduction methods. Regression analysis using all the factors suggested in the literature shows that none of the predictors of the development status of an enterprise is significant, and that only 10 of the 81 factors are significant predictors of the number of employees in an enterprise. Variable selection and dimension reduction methods are therefore applied to identify the predictors of a response by removing variable redundancy and the complexity of incorporating a large number of variables. Based on statistical power, the results of the variable selection methods, especially the association and correlation methods, show that CANOVA more efficiently detects non-linear or non-monotonic correlation between continuous–continuous and continuous–categorical variables. Spearman’s correlation coefficient more efficiently detects monotonic correlation between two continuous variables, and between a continuous and a categorical variable. The Pearson correlation coefficient more efficiently detects linear correlation between continuous variables. MIC efficiently detects non-linear or non-monotonic relations between continuous variables.
The chi-square test of independence efficiently detects relations between two continuous variables and between two categorical variables, but it does not detect non-linear or non-monotonic relations between a continuous and a categorical variable well. The results of the lasso and stepwise methods, on the other hand, reveal that predictor–response relations arising from interaction effects, which the correlation and association methods miss, are detected by stepwise variable selection, and that multicollinearity is detected and removed by the lasso. Regressing the response variable “number of employees in an enterprise” on the variables selected by the lasso and stepwise methods gives a better model fit (based on adjusted R-squared) than regressing it on the variables selected by the association and correlation methods. In contrast, regressing the response variable “development status of an enterprise” on the variables selected by the association and correlation methods yields 12 significant variables, whereas none of the variables selected by the lasso and stepwise methods is significant. As a result, 51 predictors of the number of employees in an enterprise and 40 predictors of the development status of an enterprise are detected as significantly related variables. The lasso and stepwise methods are therefore preferred for selecting predictors of the continuous response variable “number of employees in an enterprise”, and the association and correlation methods are preferred for selecting predictors of the categorical response variable “development status of an enterprise”. Finally, the reduced regression models reveal that 20 predictors have a causal relation with the number of employees in an enterprise, and 12 predictors have a causal relation with the development status of an enterprise.
Based on model fit, information loss, and the number of significant factors, principal factor analysis is preferred and applied for dimension reduction with the categorical response variable “development status of an enterprise”, and factor-score-based regression is preferred and applied with the continuous response variable “number of employees in an enterprise”. However, comparing the results of variable selection and dimension reduction shows that the variable selection methods give a larger gain in model fit than the dimension reduction methods. Hence, the suggested variable selection methods are preferred over the dimension reduction methods and are applied to identify the predictors. In general, the suggested variable selection procedure is recommended when a small number of variables is studied, and the suggested dimension reduction methods are recommended for a large number of varied variables (the big data case).
Introduction
Nature creates variables from component characteristics, and variables share those characteristics to a degree ranging from very small to relatively large. As a result, variables range from very different to very similar in character. Variables with a more similar character largely share the same component characteristics (they have relatively the same composition), whereas very small similarity reflects a large difference in the composition of component characteristics. Hence, treating variables with a similar character as one variable, or taking one of them as a representative, can remove this natural redundancy, and it helps to manage and analyse the relationships between variables in a world where a large number of variables are inter-related. This inter-relation gives variables a direct causal relation, an indirect causal relation, or a relation without a causal nature. Statistically, a direct causal relation indicates dependency between variables, whereas indirect causality is due to the presence of a latent variable. A relation without known causality, however, reflects a relation that is not yet well understood in the real world. The relation between variables can be linear, non-linear, or random. Statistical methods such as variable selection and dimension reduction can be used to reduce the number of variables, either by taking a single variable or by merging statistically similar variables into a component.
Measuring the predictor–predictor and response–predictor relations is important for recognising the relationships that exist and for obtaining a short list of influential factors whose effect on the response variable can be analysed further.
However, because the predictors are inter-related, they influence the response variable not only individually but also as a group. This natural inter-relation between variables is not captured by a simulation study, or by predictor–response association or correlation measures alone. Accordingly, this interaction effect is investigated on real data, the micro and small enterprise (MSE) data set [File Name: MSEs.csv], by considering the predictors filtered by association, correlation and regression measures of the predictor–predictor and predictor–response relations. The response variable is then regressed on the possible combinations of the selected (filtered) groups of variables, and the significantly and potentially related variables are re-selected using the stepwise and lasso variable selection methods.
Statistical measures of association, correlation and regression are used to identify the relations that exist between variables. In this research, the measures used for variable selection are the Pearson correlation coefficient, Spearman’s rank correlation coefficient, the chi-square test of independence, the maximal information coefficient (MIC), the continuous analysis of variance test (CANOVA), stepwise variable selection and lasso variable selection; the methods used for dimension reduction are principal factor analysis and factor score analysis.
Wang et al. [20] used simulated and real datasets (a kidney cancer RNA-seq dataset) to compare the false positive rates and statistical power of CANOVA with those of six other methods (distance correlation, Hoeffding’s independence test, the Pearson correlation coefficient, Spearman’s rank correlation coefficient, Kendall’s rank correlation coefficient and the maximal information coefficient). They showed that CANOVA, the Pearson correlation coefficient, Spearman’s rank correlation coefficient, Kendall’s rank correlation coefficient and MIC gave the expected false positive rates, so these methods can detect the truly significant variables. However, the false positive rate was lower than expected for distance correlation and higher than expected for Hoeffding’s independence test. Truly significant variables may therefore be missed by distance correlation, and the results of Hoeffding’s independence test may contain falsely significant variables. Accordingly, the Pearson correlation coefficient was recommended when the correlation between two continuous variables is linear, and CANOVA was recommended when the correlation between two continuous variables is non-linear or complicated.
Dimension reduction is a tool for avoiding the complexity of a large number of variables by retaining the smallest set of variables that can still convey the needed information. This is possible because some variables are highly correlated with each other or with a latent variable, or because a few variables in the set account for a large share of the variability in the data. For this type of problem, dimension reduction methods such as principal factor analysis and factor score analysis are suggested [1, 2].
Currently, because big data of high variety are produced frequently, the variable filtering and selection methods suggested in the literature fall short of what is needed. Hence, this research aims to fill this gap by identifying better variable filtering, selection and dimension reduction methods using real data. The statistical methods of variable selection and dimension reduction described above are applied to reduce the number of variables, either by taking a single variable or by merging statistically similar variables into a component.
Data and variable
From literature, entrepreneur’s development is measured in relation to the success of an individual, society, and firm survival [3, 4]. Bosma et al. [4] measured development of enterprise by considering profits of the entrepreneur, employment created by the entrepreneur, and the survival period of the firm. The determinants for development of entrepreneurs are dependent on the starting human capital, social capital, financial capital and strategies applied on business.
Coduras et al. [5] constructed a measure of an individual’s readiness for entrepreneurship based on three main categories: sociological, psychological and managerial–entrepreneurial. The South African Small Enterprise Development Agency studied, based on the literature and current data, the impact of the 2008 and 2009 global financial crisis on South Africa’s SMMEs, and suggested that these SMMEs are challenged by limited access to finance and markets, poor infrastructure, labour laws, crime, skills shortages and inefficient bureaucracy. Assefa et al. [7] studied the factors affecting the success of micro and small-scale enterprises in Addis Ababa and five other major regional towns in Ethiopia, and found that the key success factors are personal qualities, such as having an articulate vision or ambition and innate abilities; working experience in the formal sector as a factory employee or in a family business; managerial and entrepreneurial skills; and higher equity in the invested money. The constraints on MSEs, in contrast, are the shortage and small size of credit, the shortage of working and sales space, the lack of rental machinery, and stringent licensing requirements.
The sample data were collected from enterprises in Debre Markos town in 2017. The study units are individuals who started their business between 1994 and 2006 and are currently working in their own enterprise or business. The respondents gave detailed information on their entrepreneurial knowledge, skills and experience, on the business environment, and on their strategies. Additional information on the enterprises was also obtained from the Trade and Industry Office of Debre Markos town.
The sampling method of a study is determined by the nature of the population under study. The Ethiopian Ministry of Urban Development and Housing (MoUDH) classifies micro and small enterprises into five sectors: Manufacturing, Service, Trade, Construction, and Mining and Quarrying. However, the Trade and Industry Office of Debre Markos town currently re-classifies MSEs into the Manufacturing, Service, Trade, Urban Farming and Construction sectors, splitting the Service sector into Service and Urban Farming. Because enterprises are more heterogeneous across sectors than within a sector, stratified sampling is the appropriate choice. The sample size is determined by stratified optimal allocation based on the strata variances calculated from the secondary data obtained from the Trade and Industry Office. The variable of interest is enterprise development status, which is categorical with value 1 (achieved the expected progress stated by MoUDH) or 0 (did not achieve the expected progress). At the \(99\%\) level of confidence for the true population proportion to lie within a 0.05 interval of the sample proportion, a sample of 179 enterprises is taken from a total of 2093 enterprises. The study units are allocated to each stratum according to the stratum variance rather than in proportion to its size, because the strata differ greatly in size: some have fewer than 20 enterprises and some more than a thousand [8].
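The allocation step can be illustrated with a minimal sketch of Neyman (optimal) allocation, which assumes equal sampling cost in every stratum. The stratum sizes and proportions below are hypothetical values chosen only so that the sizes sum to 2093; they are not the study's figures.

```python
import math

# Hypothetical strata: (number of enterprises N_h, proportion p_h that achieved progress)
strata = {
    "Manufacturing": (300, 0.40),
    "Service": (500, 0.30),
    "Trade": (1100, 0.20),
    "Urban farming": (18, 0.50),
    "Construction": (175, 0.35),
}
TOTAL_SAMPLE = 179  # sample size reported in the text

def neyman_allocation(strata, total_sample):
    """Allocate n_h in proportion to N_h * S_h, with S_h = sqrt(p_h * (1 - p_h))."""
    weights = {
        name: size * math.sqrt(p * (1 - p))
        for name, (size, p) in strata.items()
    }
    total_weight = sum(weights.values())
    return {name: total_sample * w / total_weight for name, w in weights.items()}

allocation = neyman_allocation(strata, TOTAL_SAMPLE)
for name, n_h in allocation.items():
    print(f"{name}: {n_h:.1f}")
```

In practice the fractional allocations would be rounded, and very small strata may need a minimum sample size.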
Variable of the study
In this study, two dependent and 81 independent variables are considered. The explanatory variables are listed in Appendix: Tables 12, 13, 14, 15, 16 and 17.
Dependent variable
The variable of interest is enterprise development status. Bosma et al. [4] measured entrepreneurs’ development (an individual-level approach to measuring enterprise development status) in relation to the success of the individual, such as profit made and capital growth; the success of society, based on employment capacity; and firm survival. In the Ethiopian context, the Ministry of Urban Development and Housing (MoUDH) states a measure of the development status of micro and small enterprises based on the progress an enterprise has made in capital accumulation and in human capital, mainly in terms of the number of employees [3]. The MoUDH definition of micro and small enterprises is given in Table 1.
Correspondingly, in this study enterprise development status is measured in two ways: by the progress made by an enterprise, a categorical variable with value 1 (achieved expected progress) or 0 (not achieved expected progress), and by the number of employees in an enterprise, as defined by MoUDH.
Explanatory variables
Explanatory variables, or factors that have a direct or indirect influence on the variable of interest, need to be identified in order to find relevant ways of achieving the planned enterprise development by controlling the influential variables. As stated by Bosma et al. [4], the determinants of the development of entrepreneurs are related to starting human capital, social capital, financial capital, and the strategies applied in the business.
In general, the measures suggested in the literature for control variables, human capital, financial capital, influencing factors, social capital, and information relevant to business development are considered [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19].
Variable-selection method
Chi-squared test of independence
The chi-square test of independence is a statistical measure that tests for linear and non-linear association between variables. It helps to determine whether variables are independent of each other or whether there is a pattern of dependency between them. Formally, the test determines whether the observed pattern between the variables is strong enough to show that the two variables depend on each other, by considering all possible combinations of the variables’ events and testing the independence of each pair of events. If the probabilities of the possible values of one variable depend on which category of the other variable occurs, the two variables are dependent. A chi-square variable has a continuous distribution obtained as the sum of the squares of a set of independent standard normal variables. The chi-square distribution is right-skewed, with a lower limit at 0, most values near the centre, and a density that declines as \(\chi ^2\) increases to the right. Because the theoretical chi-square distribution is continuous whereas the chi-square statistic has a discrete distribution, the statistic is well approximated by the theoretical distribution only for a reasonably large sample, or when the expected number of cases exceeds 5 in most cells of the cross-classification table. The widely used rule is that no expected count should be less than 1 and that no more than 20% of the expected counts should be less than 5. The test is conducted under the null hypothesis that there is no relationship between the two variables being examined (they are independent), against the alternative hypothesis that there is some relationship (dependency) between them. Under the null hypothesis, the expected count for each cell is obtained from the multiplication rule of probability for independent events.
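As a minimal sketch of this procedure, the following computes the expected counts from the multiplication rule, the chi-square statistic, and the rule of thumb on expected counts. The 2 x 2 table is hypothetical (it is not taken from the MSE data), and the decision uses the standard 5% critical value for one degree of freedom.

```python
# Hypothetical 2 x 2 cross-classification (rows: one factor, columns: development status)
observed = [[30, 20],
            [15, 35]]

def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Multiplication rule under independence: E_ij = n * (row_i / n) * (col_j / n)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

statistic, dof, expected = chi_square_independence(observed)

# Rule of thumb: no expected count below 1, at most 20% of expected counts below 5
flat = [e for row in expected for e in row]
rule_ok = min(flat) >= 1 and sum(e < 5 for e in flat) / len(flat) <= 0.20

CRITICAL_5PCT_DF1 = 3.841  # upper 5% point of the chi-square distribution, 1 d.f.
reject_independence = statistic > CRITICAL_5PCT_DF1
print(round(statistic, 4), dof, rule_ok, reject_independence)
```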
Continuous analysis of variance test (CANOVA)
CANOVA is a measure of non-linear correlation between two continuous variables. It extends ANOVA to continuous variables by generalising the “within category variance”. CANOVA first defines a neighborhood for each data point of the response variable based on its predictor value, and then calculates the variance of the response values within that neighborhood. The CANOVA hypothesis, “similar neighbor predictor values lead to similar response values”, is tested by checking whether the statistic “within neighborhood sum square” is smaller than its “random expectation”. Because the “within neighborhood variance” statistic does not follow any familiar distribution, its significance is assessed by a permutation test. Depending on the data, a larger K has more power on slowly varying functions, while a smaller K has more power on rapidly oscillating functions. The suggested choice for the neighborhood structure of the dataset is n/20 [20]. CANOVA is related to local regression, such as K nearest neighbor (kNN) regression: it can be viewed as the analogue of a model fitness test for the kNN model, just as Pearson’s correlation coefficient can be viewed as a model fitness test for a linear regression model. The method reduces the algorithmic complexity to O(n log n + np) by ordering the response values according to the ordered predictor values, and it can easily explore non-linear correlation between two continuous variables.
Maximal information coefficient (MIC)
MIC is an equitable maximal information-based non-parametric exploration (MINE) statistic for identifying and classifying relationships. In addition to measuring association, MIC measures non-linear relations between two random variables and the degree of linear relation between variables that have a functional relationship. In general, with a sufficient sample size it captures all types of functional relationship, even those that are not well modelled. MIC assigns a score in the range 0 to 1 that measures the strength of a relationship: it assigns 0 to statistically independent variables and tends to 1 in probability for noiseless functional relationships. For a large data set with many variables (big data) that contains important and undiscovered relationships, MINE helps to identify and characterise structure in the data for variable selection or dimension reduction [21].
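The idea can be sketched as follows. This is a simplified illustration and not the implementation of Wang et al. [20]: it assumes the neighborhood of a point is the next K − 1 points in predictor order, uses the sum of squared response differences within neighborhoods as the statistic, and obtains the p-value by permutation. The data, the function names and the seed are hypothetical.

```python
import math
import random

def within_neighbourhood_ss(y_sorted, k):
    """Sum of squared differences between each response and its next k-1 neighbours in x-order."""
    n = len(y_sorted)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, min(i + k, n)):
            total += (y_sorted[i] - y_sorted[j]) ** 2
    return total

def canova_sketch(x, y, k=2, n_perm=500, seed=0):
    """Permutation p-value; a small statistic is evidence that similar x gives similar y."""
    rng = random.Random(seed)
    order = sorted(range(len(x)), key=lambda i: x[i])  # order responses by predictor value
    y_sorted = [y[i] for i in order]
    observed = within_neighbourhood_ss(y_sorted, k)
    shuffled = y_sorted[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any x-y relation
        if within_neighbourhood_ss(shuffled, k) <= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical non-monotonic data: y = sin(x) plus small noise
data_rng = random.Random(1)
x = [data_rng.uniform(0, 4 * math.pi) for _ in range(100)]
y = [math.sin(v) + data_rng.gauss(0, 0.1) for v in x]
stat, p = canova_sketch(x, y, k=5)  # k = n / 20
print(stat, p)
```

A linear correlation measure would find little in such data, while the neighborhood statistic is much smaller than under random ordering.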
Pearson correlation coefficient
The Pearson correlation coefficient is the most commonly used measure of two-way linear correlation; it is calculated by dividing the covariance of two variables by the product of their standard deviations. Its value, denoted \(r_{xy}\), lies between − 1 and 1. If the points \((x_i , y_i )\) lie on a perfect straight line with positive slope, \(r_{xy}=1\). If they lie on a perfect straight line with negative slope, \(r_{xy} = -\,1\). If there is no systematic relation between X and Y at all, \(r_{xy} \simeq 0\), and \(r_{xy}\) differs from zero only because of random variation in the sample points.
The coefficient of determination, which is the square of the Pearson correlation between a response and an explanatory variable \((R_{xy}^{2} = r_{xy}^{2})\), represents the fraction of the total variance around the mean value \(\bar{y}\) that is explained by the linear relation between x and y. Therefore, using \(R_{xy}^{2}\) as a variable ranking criterion ranks individual variables by the goodness of their linear fit. However, the Pearson correlation measures only linear dependency between variables [22].
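A small sketch makes the limiting cases concrete. The data are hypothetical, and the symmetric curve is included to show that a strong non-linear relation can still give \(r_{xy} = 0\).

```python
import math

def pearson_r(x, y):
    """Covariance divided by the product of standard deviations (the 1/(n-1) factors cancel)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_line = [2 * v + 1 for v in x]    # perfect line, positive slope
y_neg = [-v for v in x]            # perfect line, negative slope
y_sym = [(v - 3) ** 2 for v in x]  # exact but non-linear (symmetric) relation

r_line = pearson_r(x, y_line)
r_neg = pearson_r(x, y_neg)
r_sym = pearson_r(x, y_sym)
r_squared = r_line ** 2            # coefficient of determination for the linear case
print(r_line, r_neg, r_sym, r_squared)
```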
Spearman’s rank correlation coefficient
Spearman’s rank correlation coefficient is a rank-based, non-parametric measure of correlation that can capture non-linear monotonic relations. Its value lies between − 1 and 1 and is interpreted in the same way as the Pearson correlation coefficient applied to the ranked variables. Its alternative hypothesis states that the relation between the two variables corresponds to a monotonic function.
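The following sketch computes Spearman's coefficient as the Pearson correlation of the ranks, with tied values given their average rank. The example data are hypothetical: an exponential curve is monotonic but not linear, so the rank-based coefficient reaches 1 while the Pearson coefficient does not.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's coefficient: Pearson correlation applied to the ranked variables."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]  # monotonic, non-linear
print(spearman_rho(x, y), pearson_r(x, y))
```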
Stepwise variable selection
Backward elimination, forward selection, or stepwise selection can be used to select the variables in a model. Backward elimination starts with all variables and removes those with a P-value above the critical value until all remaining variables are significant. Forward selection starts with no variables and adds variables not yet in the model whose P-value is below the critical value until none of the remaining variables is significant. Stepwise selection combines the two: a variable added or removed earlier in the process can be reconsidered later, and the process chooses the collection of variables that maximises model fitness. Stepwise selection does not depend strictly on the P-value; rather, it considers the importance of the variable in the model, which makes the method more powerful in prediction. Hence, stepwise selection based on the minimum AIC criterion is used to measure the interaction effect of predictors on the response variable [23].
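A minimal sketch of bidirectional stepwise selection by AIC is given below. It is an illustration under stated assumptions, not the procedure used in the study: the AIC is the Gaussian form up to an additive constant, the least-squares fit is obtained from the normal equations, and the data, seed and variable names (x1, x2, x3) are hypothetical.

```python
import math
import random

def ols_rss(columns, y):
    """Residual sum of squares of the least-squares fit of y on an intercept plus `columns`."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations (X'X) b = X'y, solved by Gauss-Jordan elimination with pivoting
    a = [[sum(row[r] * row[c] for row in design) for c in range(p)]
         + [sum(row[r] * yi for row, yi in zip(design, y))] for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    beta = [a[r][p] / a[r][r] for r in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def aic(columns, y):
    """Gaussian AIC up to a constant: n * log(RSS / n) + 2 * (number of coefficients)."""
    n = len(y)
    return n * math.log(ols_rss(columns, y) / n) + 2 * (len(columns) + 1)

def stepwise_aic(predictors, y):
    """Add or drop one variable at a time for as long as the AIC keeps decreasing."""
    selected = []
    best = aic([], y)
    while True:
        candidates = []
        for name in predictors:
            if name in selected:
                trial = [v for v in selected if v != name]  # try dropping
            else:
                trial = selected + [name]                   # try adding
            candidates.append((aic([predictors[v] for v in trial], y), trial))
        score, trial = min(candidates, key=lambda c: c[0])
        if score >= best:
            return selected, best
        selected, best = trial, score

# Hypothetical data: y depends on x1 and x2 only
rng = random.Random(0)
n = 60
predictors = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [3 * a - 2 * b + rng.gauss(0, 0.5) for a, b in zip(predictors["x1"], predictors["x2"])]
selected, score = stepwise_aic(predictors, y)
print(selected, round(score, 2))
```

Interaction effects could be examined in the same way by adding product columns such as x1 * x2 to the candidate set.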
Lasso variable selection
The lasso minimises the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. The lasso helps to improve prediction accuracy by reducing the large variance of OLS estimates through shrinking some coefficients to zero. In this study, the lasso variable selection method is applied at the optimum lambda (which lies within one standard deviation of the minimum lambda) [24].
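Written out, the constrained problem described above takes the following form. The notation is introduced here for illustration: n observations \((y_i, x_{i1}, \ldots , x_{ik})\), coefficients \(\beta _j\), and a bound \(t \ge 0\).

\[
\hat{\beta }^{lasso} = \arg \min _{\beta _0, \beta } \sum _{i=1}^{n} \Big ( y_i - \beta _0 - \sum _{j=1}^{k} x_{ij}\beta _j \Big )^2 \quad \text{subject to} \quad \sum _{j=1}^{k} |\beta _j| \le t
\]

An equivalent penalised form, in which each bound t corresponds to a penalty \(\lambda \ge 0\), is

\[
\hat{\beta }^{lasso} = \arg \min _{\beta _0, \beta } \left\{ \sum _{i=1}^{n} \Big ( y_i - \beta _0 - \sum _{j=1}^{k} x_{ij}\beta _j \Big )^2 + \lambda \sum _{j=1}^{k} |\beta _j| \right\}
\]

A larger \(\lambda\) shrinks more coefficients exactly to zero, which is what makes the lasso usable for variable selection.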
Dimension reduction methods
Principal factor and factor score analysis
Principal component analysis describes the variance–covariance structure of a set of variables through a few uncorrelated new latent variables called principal components. However, the lack of correlation between principal components does not reflect the natural correlation present among the real variables they represent. Therefore, a method that allows relatively slight correlation between components, such as factor analysis, is preferable. Factor analysis can be applied to construct principal factors and factor scores once the number of components needed has been decided. That number can be chosen by considering the bend of the scree plot of the principal component variances, whether the variance (eigenvalue) of a principal component is greater than one, the proportion of the total variation accounted for by the principal components, and subject-matter considerations on the composition of the principal factors [1]. The factor model for the random vector \(\mathbf Y ' = [Y_1, Y_2, \ldots , Y_p]\) with mean vector \(\mu\) and covariance matrix \(\Sigma\) is given as follows:
\[\mathbf Y - \mu = \Lambda \mathbf F + \varepsilon \]
where \(\Lambda\) is a \(p \times k\) matrix of unknown constants called loadings, \(\mathbf F\) is a \(k \times 1\) vector of common factors, and \(\varepsilon\) is a \(p \times 1\) vector of specific factors whose covariance matrix is a \(p\times p\) diagonal matrix. The estimates needed from this model are the covariance between factors and variables, \(Cov(\mathbf F ,\mathbf Y )=\mathbf L\) or \(Cov( Y_i, F_j ) = l_{ij}\), where \(l_{ij}\) is the loading of variable i on factor j; the communality, \(h_i^2 = \sum _{j=1}^{k}l_{ij}^2\); and the uniqueness, \(\phi _i=Var(Y_i) - h_i^2\), for \(i = 1, 2, \ldots , p\) and \(j = 1, 2, \ldots , k\).
The \(i^{th}\) communality (\(h_i^2\)) is the portion of the variance of \(Y_i\) explained by the k common factors, and the \(i^{th}\) uniqueness (\(\phi _i\)) is the portion of \(Var (Y_i)\) explained by the \(i^{th}\) specific factor. Estimated principal factors are constructed as linear combinations of the variables and their corresponding loadings:
\[\hat{F}_j = \sum _{i=1}^{p} l_{ij} Y_i, \qquad j = 1, 2, \ldots , k\]
From the result of the factor model, the estimated factor scores, denoted here by \(\widehat{FS}_j\), are also constructed as linear combinations of the original variables that have a relatively large loading on the factor:
\[\widehat{FS}_j = \sum _{i=1}^{p} l_{ij} Y_i, \qquad j = 1, 2, \ldots , k\]
where \(l_{ij}=1\) if variable i has a relatively large loading on factor j, and \(l_{ij}=0\) otherwise [2].
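The quantities defined above can be illustrated with a short sketch. The loading matrix, the unit variances (standardised variables) and the cut-off used to decide what counts as a "relatively large" loading are all hypothetical assumptions.

```python
# Hypothetical loadings l_ij for p = 4 standardised variables on k = 2 common factors
loadings = [
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
    [0.1, 0.6],
]
variances = [1.0, 1.0, 1.0, 1.0]  # Var(Y_i); equal to 1 for standardised variables

# Communality h_i^2: sum of squared loadings of variable i over the k factors
communality = [sum(l ** 2 for l in row) for row in loadings]
# Uniqueness phi_i = Var(Y_i) - h_i^2
uniqueness = [v - h for v, h in zip(variances, communality)]

# Factor-score indicator: 1 where a variable loads relatively heavily on a factor
THRESHOLD = 0.5  # assumed cut-off for "relatively large"
indicator = [[1 if abs(l) >= THRESHOLD else 0 for l in row] for row in loadings]

def factor_scores(observation):
    """Sum, for each factor, the variables assigned to it for one observation."""
    k = len(indicator[0])
    return [sum(ind[j] * y for ind, y in zip(indicator, observation)) for j in range(k)]

print(communality, uniqueness, factor_scores([1.0, 2.0, 3.0, 4.0]))
```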
Model
Linear regression
For data consisting of a random response variable Y (number of employees in an enterprise) and k = 81 fixed explanatory variables \(X_1 , X_2 , \ldots , X_k\), with a sample of size n = 179, linear regression is used to estimate the parameters and to identify the influential factors that determine the number of employees in an enterprise. The relationship between Y and \(X_1 ,~ X_2 , \ldots , X_k\) is formulated as a linear model:
\[Y = \beta _0 + \beta _1 X_1 + \beta _2 X_2 + \cdots + \beta _k X_k + \epsilon \]
where \(\beta _0 , \beta _1 , \ldots , \beta _k\) are constants referred to as regression model coefficients and \(\epsilon\) is a random disturbance.
It is assumed that Y is approximately a linear function of the \(X's\), and \(\epsilon\) measures the discrepancy in that approximation or \(\epsilon\) contains no systematic information for determining Y that is not already captured by the \(X's\) [25].
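For reference, the same model can be written in matrix form, together with the ordinary least squares estimator and the adjusted R-squared used later to compare model fitness. The matrix notation is an assumption introduced here: \(\mathbf X\) is the \(n \times (k+1)\) design matrix with a leading column of ones, and \(\mathbf X'\mathbf X\) is assumed to be invertible.

\[
\mathbf Y = \mathbf X \beta + \epsilon , \qquad \hat{\beta } = (\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf Y , \qquad \bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}
\]

With n = 179 and k = 81, the penalty factor \((n-1)/(n-k-1)\) equals \(178/97 \approx 1.84\), so the adjusted R-squared discounts the fit noticeably when all 81 predictors are kept.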
Logistic regression
Enterprise development status is a binary response variable with measured values Y = 1 (achieved expected progress) or Y = 0 (not achieved expected progress), and it is modelled by a logistic regression model. This model describes the relationship between p(y) and the x’s when the random component has a binomial distribution, where \(0\le p(y) \le 1\).
The mean and variance of the binomial random component are np and \(np(1-p)\) respectively, where \(p = p(y = 1)\) is the probability that an enterprise achieved the expected progress.
Logistic regression makes no assumption about the distributions of the independent variables: they do not have to be normally distributed, linearly related, or of equal variance within each group. In this study, logistic regression is used to identify the influential factors among the suggested predictors of enterprise development status. The influence of the determinant factors on enterprise development status is assessed individually and component-wise. The model is as follows:
\[\log \left( \frac{p(y)}{1-p(y)}\right) = \beta _0 + X\beta \]
where X is a matrix of independent variables, principal factors, or factor scores in the model, \(\beta\) is the vector of model coefficients, and \(\beta _0\) is the model intercept [26].
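The link between the linear predictor \(\beta _0 + X\beta\) and the probability can be sketched as follows. The coefficient and predictor values are hypothetical and serve only to show that the logistic transformation always returns a value between 0 and 1.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))

def inverse_logit(eta):
    """Map a linear predictor beta_0 + x'beta to a probability between 0 and 1."""
    if eta >= 0:
        return 1 / (1 + math.exp(-eta))
    z = math.exp(eta)  # avoids overflow for very negative eta
    return z / (1 + z)

# Hypothetical intercept, coefficients and predictor values for one enterprise
beta_0 = -1.0
beta = [0.8, -0.5]
x = [2.0, 1.0]

eta = beta_0 + sum(b * v for b, v in zip(beta, x))  # linear predictor
p = inverse_logit(eta)                              # estimated P(Y = 1)
odds_ratio_first = math.exp(beta[0])                # odds multiplier per unit increase in x[0]
print(round(eta, 3), round(p, 4), round(odds_ratio_first, 4))
```

Exponentiating a coefficient gives the multiplicative change in the odds of achieving the expected progress for a one-unit increase in that predictor, holding the others fixed.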
Result and discussion
Linear regression of the number of employees on all 81 factors suggested in the literature (Appendix: Tables 12, 13, 14, 15, 16 and 17) reveals that only 10 variables are significant (h4, h3, IF4, IF8, Grouping, X15.29, X50.65, ed0, ed1, and \(emp\_male\)), with an adjusted R-squared of 0.9992. Similarly, logistic regression of the development status of an enterprise on the same factors (Appendix: Tables 12, 13, 14, 15, 16 and 17) indicates that none of the 81 predictors is significant. To address this problem, variable selection and dimension reduction methods are applied to identify the real predictors of a response by removing variable redundancy and the complexity of having a large number of variables.
Variable selection
The tests of the relation between the number of employees in an enterprise and the predictors, reported in Table 2, reveal that the number of employees is significantly related at the 95% confidence level to 40 of the 81 explanatory variables listed in Appendix: Tables 12, 13, 14, 15, 16 and 17. Specifically, as the number of employees in an enterprise increases, employment by gender increases proportionally; employment by education category also increases significantly, mainly through employees with primary education; and employment by age category increases significantly for the 30–49 and 50–65 categories. The number of employees aged 15 to 29, however, decreases as the number of employees in an enterprise increases. Creation of the enterprise by a group, specific education or training on entrepreneurship, and graduation from TVET are significantly and directly correlated with the growth of an enterprise’s employment. Having relations with entrepreneurs for advice, such as friends or anyone else in contact, is negatively correlated with the number of employees in an enterprise. The results also indicate that current capital and government investment policy motivation through land are significantly and directly correlated with the number of employees in an enterprise. The influence of religion, traditionalism (cultural tackle), problems related to legal licensing, telecommunication problems, and the lack of necessary and timely marketing information have a significant direct correlation with the number of employees in an enterprise. The problems of keeping up with the literature and of getting information from customers, suppliers, banks, and commercial cooperation become greater as the number of employees in an enterprise increases. The development status of an enterprise has a significant negative correlation with the number of employees in an enterprise.
In addition, starting capital, educational level, experience in self-employment, managerial experience, financial experience (financing the business), experience in the sector, firm duration, experience in business, corruption, the number of employees in the age category above 65, having entrepreneurs in the family, type of MSE (micro or small), and experience as an employee have a significant association with the number of employees in an enterprise.
CANOVA helps to detect relations between a continuous and a categorical variable: only CANOVA with k = 10 detects that the type of MSE has a significant correlation with the number of employees in an enterprise, and CANOVA has high power to detect the correlation between In5 (getting information from suppliers) and the number of employees in an enterprise. However, almost all the significant variables detected by CANOVA are also detected by the Pearson or Spearman’s correlation coefficient, mainly by Spearman’s correlation coefficient. MIC also detects, with high power, non-linear relations for some continuous variables (Currk, \(Emp_0\), X15.29, X30.49, \(emp\_male\), and \(emp\_Female\)).
The result of tests for the relation between the development status of an enterprise and explanatory variables indicated in Table 3 reveals that, the development status of enterprise is significantly related at 95% confidence level with 28 explanatory variables out of 81 predictors listed in Appendix: Tables 12, 13, 14, 15, 16 and 17. This result specifically suggested that, enterprise created by group, employer with age between 15 to 29, employer taking specific education or/and training on entrepreneurship, and employer graduate from TVET are significantly directly correlated with the development of an enterprise’s. The development status of an enterprise is directly significantly correlated with level of education, an enterprise with employer graduated from high school, collage or University. The influence of religion, and electric power or energy problem also increases with development status of an enterprise. The influence of availability of raw material, fear of failure, environmental conditions, problems related to the legal licensing are less on development of an enterprise. The development status of an enterprise have significant direct correlation with the current number of employers in an enterprise and even at the start-up. The development of an micro enterprise enterprise is better than small enterprise. There is also an evidence of starting a business in group could bring a better development than an individual owned business, similarly male owned enterprises are more successful. Government investment policy motivation by land has also direct significant correlation with development of an enterprise. So government investment policy motivation is helpful for success of an enterprise. 
Experience in the sector (your business), financial experience (financing the business), working according to a business plan, an employment growth goal (the desire to employ), managerial skills, and experience in business have a direct significant correlation with the development of an enterprise. Mainly, formal managerial skills and financial experience are significantly correlated with the development of an enterprise. In addition, one's own bad experience has a significant association with the development of an enterprise. The results indicate that only CANOVA with k = 2 finds that entrepreneurs' activeness in business services is significantly negatively correlated with the development status of an enterprise. MIC detected some non-linear relations with high power (Currk, MSEs, and Category). However, almost all significant variables detected by CANOVA are also detected by Pearson's or Spearman's correlation coefficient, mainly by Spearman's.
In conclusion, based on statistical power, the results of the association and correlation analysis suggest that CANOVA more efficiently detects non-linear or non-monotonic continuous–continuous and continuous-categorical relations. Spearman's correlation coefficient more efficiently detects monotonic continuous–continuous or continuous-categorical relationships. Pearson's correlation coefficient more efficiently detects the linear relation between continuous variables. MIC more efficiently detects non-linear or non-monotonic continuous–continuous relations. The Chi-square test of independence efficiently detects relations between a continuous and a continuous variable, and between a categorical and a categorical variable, but non-linear or non-monotonic relations between a continuous and a categorical variable are not well detected. On the other hand, the results of the stepwise and lasso variable selection methods in Table 5 show that 31 variables are detected as significant predictors of the number of employees in an enterprise, eleven of which are new predictors compared with the association and correlation results given in Table 2. The results of these methods in Table 7 also indicate that 21 variables are detected as significant predictors of the development status of an enterprise, eleven of which are new predictors compared with the association and correlation results given in Table 3. This is because association and correlation methods cannot detect relations that arise from interaction effects. Similarly, some non-causal relations between a predictor and the response that are not detected by the lasso and stepwise variable selection methods are detected by the correlation and association methods. Specifically, twenty new variables are selected in this way as predictors of the number of employees in an enterprise, and nineteen new variables as predictors of the development status of an enterprise.
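The distinction drawn above between linear, monotonic, and non-monotonic relations can be illustrated with a minimal sketch. It is not the analysis code used in this study; the data series are synthetic and the function names are hypothetical. It computes Pearson's coefficient directly and Spearman's coefficient as Pearson's coefficient on average ranks, using only the standard library.

```python
import math

def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic (assumed) data: a symmetric grid and three kinds of relation.
x = [i / 10 for i in range(-20, 21)]
linear = [2 * a + 1 for a in x]        # linear relation
monotone = [math.exp(a) for a in x]    # monotonic but non-linear
parabola = [a * a for a in x]          # non-monotonic

r_linear = pearson(x, linear)
r_monotone = pearson(x, monotone)
rho_monotone = spearman(x, monotone)
r_parabola = pearson(x, parabola)
rho_parabola = spearman(x, parabola)
```

On these series Pearson's coefficient is 1 for the linear relation and below 1 for the exponential one, Spearman's coefficient is 1 for the exponential relation, and both are close to zero for the parabola. The last case is the kind of non-monotonic dependence for which the text recommends CANOVA or MIC.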
Model result from selected variables
Linear regression
1. Factors affecting the number of employees in an enterprise are assessed based on the causal linear relation with the significantly related (correlated and/or associated) predictors in Table 2. Significant variables are selected by stepwise elimination with the minimum AIC criterion, and by the lasso variable selection method. Stepwise elimination yields fewer significant variables than lasso variable selection. However, each method makes its own contribution: stepwise elimination brings three new variables (ed1, ed3, h3) that are not significant under lasso, and the lasso method brings five new variables (h2, Category, \(Emp_0\), ed2, and number of employees aged 50 to 65) that are not significant under stepwise elimination. The variables selected by each method are modelled separately, and the result in Table 4 reveals that IF8, grouping, number of employees aged 15–29 and 30–49, \(emp\_male\), \(emp\_female\), and h4 are significant for both methods, whereas ed0, ed1, h3, and number of employees aged above 65 are significant only under stepwise elimination, and h2 and number of employees aged 50 to 65 are significant only under the lasso method. Finally, the variables selected by both methods are merged, and the resulting reduced model has a greater number of significant variables with equivalent model fitness, as indicated in Table 4. The significance of all variables included in the reduced model, unlike the lasso and stepwise selected variables, is an indication of lower multicollinearity between the incorporated variables. This implies that the predictors of the number of employees in an enterprise should be the variables selected in the reduced model.
2. Here, factors affecting the number of employees in an enterprise are assessed using all literature-suggested factors (Table 5) by regression methods (stepwise elimination and lasso variable selection). Significant factors are selected by stepwise elimination with the minimum AIC criterion, and by the lasso variable selection method at the optimum lambda (within one standard deviation of the minimum lambda). Unlike the result in Table 4 above, regression on the variables selected by stepwise elimination yields more significant variables than regression on the variables selected by the lasso method. However, each method makes its own contribution to variable selection. Stepwise elimination brings thirteen new variables (c3, c6, h3, IF1, s3, s4, In1, In2, In3, In7, In10, ed1, and ed3), five of which are not significant, but removing the insignificant variables (IF1, s3, s4, In7 and In10) reduces the multiple R-squared and adjusted R-squared from 0.9946 to 0.9942, and from 0.9937 to 0.9935, respectively. In addition, two significant variables, In1 and In3, become insignificant. These variables are therefore potential variables and have to stay in the model. On the other hand, the lasso method brings eight new variables (h2, h14, IF3, Category, \(Emp_0\), X30.49, ed2, \(emp\_female\)), of which only three are significant. Removing the insignificant variables (h14, IF3, Category, \(Emp_0\), and ed2) reduces the multiple R-squared and adjusted R-squared from 0.9938 to 0.9932, and from 0.9931 to 0.9928, respectively. However, no significant variable becomes insignificant due to the removal of those variables. This indicates that stepwise elimination captures the gain due to interaction effects but can result in multicollinearity, whereas the lasso method removes multicollinearity but does not capture the gain due to interaction effects.
Because the lasso method has the advantage of controlling multicollinearity and stepwise elimination has the advantage of considering interaction effects, the variables selected by both methods are merged, and the resulting reduced model has a greater number of significant variables with equivalent model fitness, as indicated in Table 5.
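How the lasso penalty removes weak predictors can be sketched with a small coordinate-descent solver. This is an illustration only, not the software used in this study. It assumes standardized predictors, no intercept, and the objective \((1/2n)\Vert y - X\beta \Vert^2 + \lambda \Vert \beta \Vert_1\); the design matrix, the response, and the value of lambda below are all hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0    # covariance of feature j with the partial residual
            scale = 0.0  # sum of squares of feature j
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                scale += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (scale / n)
    return beta

# Hypothetical orthogonal design with three predictors.
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [1, 1, 1, 1, -1, -1, -1, -1]
X = [list(row) for row in zip(x1, x2, x3)]

# The response depends strongly on x1, weakly on x2, and not at all on x3.
y = [3.0 * a + 0.1 * b for a, b in zip(x1, x2)]

beta = lasso_coordinate_descent(X, y, lam=0.5)
```

With this design the strong predictor is kept with a shrunken coefficient (about 2.5 instead of 3), while the weak and the irrelevant predictors are set exactly to zero. A larger lambda removes more variables, which is why the choice of lambda affects how many predictors the lasso method reports.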
Logistic regression
1. Factors affecting the development status of an enterprise are assessed based on the causal relation with the significantly related (correlated and/or associated) predictors in Table 3. Significant variables in the model are selected by stepwise elimination with the minimum AIC criterion, and by the lasso method at the minimum lambda. Stepwise elimination brings more variables at a lower AIC than the lasso method. However, each method makes its own contribution to variable selection: stepwise elimination brings 14 new variables, eight of which are significant (Grouping, IF8, StartK, IF9, IF5, s1, s2, and In4), and the lasso method brings six new variables (X15.29, ed2, h10, IF14, and s4). The variables selected by both methods are merged, and the resulting reduced model has a greater number of variables with equivalent model fitness, as indicated in Table 6. It shows that Grouping, IF8, IF10, CurrK, StartK, IF9, IF5, s1, MSEs, s2, In4, and f5 are significant factors for the development status of an enterprise, whereas ed1, Category, h2, h10, c11, In6, h3, and h4 are potential factors.
2. Factors affecting the development status of an enterprise are assessed using all literature-suggested factors (Table 7). Significant variables in the model are selected by stepwise elimination with the minimum AIC criterion, and by the lasso variable selection method at the minimum lambda. As a result, stepwise elimination brings more variables at a lower AIC than the lasso method. However, only the predictors selected by the lasso method are significant. The reduced model contains more variables with a lower AIC, but none of the variables are significant. Hence, lasso variable selection has better power.
In conclusion, comparing the reduced linear regression model of the variables selected by the association and correlation methods (Table 4) with that of the variables selected by the regression methods (Table 5) reveals that the former brings one new variable (\(emp\_male\)) and the latter brings eight new variables (IF6, X50.65, c3, c6, In1, In2, In3, and In7) with a greater adjusted R-squared. Therefore, based on the number of significant variables and on model fitness (adjusted R-squared), the variables selected by lasso and stepwise elimination, listed in Table 5, are taken as the predictors of the number of employees in an enterprise. Specifically, the number of employees in an enterprise has a significant causal relation with full self-employment, previous urban habitat, graduation from TVET, having taken specific education/training on entrepreneurship, having another income source, environmental conditions, religion, contact (possibly social) with entrepreneurs in networks, visiting bazaars, taking business courses, reading literature on business, getting information about business from commercial cooperation, working in MSEs as a group, employees who cannot read and write, employees who completed primary education, high female employment, a high number of employees aged between 15 and 29, 30 and 49, and above 65, and a low number of employees aged between 50 and 65.
On the other hand, for the categorical response variable "development status of an enterprise", the results in Tables 6 and 7 indicate that more significant variables are found by the association and correlation methods, whereas none of the variables are significant under the lasso and stepwise methods, which also have a somewhat higher AIC value (more information lost). Hence, the predictors of the development status of an enterprise are the variables listed in Table 6. Specifically, the development status of an enterprise has a significant causal relation with working in MSEs as a group, religion, telecommunication problems, traditionalism (cultural tackle), current capital, corruption, entrepreneurs in the family, entrepreneurs among friends, getting information from customers, government investment policy motivation by land, and the MSE having small-enterprise status. The development status of an enterprise is potentially related to employees with primary education, category of MSEs, years of experience in business, environmental conditions, educational level, graduation from TVET, specific education/training on entrepreneurship, and financial experience (financing the business).
Hence, lasso and stepwise variable selection methods are suggested for a continuous response variable, and association and correlation methods are suggested for a categorical response variable; alternatively, variable selection that combines the association, correlation, and regression methods can bring a better result.
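The model comparisons above rely on the adjusted R-squared, which discounts the multiple R-squared for the number of predictors. A minimal sketch of the standard formula follows; the sample size and predictor counts in the example are hypothetical, since the values behind the reported figures are not restated here.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical example: same R-squared, one extra predictor.
with_two_predictors = adjusted_r2(0.5, n_obs=11, n_predictors=2)
with_three_predictors = adjusted_r2(0.5, n_obs=11, n_predictors=3)
```

For a fixed R-squared, the adjusted value falls as predictors are added, so a predictor that does not raise R-squared enough lowers the adjusted value. This is the sense in which a merged model with "equivalent model fitness" and more significant variables is preferred.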
Dimension reduction
Exploratory factor analysis was applied using varimax rotation on principal components to reduce the variable dimension, in order to avoid the complexity of having a large number of variables without losing the needed information. Based on the results in Appendix: Tables 12, 13, 14, 15, 16 and 17, 11 principal components, each with a variance of at least 2 and together accounting for 50.8% of the total variation in the data set, were retained, considering both the subject matter and the bend point of the scree plot of principal components shown in Fig. 1. Factor elements with a score (loading) of at least 0.3 were then selected. Specifically, Factor 1 is related to human and starting capital, Factor 2 contrasts the potential input of an enterprise with influencing factors, Factor 3 contrasts an enterprise getting information from partners with an idolised enterprise, Factor 4 is related to knowledge of business gained mainly through training, education or courses, Factor 5 contrasts own business input with partner support, Factor 6 contrasts policy-related influencing factors with human capital, Factor 7 contrasts entrepreneurs' actions for the success of an enterprise with entrepreneurs' social resources, Factor 8 is related to the number of employees in an enterprise per category of gender, education, and age, Factor 9 contrasts own contribution with partners, Factor 10 contrasts entrepreneurs' nature with enterprise status, and Factor 11 contrasts the number of employees per category with entrepreneurs' potential.
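The two retention rules described above (keep components whose variance is at least 2, then keep the variables whose loading is at least 0.3) can be sketched as follows. The eigenvalues and loadings are hypothetical placeholders, not the values from the appendix tables, and applying the 0.3 cutoff to the absolute loading is an assumption made here so that negatively loading variables in the "contrast" factors are kept.

```python
def retain_components(eigenvalues, min_variance=2.0):
    """Return how many components pass the cutoff and their share of total variance."""
    total = sum(eigenvalues)
    kept = [ev for ev in eigenvalues if ev >= min_variance]
    return len(kept), sum(kept) / total

def salient_variables(loadings, cutoff=0.3):
    """Keep variables whose absolute loading on a factor reaches the cutoff."""
    return {name: value for name, value in loadings.items() if abs(value) >= cutoff}

# Hypothetical eigenvalues of five components.
n_kept, share = retain_components([4.0, 3.0, 2.0, 0.6, 0.4])

# Hypothetical loadings of four variables on one rotated factor.
factor_loadings = {"StartK": 0.72, "Emp_0": 0.41, "h2": -0.35, "In5": 0.12}
selected = salient_variables(factor_loadings)
```

In this toy example three components are retained and explain 90% of the variance, and In5 is dropped from the factor because its loading is below the cutoff. In the study itself the same kind of rule, together with the scree plot and subject-matter judgement, led to 11 components explaining 50.8% of the variation.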
Model result for dimension reduction
Linear regression
The linear regression of the number of employees in an enterprise on factor scores (Table 8) reveals that factor 5 (contrasts own business input with partner support), factor 6 (contrasts policy-related influencing factors with human capital), factor 8 (variables related to the number of employees in an enterprise per category of gender), factor 10 (contrasts entrepreneurs' nature with enterprise status), and factor 11 (contrasts the number of employees per category with entrepreneurs' potential) have a significant effect on the number of employees in an enterprise, and these factors explain 82% of the variation in the mean number of employees in an enterprise.
The linear regression of the number of employees in an enterprise on principal factors (Table 9) reveals that principal factor 5 (contrasts own business input with partner support), principal factor 6 (contrasts policy-related influencing factors with human capital), principal factor 7 (contrasts entrepreneurs' actions for the success of an enterprise with entrepreneurs' social resources), principal factor 9 (contrasts own contribution with partners), principal factor 10 (contrasts entrepreneurs' nature with enterprise status), and principal factor 11 (contrasts the number of employees per category with entrepreneurs' potential) are the significant factors, and they explain 85% of the variation in the mean number of employees in an enterprise.
The results of the regression analyses using factor scores and principal factors indicate that regression on principal factors gains more model fitness with one more factor. Although four factors are significant under both methods, factor 8 is significant only in the factor score based regression, and factors 7 and 9 are significant only in the principal factor based regression. Since principal factor based regression brings little gain in model fitness and has a more complex composition than factor score based regression (it uses all variables rather than only the selected factor elements, which makes it difficult to relate principal factors to real components), factor score based regression is preferable.
Logistic regression
The logistic regression of development status on factor scores (Table 10) reveals that factor 4 (related to knowledge of business gained mainly through training), factor 7 (contrasts entrepreneurs' actions for the success of an enterprise with entrepreneurs' social resources), and factor 10 (contrasts entrepreneurs' nature with enterprise status) are the significant factors, with an AIC of 183.16.
The logistic regression of development status on principal factors (Table 11) reveals that principal factor 2 (contrasts the potential input of an enterprise with influencing factors), principal factor 3 (contrasts an enterprise getting information from partners with an idolised enterprise), principal factor 8 (variables related to the number of employees in an enterprise per category of gender), principal factor 9 (contrasts own contribution with partners), principal factor 10 (contrasts entrepreneurs' nature with enterprise status), and principal factor 11 (contrasts the number of employees per category with entrepreneurs' potential) are the significant factors, with an AIC of 128.348.
The results of the logistic regression analyses using factor scores and principal factors indicate that logistic regression on principal factors brings more significant factors. Principal factor based logistic regression gives 6 significant factors with a comparatively lower AIC (128.348), whereas factor score based logistic regression gives 3 significant factors (AIC of 183.16). Hence, principal factor based logistic regression is suggested. Therefore, principal factors are applied in dimension reduction when the response variable is the development status of an enterprise, and factor scores are applied in dimension reduction when the response variable is the number of employees in an enterprise.
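The AIC comparison used above can be written out as a short sketch. The definition AIC = 2k - 2 ln(L) is standard; the function names are hypothetical, and the log-likelihood and parameter count in the first example are placeholders, since only the final AIC values are reported in the text.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def prefer_lower_aic(aic_by_model):
    """Return the name of the model with the smallest AIC."""
    return min(aic_by_model, key=aic_by_model.get)

# Placeholder values showing how the criterion is computed.
example_aic = aic(log_likelihood=-60.0, n_params=4)

# AIC values reported in the text for the two logistic regressions.
reported = {"factor_score": 183.16, "principal_factor": 128.348}
preferred = prefer_lower_aic(reported)
```

A lower AIC indicates less information lost for the number of parameters used, which is why the principal factor based logistic regression is preferred here.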
Conclusion
Regression analysis using all literature-suggested factors shows that none of the predictors of the development status of an enterprise are significant, and that only 10 of the 81 factors are significant predictors of the number of employees in an enterprise. Variable selection and dimension reduction methods are therefore applied to assess the real predictors of a response by removing variable redundancy and the complexity of having many variables. Variable selection is carried out using correlation and association methods, and regression (lasso and stepwise variable selection) methods. Variable selection using association and correlation methods, judged on statistical power, indicates that CANOVA more efficiently detects non-linear or non-monotonic correlation between continuous–continuous and continuous-categorical variables. As Wang et al. [20] indicate, the relation between continuous variables is well detected with more power, even if the number of significantly detected variables is smaller. Spearman's correlation coefficient more efficiently detects continuous–continuous and continuous–categorical monotonic correlation, and Pearson's correlation coefficient more efficiently detects the linear correlation between continuous variables; this result is supported by the literature [20, 22]. In addition, MIC more efficiently detects non-linear or non-monotonic relations between continuous variables [21]. Moreover, the Chi-square test of independence efficiently detects relations between a continuous and a continuous variable, and between a categorical and a categorical variable, but non-linear or non-monotonic relations between a continuous and a categorical variable are not well detected. Tsai et al. [27] also suggested Chi-square as a pre-processing step in data mining.
The results also reveal that relations between predictor and response that arise from interaction effects, and that are not detected by the correlation and association methods, are detected by the lasso and stepwise variable selection methods. Specifically, eleven new predictors of the number of employees in an enterprise and eleven new predictors of the development status of an enterprise are significantly detected only by the lasso and stepwise variable selection methods. Similarly, some non-causal relations between predictor and response that are not detected by the lasso and stepwise variable selection methods are detected by the correlation and association methods. Specifically, twenty new variables are significantly detected as predictors of the number of employees in an enterprise, and nineteen new variables as predictors of the development status of an enterprise, only by the correlation and association methods. In general, as a result of Tables 2 and 5 for the continuous response variable "number of employees in an enterprise", and Tables 3 and 7 for the categorical response variable "development status of an enterprise", 51 predictors of the number of employees in an enterprise and 40 predictors of the development status of an enterprise are significantly detected. The results in the literature [3, 4, 6, 7] support the view that the applied methodology is more general and efficient in grasping possible factors.
The results mainly indicate that regressing the response variable "number of employees in an enterprise" on the variables selected by the lasso and stepwise methods brings greater model fitness (based on adjusted R-squared) than regressing it on the variables selected by the association and correlation methods. Conversely, regressing the response variable "development status of an enterprise" on the variables selected by the association and correlation methods brings 12 significant variables, whereas none of the variables are significant under lasso and stepwise elimination. Hence, lasso and stepwise variable selection methods are suggested for the continuous response variable "number of employees in an enterprise", and association and correlation methods are suggested for the categorical response variable "development status of an enterprise"; alternatively, filtering variables by the regression, correlation, and association methods and merging them for further analysis is also suggested.
On the other hand, the results of principal factor based regression for the number of employees in an enterprise show that the gain in model fitness is small, and the composition more complex, compared with factor score based regression. However, the results of the logistic regression analyses for the development status of an enterprise using factor scores and principal factors indicate that logistic regression on principal factors brings more significant factors with less information lost. Therefore, principal factors are preferred and applied in dimension reduction for the categorical response variable "development status of an enterprise", and factor scores are preferred and applied in dimension reduction for the continuous response variable "number of employees in an enterprise".
The comparison of results from the variable selection and dimension reduction methods indicates that variable selection methods bring more gain in model fitness than dimension reduction methods. Hence, the suggested variable selection methods are preferred over dimension reduction methods; applying them to find predictors reveals the following results.
The number of employees in an enterprise has a significant causal relation with full self-employment, previous urban habitat, graduation from TVET, having taken specific education/training on entrepreneurship, having another income source, environmental conditions, religion, contact (possibly social) with entrepreneurs in networks, visiting bazaars, taking business courses, reading literature on business, getting information about business from commercial cooperation, working in MSEs as a group, employees who cannot read and write, employees who completed primary education, high female employment, a high number of employees aged between 15 and 29, 30 and 49, and above 65, and a low number of employees aged between 50 and 65.
The development status of an enterprise has a significant causal relation with working in MSEs as a group, religion, telecommunication problems, traditionalism (cultural tackle), current capital, corruption, entrepreneurs in the family, entrepreneurs among friends, getting information from customers, government investment policy motivation by land, and the MSE having small-enterprise status. The development status of an enterprise is potentially related to employees with primary education, category of MSEs, years of experience in business, environmental conditions, educational level, graduation from TVET, specific education/training on entrepreneurship, and financial experience (financing the business).
In general, the suggested variable selection methods are recommended when a small number of variables is studied, and the suggested dimension reduction methods are recommended for a large number of varied variables (the Big data case).
Future work
In this paper, measures of the relation between variables are suggested based on the nature of the variables. Relations due to interaction effects need a more efficient method than stepwise elimination, one that can consider the importance of each variable's interaction effect in addition to model improvement. Given the current need in Big data, a general, comprehensive variable filtering and selection method should be the subject of future work.
Abbreviations
AIC: Akaike information criterion
CANOVA: continuous analysis of variance
kNN: K nearest neighbour
MIC: maximal information coefficient
MINE: maximal information-based non-parametric exploration
MSEs: micro and small enterprises
MSME: micro, small, and medium enterprises
MoUDH: Ministry of Urban Development and Housing
OLS: ordinary least squares
PC: principal component
SMMEs: small, medium and micro enterprises
TVET: technical vocational educational training
References
Wubetie HT. Missing data management and statistical measurement of socio-economic status: application of big data. J Big Data. 2017;4(1):47.
Johnson RA, Wichern DW. Applied multivariate statistical analysis, vol. 4. 5th ed. NJ: Prentice hall Englewood Cliffs; 2002.
Micro and small enterprise development policy and strategy. Ministry of Urban Development and Housing (MoUDH), 2nd edition, Addis Ababa. 2012.
Bosma N, van Praag M, de Wit G. Determinants of successful entrepreneurship. Research report 0002/E.2000.
Coduras A, Saiz-Alvarez JM, Ruiz J. Measuring readiness for entrepreneurship: an information tool proposal. J Innovat Knowl. 2016;1:99–108.
The small, medium and micro enterprise sector of South Africa. Commissioned by the small enterprise development agency. Research Note 2016.
Assefa B, Zerfu A, Tekle B. Identifying key success factors and constraints in Ethiopia’s MSE development: an exploratory research. Addis Ababa: Ethiopian Development Research Institute; 2014.
Cochran WG. Sampling techniques. 3rd ed. New York: Wiley; 1977.
Lynch R, Jin Z. Exploring the institutional perspective on international business expansion: towards a more detailed conceptual framework. J Innovat Knowl. 2016;1:117–24.
Alves H, Ferreira JJ, Fernandes CI. Customer’s operant resources effects on co-creation activities. J Innovat Knowl. 2016;1:69–80.
Ozkan-Canbolat E, Beraha A. Configuration and innovation related network topology. J Innovat Knowl. 2016;1:91–8.
Pavel R. Social entrepreneurship and vulnerable groups. J Commun Posit Pract. 2011;2:59–77.
Federal Democratic Republic of Ethiopia Ministry of Trade and Industry. Micro and small enterprises development strategy. Addis Ababa: Federal Democratic Republic of Ethiopia Ministry of Trade and Industry; 1997.
Federal Micro and Small Enterprises Development Agency. FEMSEDA annual report. 2011.
Federal Micro and Small Enterprises development Agency. Micro and small enterprises development strategy, provision framework and methods of implementation. Addis Ababa: Federal Micro and Small Enterprises development Agency; 2011.
Federal Micro and Small Enterprises development Agency. Competitive business formation guideline. Addis Ababa: Federal Micro and Small Enterprises development Agency; 2012.
Government of the Federal Democratic Republic of Ethiopia. Micro and small enterprise development strategy, provision framework and methods of implementation. Addis Ababa: Government of the Federal Democratic Republic of Ethiopia; 2011.
The Federal Democratic Republic of Ethiopia Central Statistical Agency. Statistical report on the 2013 national labour force survey. Addis Ababa: The Federal Democratic Republic of Ethiopia Central Statistical Agency; 2014.
Klapper L, Amit R, Guillén MF. Entrepreneurship and firm formation across countries. In: Lerner J, Schoar A, editors. International differences in entrepreneurship. National Bureau of Economic Research Conference Report. Chicago: University of Chicago Press; 2010.
Wang Y, Li Y, Cao H, Xiong M, Shugart YY, Jin L. Efficient test for nonlinear dependence of two continuous variables. BMC Bioinform. 2015;16:260. https://doi.org/10.1186/s12859-015-0697-7.
Reshef D, Reshef Y, Finucane H, Grossman S, McVean G, Turnbaugh P, Lander E, Mitzenmacher M, Sabeti P. Detecting novel associations in large datasets. Science. 2011;334:6062.
Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
Faraway JJ. Practical regression and anova using R. 2002.
Tibshirani R. Regression shrinkage and selection via the lasso. Toronto: University of Toronto; 1994.
Chatterjee S, Hadi AS. Regression analysis by example. 4th ed. New York: Wiley; 2006.
Agresti A. Categorical data analysis. 3rd ed. New York: Wiley; 2013.
Tsai C-F, Chen M-Y. Variable selection by association rules for customer churn prediction of multimedia on demand. London: Elsevier; 2009. p. 0957–4174. https://doi.org/10.1016/j.eswa.2009.06.07.
Authors' contributions
This research was performed by TWH. The author read and approved the final manuscript.
Acknowledgements
The author extends his heartfelt gratitude to the anonymous reviewers for their careful reading of the manuscript and for their helpful comments, which improved the presentation of this work. The author also thanks the Debre Markos Micro and Small Enterprise Authority office for access to the MSEs data, and the respected enterprises and businessmen of Debre Markos.
Competing interests
The author declares no competing interests.
Availability of supporting data
All support data files are available.
Consent for publication
The author provides consent for publication of this research.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Wubetie, H.T. Application of variable selection and dimension reduction on predictors of MSE’s development. J Big Data 6, 17 (2019). https://doi.org/10.1186/s40537-018-0153-4
DOI: https://doi.org/10.1186/s40537-018-0153-4