This document collects statistical and data science interview questions covering topics such as the Central Limit Theorem, sampling methods, type I and type II errors, linear regression and its assumptions, statistical interactions, selection bias, non-Gaussian distributions, the binomial probability formula, statistical power, resampling methods, long and wide data formats, the normal distribution, correlation and covariance, point estimates and confidence intervals, A/B testing, p-values, and calculating probabilities over time intervals.
Question 1: What is the Central Limit Theorem and why is it important?
Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impossible. While we cannot obtain a height measurement from everyone in the population, we can still sample some people. The question then becomes: what can we say about the average height of the entire population given a single sample? The Central Limit Theorem addresses exactly this question.

Question 2: What is sampling? How many sampling methods do you know?
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points in order to identify patterns and trends in the larger data set being examined. There are two main types of sampling techniques: probability sampling and non-probability sampling.

Question 3: What is the difference between a type I and a type II error?
A type I error occurs when the null hypothesis is true but is rejected. A type II error occurs when the null hypothesis is false but erroneously fails to be rejected.

Question 4: What is linear regression?
Linear regression is a good tool for quick predictive analysis. For example, the price of a house depends on a myriad of factors, such as its size or its location. To see the relationship between these variables, we can build a linear regression, which fits the line of best fit between them and helps us conclude whether the factors have a positive or negative relationship.

Question 5: What are the assumptions required for linear regression?
There are four major assumptions: 1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data. 2. The errors or residuals of the data are normally distributed and independent from each other. 3. There is minimal multicollinearity between the explanatory variables. 4. Homoscedasticity: the variance around the regression line is the same for all values of the predictor variable.

Question 6: What is a statistical interaction?
Basically, an interaction occurs when the effect of one factor (input variable) on the dependent variable (output variable) differs among the levels of another factor.

Question 7: What is selection bias?
Selection (or "sampling") bias occurs in an "active" sense when the sample data gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from the analysis.

Question 8: What is an example of a data set with a non-Gaussian distribution?
The Gaussian distribution is part of the exponential family of distributions, but there are many more of them, in many cases with the same ease of use, and if the person doing the machine learning has a solid grounding in statistics, they can be used where appropriate.

Question 9: What is the binomial probability formula?
The binomial distribution gives the probability of each possible number of successes in n trials of independent events, where each event occurs with probability P. The binomial probability formula is:

b(x; n, P) = nCx * P^x * (1 - P)^(n - x)

Where: b = binomial probability, x = total number of "successes" (pass or fail, heads or tails, etc.), P = probability of success on an individual trial, n = number of trials.
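As a rough illustration of the formula, here is a short Python sketch; the values x = 3, n = 10 and P = 0.5 are made up for the example and are not from the text.

```python
from math import comb

def binomial_probability(x: int, n: int, p: float) -> float:
    """b(x; n, P) = nCx * P^x * (1 - P)^(n - x)"""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Illustrative values: probability of exactly 3 heads in 10 fair coin flips.
print(binomial_probability(x=3, n=10, p=0.5))  # ~0.1172

# Sanity check: probabilities over all possible x sum to 1.
print(sum(binomial_probability(x, 10, 0.5) for x in range(11)))  # ~1.0
```

The final line confirms that the probabilities over all possible values of x sum to 1, as any probability distribution must.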
Question 10: What is statistical power?
Wikipedia defines the statistical power, or sensitivity, of a binary hypothesis test as the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. Put another way, statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a type II error (concluding there is no effect when, in fact, there is one).

Question 11: Explain what resampling methods are and why they are useful. Also explain their limitations.
Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, not theory-driven, methodology based on repeated sampling within the same sample. Resampling refers to methods for doing one of the following: 1. Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of the available data (jackknifing) or by drawing randomly with replacement from a set of data points (bootstrapping). 2. Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests). 3. Validating models by using random subsets (bootstrapping, cross-validation).

Question 12: What is selection bias, why is it important and how can you avoid it?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split of four classes that actually occurred in relatively equal numbers in the population, then a given model may make the false assumption that class frequency is the determining predictive factor. Avoiding non-random samples is the best way to deal with bias; however, when this is impractical, techniques such as resampling, boosting, and weighting are strategies that can be introduced to help deal with the situation.

Question 13: What is the difference between "long" and "wide" format data?
In the wide format, a subject's repeated responses are in a single row, and each response is in a separate column. In the long format, each row is one time point per subject. You can recognize data in wide format by the fact that columns generally represent groups.

Question 14: What do you understand by the term normal distribution?
Data is usually distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, data may also be distributed around a central value without any bias to the left or right, reaching a normal distribution in the form of a bell-shaped curve.
[Figure: normal distribution shown as a bell curve.]
The random variables are distributed in the form of a symmetrical, bell-shaped curve. The properties of the normal distribution are as follows: unimodal (one mode); symmetrical (left and right halves are mirror images); bell-shaped (maximum height, the mode, at the mean); mean, mode, and median all located in the center; asymptotic.

Question 15: What is correlation and covariance in statistics?
Covariance and correlation are two mathematical concepts widely used in statistics. Both correlation and covariance establish the relationship and measure the dependency between two random variables. Though they are similar in mathematical terms, they are different from each other. Correlation: correlation is considered the best technique for measuring and estimating the quantitative relationship between two variables; it measures how strongly two variables are related. Covariance: covariance measures how two variables vary together; it indicates the extent to which two random variables change in tandem. It is a statistical term that explains the systematic relation between a pair of random variables, wherein a change in one variable is reciprocated by a corresponding change in the other.
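A minimal sketch of the distinction, assuming NumPy and some made-up data (hours studied vs. exam score, chosen only for illustration):

```python
import numpy as np

# Illustrative data (not from the text): hours studied vs. exam score.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 60, 68, 70, 75, 80])

# Covariance: how the two variables vary together (scale-dependent).
cov = np.cov(hours, score)[0, 1]

# Correlation: covariance rescaled by both standard deviations, always in [-1, 1].
corr = np.corrcoef(hours, score)[0, 1]

print(f"covariance  = {cov:.2f}")
print(f"correlation = {corr:.2f}")

# Rescaling one variable changes the covariance but not the correlation.
print(np.cov(hours * 10, score)[0, 1], np.corrcoef(hours * 10, score)[0, 1])
```

The last line shows why correlation is the more convenient measure of the strength of a relationship: it is unit-free, while covariance depends on the scale of the variables.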
Question 16: What is the difference between point estimates and confidence intervals?
Point estimation gives us a particular value as an estimate of a population parameter. The method of moments and maximum likelihood estimation are used to derive point estimators for population parameters. A confidence interval gives us a range of values that is likely to contain the population parameter. The confidence interval is generally preferred, as it tells us how likely this interval is to contain the population parameter. This likeliness, or probability, is called the confidence level or confidence coefficient and is represented by 1 - alpha, where alpha is the level of significance.

Question 17: What is the goal of A/B testing?
A/B testing is hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. A/B testing is a fantastic method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads. An example could be measuring the click-through rate for a banner ad.

Question 18: What is a p-value?
When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. A p-value is a number between 0 and 1; its value indicates the strength of the evidence. The claim that is on trial is called the null hypothesis. A low p-value (≤ 0.05) indicates evidence against the null hypothesis, which means we can reject the null hypothesis. A high p-value (≥ 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it. A p-value right around 0.05 is marginal and could go either way. Put another way: high p-values mean your data are likely under a true null; low p-values mean your data are unlikely under a true null.

Question 19: In any 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the probability that you see at least one shooting star in the period of an hour?
The probability of not seeing any shooting star in 15 minutes is 1 - P(seeing at least one shooting star) = 1 - 0.2 = 0.8. The probability of not seeing any shooting star in one hour is (0.8)^4 = 0.4096. The probability of seeing at least one shooting star in one hour is therefore 1 - P(not seeing any star) = 1 - 0.4096 = 0.5904.

Question 20: How can you generate a random number between 1 and 7 with only a die?
A die has six sides, numbered 1 to 6, so there is no way to get seven equally likely outcomes from a single roll. If we roll the die twice and consider the event of two rolls, we have 36 different outcomes. To get seven equally likely outcomes we have to reduce these 36 outcomes to a number divisible by 7. We can thus consider only 35 outcomes and exclude the remaining one. A simple scenario is to exclude the combination (6, 6), i.e., to roll the die again if 6 appears twice. All the remaining combinations from (1, 1) to (6, 5) can be divided into 7 groups of 5 each. This way all seven outcomes are equally likely.
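The two-roll scheme above is a form of rejection sampling. Below is a small Python sketch of it; the helper names roll_die and rand7_from_die are my own, and the grouping uses remainders mod 7 rather than consecutive blocks of five, which is an equivalent way of splitting the 35 accepted outcomes into 7 groups of 5.

```python
import random
from collections import Counter

def roll_die() -> int:
    """Simulate one roll of a fair six-sided die."""
    return random.randint(1, 6)

def rand7_from_die() -> int:
    """Return a uniform integer in 1..7 using only die rolls (rejection sampling)."""
    while True:
        # Encode two rolls as a number in 0..35.
        outcome = (roll_die() - 1) * 6 + (roll_die() - 1)
        if outcome < 35:            # reject the single leftover outcome (6, 6)
            return outcome % 7 + 1  # 35 accepted outcomes split into 7 groups of 5

# Quick check that all seven values appear roughly equally often.
print(Counter(rand7_from_die() for _ in range(70000)))
```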
Question 21: A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?
In the case of two children, there are four equally likely possibilities: BB, BG, GB and GG, where B = boy, G = girl, and the first letter denotes the first child. From the question, we can exclude the first case of BB. Thus, from the remaining three possibilities of BG, GB and GG, we have to find the probability of the case with two girls. Thus, P(two girls given at least one girl) = 1/3.

Question 22: A jar has 1000 coins, of which 999 are fair and 1 is double-headed. Pick a coin at random and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head?
There are two ways of choosing the coin: pick a fair coin or pick the double-headed one. The probability of selecting a fair coin is 999/1000 = 0.999, and the probability of selecting the unfair coin is 1/1000 = 0.001. The probability of seeing 10 heads in a row is P(fair coin and 10 heads) + P(unfair coin and 10 heads): P(A) = 0.999 * (1/2)^10 = 0.999 * (1/1024) ≈ 0.000976, and P(B) = 0.001 * 1 = 0.001. So P(A / (A + B)) = 0.000976 / (0.000976 + 0.001) ≈ 0.4939 and P(B / (A + B)) = 0.001 / 0.001976 ≈ 0.5061. The probability that the next toss is a head = P(A / (A + B)) * 0.5 + P(B / (A + B)) * 1 = 0.4939 * 0.5 + 0.5061 ≈ 0.7531.

Question 23: What do you understand by the statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (logistic regression, SVM, random forest, etc.). Sensitivity is simply "predicted true events / total actual events", where true events are the events that were actually true and that the model also predicted as true. The calculation of sensitivity is straightforward: Sensitivity = (True Positives) / (Positives in the Actual Dependent Variable).

Question 24: Why is resampling done?
Resampling is done in any of these cases: estimating the accuracy of sample statistics by using subsets of the accessible data or drawing randomly with replacement from a set of data points; substituting labels on data points when performing significance tests; validating models by using random subsets (bootstrapping, cross-validation).

Question 25: What are the differences between overfitting and underfitting?
In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data so as to be able to make reliable predictions on general, unseen data. In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted has poor predictive performance, as it overreacts to minor fluctuations in the training data. Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data; it would occur, for example, when fitting a linear model to non-linear data. Such a model would also have poor predictive performance.
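As a rough, self-contained illustration of the two failure modes, here is a small NumPy sketch; the noisy quadratic data and the polynomial degrees (1, 2, 10) are made up for the example, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a noisy quadratic relationship.
x_train = np.linspace(-1, 1, 20)
y_train = 2 * x_train**2 + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(-1, 1, 200)
y_test = 2 * x_test**2  # noise-free ground truth for evaluation

for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

A degree-1 fit typically shows high error on both sets (underfitting), while a very high-degree fit drives the training error down but usually does worse on held-out data (overfitting); the degree-2 fit generally generalizes best here.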
Question 26: How can you combat overfitting and underfitting?
To combat overfitting and underfitting, you can resample the data to estimate the model accuracy (k-fold cross-validation) and use a validation dataset to evaluate the model.

Question 27: What is regularisation? Why is it useful?
Regularisation is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a penalty term, a constant multiple of the norm of the weight vector, to the loss; the norm used is often the L1 norm (lasso) or the L2 norm (ridge). The model predictions should then minimise the loss function calculated on the regularised training set.

Question 28: What is the law of large numbers?
It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that the sample mean, the sample variance and the sample standard deviation converge to the quantities they are trying to estimate.

Question 29: What are confounding variables?
In statistics, a confounder is a variable that influences both the dependent variable and the independent variable. For example, if you are researching whether a lack of exercise leads to weight gain, lack of exercise is the independent variable and weight gain is the dependent variable. A confounding variable here would be any other variable that affects both of these, such as the age of the subject.

Question 30: What are the types of biases that can occur during sampling?
Selection bias, undercoverage bias and survivorship bias.

Question 31: What is survivorship bias?
It is the logical error of focusing on the aspects that support surviving some process and casually overlooking those that did not, because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

Question 32: What is selection bias?
Selection bias occurs when the sample obtained is not representative of the population intended to be analysed.

Question 33: Explain how a ROC curve works.
The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It is often used as a proxy for the trade-off between the sensitivity (true positive rate) and the false positive rate.

Question 34: What is TF-IDF vectorization?
TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The TF-IDF value increases proportionally with the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

Question 35: Why do we generally use the softmax non-linearity as the last operation in a network?
It is because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive, negative, whatever; there are no constraints). Then the i-th component of softmax(x) is:

Softmax(x)_i = exp(x_i) / Σ_j exp(x_j)

It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.
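A minimal NumPy sketch of this definition (the logits below are made up for the example); subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Softmax(x)_i = exp(x_i) / sum_j exp(x_j), computed in a numerically stable way."""
    shifted = x - np.max(x)   # shifting by a constant leaves the ratio unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

# Illustrative logits (not from the text).
logits = np.array([2.0, 1.0, -3.0])
probs = softmax(logits)
print(probs)        # approximately [0.7275, 0.2676, 0.0049]
print(probs.sum())  # 1.0: a valid probability distribution
```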
Question 37: Python or R: which one would you prefer for text analytics?
We would prefer Python for the following reasons: Python has the pandas library, which provides easy-to-use data structures and high-performance data analysis tools; R is more suitable for machine learning than for text analysis; and Python performs faster for all types of text analytics.

Question 38: How does data cleaning play a vital role in the analysis?
Data cleaning helps the analysis because: cleaning data from multiple sources helps transform it into a format that data analysts or data scientists can work with; data cleaning helps increase the accuracy of a machine learning model; it is a cumbersome process, because as the number of data sources increases, the time taken to clean the data increases sharply due to the number of sources and the volume of data they generate; and it can take up to 80% of the total time, making it a critical part of the analysis task.

Question 39: Differentiate between univariate, bivariate, and multivariate analysis.
These descriptive statistical analysis techniques are differentiated by the number of variables involved at a given point in time. Univariate analysis involves only one variable; for example, a pie chart of sales by territory involves a single variable, so the analysis can be referred to as univariate. Bivariate analysis attempts to understand the relationship between two variables at a time, as in a scatterplot; for example, analysing the volume of sales against spending is an example of bivariate analysis. Multivariate analysis deals with the study of more than two variables to understand their effect on the responses.

Question 40: Explain the star schema.
It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.

Question 41: What is cluster sampling?
Cluster sampling is a technique used when it becomes difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. For example, a researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities), and then select a number of clusters for his research through simple or systematic random sampling.

Question 42: What is systematic sampling?
Systematic sampling is a statistical technique in which elements are selected from an ordered sampling frame. The list is traversed in a circular manner, so once you reach the end of the list, you continue from the top again. The classic example of systematic sampling is the equal-probability method, where every k-th element is selected.

Question 43: What are eigenvectors and eigenvalues?
Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. An eigenvalue can be thought of as the strength of the transformation in the direction of its eigenvector, or the factor by which the compression or stretching occurs.
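A short NumPy sketch of this idea, using made-up correlated 2-D data (not from the text): the eigenvectors of the covariance matrix give the principal directions of the data, and the eigenvalues give the variance along those directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 500 correlated 2-D points.
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
data = np.column_stack([x, y])

# Covariance matrix of the data (2 x 2).
cov = np.cov(data, rowvar=False)

# Eigen-decomposition; eigh is appropriate because a covariance matrix is symmetric.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

print("eigenvalues :", eigenvalues)      # variance along each principal direction
print("eigenvectors:\n", eigenvectors)   # columns are the principal directions
```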
Question 44: Can you cite some examples where a false positive is more important than a false negative?
Let us first understand what false positives and false negatives are. False positives are the cases where you wrongly classify a non-event as an event, a.k.a. a type I error. False negatives are the cases where you wrongly classify events as non-events, a.k.a. a type II error.
Example 1: In the medical field, assume you have to give chemotherapy to patients. Assume a patient comes to the hospital and tests positive for cancer based on the lab prediction, but he actually does not have cancer. This is a case of a false positive. It is extremely dangerous to start chemotherapy on this patient when he does not actually have cancer: in the absence of cancerous cells, chemotherapy will damage his normal healthy cells and might lead to severe illness, even cancer.
Example 2: Let's say an e-commerce company decides to give a $1,000 gift voucher to the customers whom they assume will purchase at least $10,000 worth of items. They send the voucher mail directly to 100 customers without any minimum purchase condition, because they expect to make at least a 20% profit on items sold above $10,000. The issue arises if the $1,000 gift vouchers are sent to customers who have not actually purchased anything but were wrongly predicted to make $10,000 worth of purchases.

Question 45: Can you cite some examples where a false negative is more important than a false positive?
Example 1: Assume there is an airport "A" that has received high-security threats, and based on certain characteristics it identifies whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan only the passengers predicted as risk positives by their predictive model. What will happen if a true threat passenger is flagged as non-threat by the airport's model?
Example 2: What if a jury or judge decides to let a criminal go free?
Example 3: What if you rejected marrying a very good person based on your predictive model, and you happen to meet him or her after a few years and realize that you had a false negative?

Question 46: Can you cite some examples where both false positives and false negatives are equally important?
In the banking industry, giving loans is the primary source of income, but at the same time, if your repayment rate is not good, you will not make any profit and will instead risk huge losses. Banks do not want to lose good customers, and at the same time, they do not want to acquire bad customers. In this scenario, both the false positives and the false negatives become very important to measure.

Question 47: Can you explain the difference between a validation set and a test set?
A validation set can be considered part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. A test set, on the other hand, is used for testing or evaluating the performance of a trained machine learning model. In simple terms: the training set is used to fit the parameters (i.e., the weights), the validation set is used for model and parameter selection, and the test set is used to assess the performance of the model, i.e., to evaluate its predictive power and generalization.
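A minimal sketch of carving out the three sets with scikit-learn's train_test_split; the data, split ratios, and random seeds below are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1000 samples, 5 features, a simple binary label.
X = np.random.default_rng(0).normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First carve off the test set, which is only touched for the final evaluation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split the remainder into training and validation sets;
# the validation set is used for parameter selection and early stopping.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```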
Question 48: Explain cross-validation.
Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent dataset. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (i.e., a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
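A short sketch of k-fold cross-validation with scikit-learn; the iris dataset and logistic regression model are arbitrary choices for the example, not from the text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative setup: 5-fold cross-validation on the iris dataset.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each fold is held out once as the validation set while the model is trained
# on the remaining folds; the scores estimate generalization accuracy.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", np.round(scores, 3))
print("mean accuracy  :", scores.mean())
```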