Machine Learning Bits
1. Which of the following is a widely used and effective machine learning algorithm based on the
idea of bagging?
Decision Tree
Regression
Classification
Random Forest D
2. To find the minimum or the maximum of a function, we set the gradient to zero because:
Both A and B
3. The most widely used metrics and tools to assess a classification model are:
Confusion matrix
Cost-sensitive accuracy
Both A and B
Factor analysis
To judge how the trained model performs outside the sample on test data
Both A and B
To remove stationarity
Both A and B
9. When performing regression or classification, which of the following is the correct way to
preprocess the data?
Both A and B
Using too large a value of lambda can cause your hypothesis to underfit the data.
Using too large a value of lambda can cause your hypothesis to overfit the data.
Using a very large value of lambda cannot hurt the performance of your hypothesis.
14. How can you prevent a clustering algorithm from getting stuck in bad local optima?
Both A and B
15. Which of the following techniques can be used for normalization in text mining?
Stemming
Lemmatization
Both A and B D
16. In which of the following cases will K-means clustering fail to give good results? 1) Data points
with outlier 2) Data points with different densities 3) Data points with non convex shapes
1 and 2
2 and 3
1, 2, and 3
1 and 3 C
17. Which of the following is a reasonable way to select the number of principal components "k"?
Choose k to be the smallest value so that at least 99% of the variance is retained.
18. You run gradient descent for 15 iterations with α = 0.3 and compute J(θ) after each iteration.
You find that the value of J(θ) decreases quickly and then levels off. Based on this, which of
the following conclusions seems most plausible?
Rather than using the current value of α, use a larger value of α (say α = 1.0)
Rather than using the current value of α, use a smaller value of α (say α = 0.1)
It is used to parse sentences to derive their most likely syntax tree structures.
20. Suppose you have trained a logistic regression classifier and, for a new example x, it outputs a
prediction h_θ(x) = 0.2. This means
PCA
K-Means
None of the above
Both A and B A
22. Which of the following statement(s) is/are true for Gradient Descent (GD) and Stochastic
Gradient Descent (SGD)? 1. In GD and SGD, you update a set of parameters in an iterative manner
to minimize the error function. 2. In SGD, you have to run through all the samples in your training
set for a single update of a parameter in each iteration. 3. In GD, you either use the entire data or a
subset of training data to update a parameter in each iteration.
Only 1
Only 2
Only 3
1 and 2 A
23. Which of the following hyperparameter(s), when increased, may cause a random forest to overfit
the data? 1. Number of Trees 2. Depth of Tree 3. Learning Rate
24. Below are the 8 actual values of the target variable in the train file: [0,0,0,1,1,1,1,1]. What is the
entropy of the target variable?
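Worked note for question 24: a minimal sketch of the Shannon entropy calculation, assuming entropy is measured in bits (log base 2). The helper name `entropy` is ours and not part of the question.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over the observed classes.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

target = [0, 0, 0, 1, 1, 1, 1, 1]
# p(0) = 3/8 and p(1) = 5/8, so H = -(3/8)log2(3/8) - (5/8)log2(5/8)
print(round(entropy(target), 4))  # 0.9544
```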
25. Let’s say you are working with categorical feature(s) and you have not looked at the distribution
of the categorical variable in the test data. You want to apply one hot encoding (OHE) on the
categorical feature(s). What challenges may you face if you have applied OHE on a categorical
variable of the train dataset?
All categories of categorical variable are not present in the test dataset
Both A and B D
26. Let’s say, you are using activation function X in hidden layers of neural network. At a particular
neuron for any given input, you get the output as “-0.0001” Which of the following activation
function could X represent?
ReLU
tanh
SIGMOID
None of these B
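Worked note for question 26: a small check of the output ranges of the three listed activations. ReLU outputs are never negative and sigmoid outputs lie strictly between 0 and 1, so of the three only tanh can return a value such as -0.0001. The helper names below are ours.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate each activation on a few negative and positive inputs.
inputs = [-5.0, -0.0001, 0.0, 0.0001, 5.0]
for z in inputs:
    print(z, relu(z), sigmoid(z), math.tanh(z))

# Only tanh goes below zero for a negative input.
negative_capable = {
    "relu": any(relu(z) < 0 for z in inputs),
    "sigmoid": any(sigmoid(z) < 0 for z in inputs),
    "tanh": any(math.tanh(z) < 0 for z in inputs),
}
print(negative_capable)
```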
TRUE FALSE B
28. Which of the following statements is/are true about “Type-1” and “Type-2” errors?
3. Type-1 error occurs when we reject a null hypothesis when it is actually true.
Only 1
Only 2
Only 3
1 and 3 D
29. Which of the following is/are important step(s) to pre-process the text in NLP-based
projects? 1. Stemming 2. Stop word removal 3. Object Standardization
1 and 2
1 and 3
2 and 3
1,2 and 3 D
30. Suppose you want to project high dimensional data into lower dimensions. The two most
famous dimensionality reduction algorithms used here are PCA and t-SNE. Let’s say you have
applied both algorithms respectively on data “X” and you got the datasets “X_projected_PCA” and
“X_projected_tSNE”. Which of the following statements is true for “X_projected_PCA” and
“X_projected_tSNE”?
31. Adding a non-important feature to a linear regression model may result in:
1. Increase in R-square 2. Decrease in R-square
Only 1 is correct
Only 2 is correct
Either 1 or 2
None of these A
32. Suppose you are given three variables X, Y and Z. The Pearson correlation coefficients for (X,
Y), (Y, Z) and (X, Z) are C1, C2 and C3 respectively. Now, you have added 2 to all values of X (i.e.
new values become X+2), subtracted 2 from all values of Y (i.e. new values are Y-2), and Z
remains the same. The new coefficients for (X, Y), (Y, Z) and (X, Z) are given by D1, D2 and D3
respectively. How do the values of D1, D2 and D3 relate to C1, C2 and C3?
D1 = C1, D2 = C2, D3 = C3
Cannot be determined C
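Worked note for question 32: Pearson correlation is computed from deviations about the mean, so adding or subtracting a constant leaves it unchanged. The sketch below checks this on made-up values; the sample data and the helper `pearson` are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sample values, chosen only for illustration.
X = [1.0, 2.0, 4.0, 7.0]
Y = [2.0, 1.0, 5.0, 6.0]

c1 = pearson(X, Y)
d1 = pearson([x + 2 for x in X], [y - 2 for y in Y])  # shifted copies
print(c1, d1)  # the two values agree
```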
33. Imagine you are solving a classification problem with highly imbalanced classes. The majority
class is observed 99% of the time in the training data. Your model has 99% accuracy after taking the
predictions on test data. Which of the following is true in such a case?
3. Precision and recall metrics are good for imbalanced class problems.
4. Precision and recall metrics aren’t good for imbalanced class problems.
1 and 3
1 and 4
2 and 3
2 and 4 A
34. In ensemble learning, you aggregate the predictions of weak learners, so that an ensemble of
these models will give a better prediction than the individual models. Which of the
following statements is/are true for weak learners used in an ensemble model?
2. They have high bias, so they cannot solve complex learning problems
35. Which of the following options is/are true for K-fold cross-validation? 1. Increase in K will result in
higher time required to cross validate the result. 2. Higher values of K will result in higher confidence on
the cross-validation result as compared to lower values of K. 3. If K=N, then it is called Leave one out
cross validation, where N is the number of observations.
1 and 2
2 and 3
1 and 3
1, 2 and 3 D
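Worked note for question 35: a minimal sketch of how K-fold splits index a dataset, written without any library so the mechanics are visible. The function `kfold_indices` is a hypothetical helper, not a library API; it uses contiguous folds and no shuffling. With K = N every test fold holds exactly one observation, which is leave-one-out cross-validation, and the number of model fits grows with K.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds.

    Returns a list of (train_indices, test_indices) pairs.
    """
    # Spread the remainder over the first n % k folds.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds

three_fold = kfold_indices(6, 3)     # 3 fits, 2 test points each
leave_one_out = kfold_indices(6, 6)  # 6 fits, 1 test point each
print(len(three_fold), len(leave_one_out))
```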
Not possible
None of these A
37. It is possible to construct a k-NN classification algorithm based on this black box alone.
TRUE FALSE A
38. Instead of using 1-NN black box we want to use the j-NN (j>1) algorithm as black box.
Which of the following options is correct for finding k-NN using j-NN? 1. J must be a proper
1 2 3 4 A
39. Which of the following values of K will have the least leave-one-out cross validation accuracy?
1NN
3NN
4NN
40. Suppose we have a dataset which can be trained with 100% accuracy with the help of a
decision tree of depth 6. Now consider the points below and choose the option based
on these points. Note: All other hyperparameters are the same and other factors are not affected.
1. Depth 4 will have high bias and low variance 2. Depth 4 will have low bias and low
variance
Only 1
Only 2
Both 1 and 2
41. Which of the following options can be used to get global minima in the k-Means algorithm? 1. Try to run
the algorithm for different centroid initializations 2. Adjust the number of iterations 3. Find out the optimal number
of clusters
2 and 3
1 and 3
1 and 2
All of above D
42. For which of the following hyperparameters is a higher value better for the decision tree algorithm?
1. Number of samples used for split 2. Depth of tree 3. Samples for leaf
1 and 2
2 and 3
1 and 3
Can’t say D
43 What is the dimension of output feature map when you are using the given parameters.
44 What is the dimensions of output feature map when you are using following parameters.
45. The k-NN algorithm does more computation at test time than at train time.
TRUE FALSE A
none of these C
47. Which of the following statements is true about the k-NN algorithm? 1. k-NN performs much better if all
of the data have the same scale 2. k-NN works well with a small number of input variables (p), but
struggles when the number of inputs is very large 3. k-NN makes no assumptions about the functional form
of the problem being solved
1 and 2
1 and 3
Only 1
48. Which of the following machine learning algorithms can be used for imputing missing values of both
categorical and continuous variables?
K-NN
Linear Regression
Logistic Regression A
49. Which of the following is true about Manhattan distance?
None of these A
50. Which of the following distance measures do we use in the case of categorical variables in k-NN?
1. Hamming Distance 2. Euclidean Distance 3. Manhattan Distance
1 2 3 4 A
51. Which of the following will be the Euclidean distance between the two data points A(1,3) and B(2,3)?
1 2 4 8 A
52. Which of the following will be the Manhattan distance between the two data points A(1,3) and B(2,3)?
1 2 4 8 A
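Worked note for questions 51 and 52: both distances can be checked directly. Because A(1,3) and B(2,3) differ in only one coordinate, the Euclidean and Manhattan distances coincide. The helper names are ours.

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

A, B = (1, 3), (2, 3)
print(euclidean(A, B))  # 1.0
print(manhattan(A, B))  # 1
```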
Can’t say
None of these A
Can’t say
None of these B
55. When you find noise in data, which of the following options would you consider in k-NN?
None of these A
56. In k-NN it is very likely to overfit due to the curse of dimensionality. Which of the following options
would you consider to handle such a problem? 1. Dimensionality Reduction 2. Feature selection
57. Which of the following is/are true about Random Forest and Gradient Boosting ensemble methods?
1. Both methods can be used for classification tasks 2. Random Forest is used for classification whereas
Gradient Boosting is used for regression tasks 3. Random Forest is used for regression whereas Gradient
Boosting is used for classification tasks 4. Both methods can be used for regression tasks
1 2 4 1 AND 4 D
58. Which of the following algorithms is not an example of an ensemble learning algorithm?
Random Forest
Decision Trees
Extra Trees
Gradient Boosting B
59. Suppose you are using a bagging-based algorithm, say a Random Forest, in model building. Which of
the following can be true? 1. Number of trees should be as large as possible 2. You will have
interpretability after using Random Forest
60. A _________ is a decision support tool that uses a tree-like graph or model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility.
Decision tree
Graphs
Trees
Neural Networks A
Worst, best and expected values can be determined for different scenarios
64. Which of the following is true for neural networks? (i) The training time depends on the size of the
network. (ii) Neural networks can be simulated on a conventional computer. (iii) Artificial neurons are
identical in operation to biological ones.
(ii) is true
Hill-climbing search
Depth-first search
Breadth-first search B
Literal
Temporal model
Reality model
Probability model
All of the mentioned A
Neural network
Random Forest
k-Nearest neighbor
69. Increase in size of a convolutional kernel would necessarily increase the performance of a
convolutional neural network.
TRUE FALSE B
70. Which of the following categories would be suitable for this type of problem?
Fine tune only the last couple of layers and change the last layer (classification layer) to
regression layer
Freeze all the layers except the last, re-train the last layer
None of these A
71. Suppose you have 5 convolutional kernels of size 7 x 7 with zero padding and stride 1 in the first layer
of a convolutional neural network. You pass an input of dimension 224 x 224 x 3 through this layer. What
are the dimensions of the data which the next layer will receive?
217 x 217 x 3
217 x 217 x 8
218 x 218 x 5
220 x 220 x 7 C
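Worked note for question 71: a sketch of the standard output-size formula for a convolution, floor((n - k + 2p) / s) + 1, assuming that "zero padding" in the question means a padding of 0. The output depth equals the number of kernels. The function name `conv_output_size` is ours.

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Spatial output size of a convolution over an n x n input with a k x k kernel."""
    return (n - k + 2 * padding) // stride + 1

num_kernels = 5
side = conv_output_size(224, 7, padding=0, stride=1)
output_shape = (side, side, num_kernels)
print(output_shape)  # (218, 218, 5)
```

The same formula applies to question 109: `conv_output_size(28, 7)` gives 22, i.e. a 22 X 22 output.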
72. Suppose we have a neural network with ReLU activation function. Let’s say we replace
ReLU activations by linear activations. Would this new neural network be able to approximate an XNOR
function? Note: The neural network was able to approximate the XNOR function with the ReLU activation
function.
YES NO B
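Worked note for question 72: with linear (identity) activations, stacked layers collapse into a single linear map, because W2(W1 x) = (W2 W1) x. A single linear map cannot represent XNOR, which is not linearly separable. The sketch below checks the collapse numerically; the weights are made-up values and biases are omitted for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Hypothetical weights for a 2-2-1 network with identity activations.
W1 = [[1.0, -2.0], [0.5, 3.0]]  # hidden layer, 2 x 2
W2 = [[2.0, 1.0]]               # output layer, 1 x 2
x = [[1.0], [0.0]]              # input as a column vector

two_layer = matmul(W2, matmul(W1, x))  # forward pass layer by layer
collapsed = matmul(matmul(W2, W1), x)  # one equivalent linear layer
print(two_layer, collapsed)            # [[2.5]] [[2.5]]
```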
73. Which of the following is a data augmentation technique used in image recognition tasks?
1. Horizontal flipping 2. Random cropping 3. Random scaling 4. Color jittering 5. Random
translation 6. Random shearing
1, 2, 4
2, 3, 4, 5, 6
1, 3, 5, 6
All of these D
74. Given an n-character word, we want to predict which character would be the (n+1)th character in the
sequence. For example, our input is “predictio” (which is a 9-character word) and we have to predict what
would be the 10th character. Which neural network architecture would be suitable to complete this task?
75. What is generally the sequence followed when building a neural network architecture for semantic
segmentation of images?
76. What is the technical difference between vanilla back propagation algorithm and back propagation
through time (BPTT) algorithm?
Unlike backprop, in BPTT we subtract gradients for corresponding weight for each
time step A
77. The exploding gradient problem is an issue in training deep networks where the gradient gets so large
that the loss goes to an infinitely high value and then explodes. What is the probable approach when
dealing with the “Exploding Gradient” problem in RNNs?
Gradient clipping
Dropout
None of these B
78. Which of the following is not a direct prediction technique for NLP tasks?
Recurrent Neural Network
Skip-gram model
PCA
79. Back propagation works by first calculating the gradient of ___ and then propagating it backwards.
80. A recurrent neural network can be unfolded into a fully connected neural network with infinite length.
TRUE FALSE A
81. It is generally recommended to replace pooling layers in generator part of convolutional generative
adversarial nets with ________ ?
Affine layer
ReLU layer C
82. In a neural network, knowing the weight and bias of each neuron is the most important step. If you
can somehow get the correct value of weight and bias for each neuron, you can approximate any function.
What would be the best way to approach this?
Search every possible combination of weights and biases till you get the best value
Iteratively check, after assigning a value, how far you are from the best values,
and slightly change the assigned values to make them better
None of these C
83. What are the steps for using a gradient descent algorithm? 1. Calculate error between the actual value
and the predicted value 2. Reiterate until you find the best weights of network 3. Pass an input through the
network and get values from output layer 4. Initialize random weight and bias 5. Go to each neuron
which contributes to the error and change its respective values to reduce the error
1, 2, 3, 4, 5
5, 4, 3, 2, 1
3, 2, 1, 5, 4
4, 3, 1, 5, 2 D
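Worked note for question 83: a minimal sketch of the step order 4, 3, 1, 5, 2 on a one-weight, one-bias linear model rather than a full network. The toy data, learning rate, epoch count and function name `train` are illustrative assumptions; the loss is mean squared error.

```python
import random

def train(xs, ys, lr=0.05, epochs=2000, seed=0):
    rng = random.Random(seed)
    # Step 4: initialize random weight and bias.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    # Step 2: reiterate (here for a fixed number of epochs).
    for _ in range(epochs):
        # Step 3: pass the inputs through the model to get outputs.
        preds = [w * x + b for x in xs]
        # Step 1: compute the error between predictions and actual values.
        errors = [p - y for p, y in zip(preds, ys)]
        # Step 5: adjust each parameter against its gradient to reduce the error.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = train(xs, ys)
print(round(w, 3), round(b, 3))  # close to 2 and 1
```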
84. “Convolutional Neural Networks can perform various types of transformation (rotations or scaling) in
an input”. Is the statement correct, True or False?
TRUE FALSE B
85 Which of the following techniques perform similar operations as dropout in a neural network?
Bagging
Boosting
Stacking
None of these A
Convolution function
87. What is the sequence of the following tasks in a perceptron? 1. Initialize weights of perceptron
randomly. 2. Go to the next batch of dataset. 3. If the prediction does not match the output, change the
weights. 4. For a sample input, compute an output.
1, 2, 3, 4
4, 3, 2, 1
3, 1, 2, 4
1, 4, 3, 2 D
YES NO A
Both A and B D
90. The number of neurons in the output layer should match the number of classes (Where the number of
classes is greater than 2) in a supervised learning task. True or False?
TRUE FALSE B
91.In a neural network, which of the following techniques is used to deal with overfitting?
Dropout
Regularization
Batch Normalization
All of these D
92. Y = ax^2 + bx + c (polynomial equation of degree 2). Can this equation be represented by a neural
network with a single hidden layer with linear threshold?
YES NO B
A unit which does not respond completely to any of the training patterns
None of these A
94. Which of the following statements is the best description of early stopping?
Train the network until a local minimum in the error function is reached
Simulate the network on a test dataset after every epoch of training. Stop training when the
generalization error starts to increase
Add a momentum term to the weight update in the Generalized Delta Rule, so that training
converges more quickly
BOTH
Can’t Say B
96. Suppose a convolutional neural network is trained on ImageNet dataset (Object recognition dataset).
This trained model is then given a completely white image as an input. The output probabilities for this
input would be equal for all classes. True or False?
TRUE FALSE B
97. When a pooling layer is added in a convolutional neural network, translation invariance is preserved.
True or False?
TRUE FALSE A
98. Which gradient technique is more advantageous when the data is too big to handle in
RAM simultaneously?
99. For a classification task, instead of random weight initializations in a neural network, we set all the
weights to zero. Which of the following statements is true?
There will not be any problem and the neural network will train properly
The neural network will train but all the neurons will end up recognizing the same
thing
The neural network will not train as there is no net gradient change
None of these B
100 For an image recognition problem (recognizing a cat in a photo), which architecture of
Perceptron B
101. What are the factors to select the depth of neural network? 1. Type of neural network (e.g. MLP,
CNN etc.) 2. Input data 3. Computation power, i.e. hardware capabilities and software capabilities
4. Learning Rate 5. The output function to map
1, 2, 4, 5
2, 3, 4, 5
1, 3, 4, 5
All of these D
102. Consider the scenario. The problem you are trying to solve has a small amount of data.
Fortunately, you have a pre-trained neural network that was trained on a similar problem. Which of the
following methodologies would you choose to make use of this pre-trained network?
Assess on every layer how the model performs and only select a few of them
Freeze all the layers except the last, re-train the last layer D
103 Increase in size of a convolutional kernel would necessarily increase the performance of a
convolutional network
TRUE FALSE B
Kernel SVM
Neural Networks
105 In which of the following applications can we use deep learning to solve the problem?
All of these D
106 Which of the following statements is true when you use 1×1 convolutions in a CNN?
108 The number of nodes in the input layer is 10 and the hidden layer is 5.
The maximum number of connections from the input layer to the hidden layer are
50
Less than 50
More than 50
It is an arbitrary value A
109 The input image has been converted into a matrix of size 28 X 28 and a kernel/filter
of size 7 X 7 with a stride of 1. What will be the size of the convoluted matrix?
22 X 22
21 X 21
28 X 28
7X7 A
110 In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden
layer and 1 neuron in the output layer. What is the size of the weight matrices between
[1 X 5] , [5 X 8]
[8 X 5] , [ 1 X 5]
[8 X 5] , [5 X 1]
[5 x 1] , [8 X 5] D
111. Which of the following functions can be used as an activation function in the output layer
if we wish to predict the probabilities of n classes (p1, p2, ..., pn) such that the sum of p over all n equals 1?
Softmax
ReLu
Sigmoid
Tanh A
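Worked note for question 111: a minimal softmax sketch showing that the outputs are positive and sum to 1, which is why it suits multi-class probability outputs. Subtracting the maximum score before exponentiating is a common numerical-stability choice and does not change the result. The input scores are made-up values.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical output-layer scores
print(probs, sum(probs))
```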
112. Assume a simple MLP model with 3 neurons and inputs= 1,2,3. The weights to the input
neurons are 4,5 and 6 respectively. Assume the activation function is a linear constant
32 643 96 48 C
113. Which of the following activation functions can’t be used at the output layer to classify an image
114 In the neural network, every parameter can have its own learning rate.
TRUE FALSE A
TRUE FALSE A
116 Which of the following neural network training challenges can be solved using batch
normalization?
Overfitting
Both B and C D
117 Which of the following would have a constant input in each epoch of training a Deep Learning
model?
118 Changing Sigmoid activation to ReLu will help to get over the vanishing gradient issue?
TRUE FALSE A
TRUE FALSE B
TRUE FALSE B
121 Suppose there is an issue while training a neural network. The training loss/validation loss
Both of these
NONE C
2: Dropout demands high learning rate 3: Dropout can help prevent overfitting
Both 1 and 2
Both 1 and 3
Both 2 and 3
All 1, 2 and 3 B
123 Gated Recurrent units can help prevent vanishing gradient problem in RNN.
TRUE FALSE A
Data Augmentation
Weight Sharing
Early Stopping
How accurately the SVM can predict outcomes for unseen data
126 When the C parameter is set to infinite, which of the following holds true?
The optimal hyperplane if exists, will be the one that completely separates the data
128 The minimum time complexity for training an SVM is O(n^2). According to this fact,
Large datasets
Small datasets
Selection of Kernel
Kernel Parameters
TRUE FALSE A
132 Suppose you are using RBF kernel in SVM with high Gamma value. What does this signify?
The model would consider even far away points from hyperplane for modeling
The model would consider only the points close to the hyperplane for modeling
The model would not be affected by distance of points from hyperplane for modeling
133 Which of the following are real world applications of the SVM?
Image Classification
KNN
K Means
Random Forest A
Memorization
Analogy
Deduction
Introduction D
136 Which of the following is an example of a deterministic algorithm?
Predictor variable
Independent variable
Response variable
Dependent variable A
138 Which of the following can be used to impute data sets based only on information in the training
set. ?
postProcess
preProcess
process
RMSE
RSquared
Accuracy
cl_forecast
cl_nowcast
cl_precast
141 Which algorithm is used for small and large data sets
SVM
RF
NAÏVE BAYES
DECISION TREES A
RF
SVM
Decision Trees
KNN A
Classification
Clustering
Regression
Association C
Supervised
Unsupervised
Reinforcement
None C
145 In which type of learning are the input and predicted output given for training data
Supervised
Unsupervised
Reinforcement
None A
146 In which type of algorithms is the distance between two points used for identifying neighbours