Machine Learning Bits

1. Which of the following is a widely used and effective machine learning algorithm based on the
idea of bagging?
Decision Tree
Regression
Classification
Random Forest D
2. To find the minimum or the maximum of a function, we set the gradient to zero because:
The value of the gradient at the extrema of a function is always zero
Depends on the type of problem
Both A and B
None of the above A
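A quick numerical check for question 2, as a minimal sketch. The function f and the helper numerical_gradient are illustrative assumptions, not part of the quiz; the claim holds for interior extrema of differentiable functions.

```python
# Illustrative function (an assumption): f(x) = (x - 2)^2 + 1, with its minimum at x = 2.
def f(x):
    return (x - 2.0) ** 2 + 1.0


def numerical_gradient(func, x, h=1e-6):
    """Central-difference estimate of d(func)/dx at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


grad_at_minimum = numerical_gradient(f, 2.0)  # close to 0 at the extremum
grad_elsewhere = numerical_gradient(f, 3.0)   # close to 2 away from it
print(grad_at_minimum, grad_elsewhere)
```

Setting the gradient to zero therefore locates candidate extrema; whether a candidate is a minimum or a maximum still has to be checked separately.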
3. The most widely used metrics and tools to assess a classification model are:
Confusion matrix
Cost-sensitive accuracy
Area under the ROC curve
All of the above D
4. Which of the following is a good test dataset characteristic?
Large enough to yield meaningful results
Is representative of the dataset as a whole
Both A and B
None of the above C
5. Which of the following is a disadvantage of decision trees?
Factor analysis
Decision trees are robust to outliers
Decision trees are prone to be overfit
None of the above C
6. How do you handle missing or corrupted data in a dataset?
Drop missing rows or columns
Replace missing values with mean/median/mode
Assign a unique category to missing values
All of the above D
7. What is the purpose of performing cross-validation?
To assess the predictive performance of the models
To judge how the trained model performs outside the sample on test data
Both A and B
None of the above C
8. Why is second order differencing in time series needed?
To remove non-stationarity
To find the maxima or minima at the local point
Both A and B
None of the above C
9. When performing regression or classification, which of the following is the correct way to
preprocess the data?
Normalize the data → PCA → training
PCA → normalize PCA output → training
Normalize the data → PCA → normalize PCA output → training
None of the above A
10. Which of the following is an example of feature extraction?
Constructing bag of words vector from an email
Applying PCA to project large, high-dimensional data
Removing stop words in a sentence
All of the above D
11. What is pca.components_ in Sklearn?
Set of all eigenvectors for the projection space
Matrix of principal components
Result of the multiplication matrix
None of the above options A
12. Which of the following is true about Naive Bayes?
Assumes that all the features in a dataset are equally important
Assumes that all the features in a dataset are independent
Both A and B
None of the above options C
13. Which of the following statements about regularization is not correct?
Using too large a value of lambda can cause your hypothesis to underfit the data.
Using too large a value of lambda can cause your hypothesis to overfit the data.
Using a very large value of lambda cannot hurt the performance of your hypothesis.
None of the above D
14. How can you prevent a clustering algorithm from getting stuck in bad local optima?
Set the same seed value for each run
Use multiple random initializations
Both A and B
None of the above B
15. Which of the following techniques can be used for normalization in text mining?
Stemming
Lemmatization
Stop Word Removal
Both A and B D
16. In which of the following cases will K-means clustering fail to give good results? 1) Data points
with outliers 2) Data points with different densities 3) Data points with non-convex shapes
1 and 2
2 and 3
1, 2, and 3
1 and 3 C
17. Which of the following is a reasonable way to select the number of principal components "k"?
Choose k to be the smallest value so that at least 99% of the variance is retained.
Choose k to be 99% of m (k = 0.99*m, rounded to the nearest integer).
Choose k to be the largest value so that 99% of the variance is retained.
Use the elbow method A
18. You run gradient descent for 15 iterations with α = 0.3 and compute J(θ) after each iteration.
You find that the value of J(θ) decreases quickly and then levels off. Based on this, which of
the following conclusions seems most plausible?
Rather than using the current value of α, use a larger value of α (say α = 1.0)
Rather than using the current value of α, use a smaller value of α (say α = 0.1)
α = 0.3 is an effective choice of learning rate
None of the above C
19. What is a sentence parser typically used for?
It is used to parse sentences to check if they are utf-8 compliant.
It is used to parse sentences to derive their most likely syntax tree structures.
It is used to parse sentences to assign POS tags to all tokens.
It is used to check if sentences can be parsed into meaningful tokens. B
20. Suppose you have trained a logistic regression classifier and, for a new example x, it outputs a
prediction hθ(x) = 0.2. This means:
Our estimate for P (y=1 | x)
Our estimate for P (y=0 | x) B
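For question 20, the hypothesis output of logistic regression is read as the estimate of P(y=1 | x), so the estimate of P(y=0 | x) is its complement. A minimal sketch, where the score z is a hypothetical value chosen only so that the sigmoid returns 0.2:

```python
import math


def sigmoid(z):
    """Logistic function used as the hypothesis of logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical linear score, picked so that sigmoid(z) = 0.2 (log-odds of 0.2 / 0.8).
z = math.log(0.2 / 0.8)

p_y1 = sigmoid(z)   # estimate of P(y = 1 | x)
p_y0 = 1.0 - p_y1   # estimate of P(y = 0 | x)
print(round(p_y1, 3), round(p_y0, 3))
```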
21. Which of the following is an example of a deterministic algorithm?
PCA
K-Means
None of the above
Both A and B A
22. Which of the following statement(s) is/are true for Gradient Descent (GD) and Stochastic
Gradient Descent (SGD)? 1. In GD and SGD, you update a set of parameters in an iterative manner
to minimize the error function. 2. In SGD, you have to run through all the samples in your training
set for a single update of a parameter in each iteration. 3. In GD, you either use the entire data or a
subset of training data to update a parameter in each iteration.
Only 1
Only 2
Only 3
1 and 2 A
23. Which of the following hyperparameter(s), when increased, may cause a random forest to overfit
the data? 1. Number of Trees 2. Depth of Tree 3. Learning Rate
Only 1
Only 2
Only 3
1 and 2 B
24. Below are the 8 actual values of target variable in the train file.[0,0,0,1,1,1,1,1]. What is the
entropy of the target variable?
-(5/8 log(5/8) + 3/8 log(3/8))
5/8 log(5/8) + 3/8 log(3/8)
3/8 log(5/8) + 5/8 log(3/8)
5/8 log(3/8) – 3/8 log(5/8) A
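The entropy in question 24 can be checked numerically. This is a minimal sketch; the quiz does not state the logarithm base, so base 2 (entropy in bits) is an assumption here.

```python
import math
from collections import Counter


def entropy(labels, base=2):
    """Shannon entropy of a list of class labels; base 2 is an assumption."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


target = [0, 0, 0, 1, 1, 1, 1, 1]
h = entropy(target)  # -(5/8 * log2(5/8) + 3/8 * log2(3/8))
print(round(h, 4))
```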
25. Let’s say you are working with categorical feature(s) and you have not looked at the distribution
of the categorical variable in the test data. You want to apply one hot encoding (OHE) on the
categorical feature(s). What challenges may you face if you have applied OHE on a categorical
variable of the train dataset?
Not all categories of the categorical variable are present in the test dataset
Frequency distribution of categories is different in train as compared to the test dataset.
Train and Test always have same distribution.
Both A and B D
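The first challenge in question 25 can be illustrated with a hand-rolled encoder. This is a sketch only: the colour values and the helpers fit_categories and one_hot are hypothetical, and real projects would normally use a library encoder instead.

```python
def fit_categories(values):
    """Learn the category list from the training column only."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value against the categories learned from train."""
    return [1 if value == c else 0 for c in categories]


train = ["red", "green", "blue", "green"]   # hypothetical train column
test = ["green", "purple"]                  # hypothetical test column

categories = fit_categories(train)
encoded_test = [one_hot(v, categories) for v in test]

# Train categories that never occur in test, and test values unseen in train.
missing_from_test = [c for c in categories if c not in test]
unseen_in_train = [v for v in test if v not in categories]
print(categories, encoded_test, missing_from_test, unseen_in_train)
```

A value that was never seen in train ends up as an all-zero row here, which is one reason to inspect the test distribution before fixing the encoding.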
26. Let’s say you are using activation function X in the hidden layers of a neural network. At a particular
neuron, for a given input, you get the output “-0.0001”. Which of the following activation
functions could X represent?
ReLU
tanh
SIGMOID
None of these B
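For question 26, it helps to evaluate the three candidate activations on a small negative input. A minimal sketch; the pre-activation value z is hypothetical.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


z = -0.0001  # hypothetical pre-activation value

outputs = {"relu": relu(z), "tanh": math.tanh(z), "sigmoid": sigmoid(z)}
# Only an activation whose range includes negative numbers can output -0.0001.
can_be_negative = [name for name, value in outputs.items() if value < 0]
print(outputs, can_be_negative)
```

ReLU outputs lie in [0, ∞) and sigmoid outputs in (0, 1), while tanh outputs lie in (-1, 1).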
27. LogLoss evaluation metric can have negative values.
TRUE FALSE B
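Why the answer to question 27 is FALSE: every term of the binary log loss is the negative log of a probability in (0, 1], so it cannot be negative. A minimal sketch, with the clipping constant eps as an assumption to avoid log(0):

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


typical = log_loss([1, 0, 1], [0.9, 0.1, 0.8])   # imperfect predictions
perfect = log_loss([1, 0], [1.0, 0.0])           # best case, approaches 0
print(typical, perfect)
```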
28. Which of the following statements is/are true about “Type-1” and “Type-2” errors?
1. Type 1 is known as false positive and Type 2 is known as false negative.
2. Type 1 is known as false negative and Type 2 is known as false positive.
3. Type 1 error occurs when we reject a null hypothesis when it is actually true.
Only 1
Only 2
Only 3
1 and 3 D
29. Which of the following is/are important step(s) to pre-process the text in NLP-based
projects? 1. Stemming 2. Stop word removal 3. Object Standardization
1 and 2
1 and 3
2 and 3
1,2 and 3 D
30. Suppose you want to project high dimensional data into lower dimensions. The two most
famous dimensionality reduction algorithms used here are PCA and t-SNE. Let’s say you have
applied both algorithms respectively on data “X” and you got the datasets “X_projected_PCA” and
“X_projected_tSNE”. Which of the following statements is true for “X_projected_PCA” and
“X_projected_tSNE”?
X_projected_PCA will have interpretation in the nearest neighbour space.
X_projected_tSNE will have interpretation in the nearest neighbour space.
Both will have interpretation in the nearest neighbour space.
None of them will have interpretation in the nearest neighbour space. B
31. Adding a non-important feature to a linear regression model may result in:
1. Increase in R-square 2. Decrease in R-square
Only 1 is correct
Only 2 is correct
Either 1 or 2
None of these A
32. Suppose you are given three variables X, Y and Z. The Pearson correlation coefficients for (X,
Y), (Y, Z) and (X, Z) are C1, C2 & C3 respectively. Now, you have added 2 to all values of X (i.e.
new values become X+2), subtracted 2 from all values of Y (i.e. new values are Y-2) and Z
remains the same. The new coefficients for (X,Y), (Y,Z) and (X,Z) are given by D1, D2 & D3
respectively. How do the values of D1, D2 & D3 relate to C1, C2 & C3?
D1= C1, D2 < C2, D3 > C3
D1 = C1, D2 > C2, D3 < C3
D1 = C1, D2 = C2, D3 = C3
Cannot be determined C
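Question 32 relies on the Pearson correlation being unchanged when a constant is added to a variable. A minimal sketch with made-up sample values (the lists X, Y and Z are illustrative assumptions):

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (spread_a * spread_b)


# Hypothetical sample values.
X = [1.0, 2.0, 4.0, 7.0]
Y = [2.0, 1.0, 5.0, 6.0]
Z = [3.0, 8.0, 1.0, 4.0]

c1, c2, c3 = pearson(X, Y), pearson(Y, Z), pearson(X, Z)

X_shifted = [x + 2 for x in X]
Y_shifted = [y - 2 for y in Y]
d1, d2, d3 = pearson(X_shifted, Y_shifted), pearson(Y_shifted, Z), pearson(X_shifted, Z)
print((c1, c2, c3), (d1, d2, d3))
```

Shifting moves the mean by the same constant, so every deviation from the mean, and hence every coefficient, stays the same.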
33. Imagine you are solving a classification problem with highly imbalanced classes. The majority
class is observed 99% of the time in the training data. Your model has 99% accuracy after taking the
predictions on test data. Which of the following is true in such a case?
1. Accuracy metric is not a good idea for imbalanced class problems.
2. Accuracy metric is a good idea for imbalanced class problems.
3. Precision and recall metrics are good for imbalanced class problems.
4. Precision and recall metrics aren’t good for imbalanced class problems.
1 and 3
1 and 4
2 and 3
2 and 4 A
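The situation in question 33 can be reproduced with a classifier that always predicts the majority class. A minimal sketch on made-up labels; treating precision and recall as 0 when their denominator is 0 is a convention assumed here.

```python
# Hypothetical test set: 1 positive example and 99 negatives.
y_true = [1] * 1 + [0] * 99
y_pred = [0] * 100  # a "model" that always predicts the majority class

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0
print(accuracy, precision, recall)
```

Accuracy looks excellent while recall on the minority class is zero, which is why precision and recall are the more informative metrics here.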
34. In ensemble learning, you aggregate the predictions of weak learners, so that an ensemble of
these models will give a better prediction than the individual models. Which of the
following statements is/are true for weak learners used in an ensemble model?
1. They don’t usually overfit.
2. They have high bias, so they cannot solve complex learning problems.
3. They usually overfit.
1 and 2
1 and 3
2 and 3
Only 1 A
35. Which of the following options is/are true for K-fold cross-validation? 1. An increase in K will result in
a higher time required to cross-validate the result. 2. Higher values of K will result in higher confidence in
the cross-validation result as compared to lower values of K. 3. If K=N, then it is called leave-one-out
cross-validation, where N is the number of observations.
1 and 2
2 and 3
1 and 3
1, 2 and 3 D
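The K-fold split in question 35 can be sketched in a few lines. The helper k_fold_indices is a hypothetical name and uses contiguous, unshuffled folds for simplicity.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, validation
        start += size


n = 6
folds_k3 = list(k_fold_indices(n, 3))    # 3 models to fit
folds_loo = list(k_fold_indices(n, n))   # K = N: leave-one-out, N models to fit
print(folds_k3[0], len(folds_loo))
```

The number of model fits grows with K, which is the extra time cost named in statement 1, and with K = N each validation fold holds exactly one observation.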
36. What would you do in PCA to get the same projection as SVD?
Transform data to zero mean
Transform data to zero median
Not possible
None of these A
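37. Suppose you are given a black-box algorithm that returns the nearest neighbour (1-NN) of a new
observation. It is possible to construct a k-NN classification algorithm based on this black box alone.
Note: n (the number of training observations) is very large compared to k.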
TRUE FALSE A
38. Instead of using the 1-NN black box, we want to use a j-NN (j > 1) algorithm as the black box.
Which of the following options is correct for finding k-NN using j-NN? 1. j must be a proper
factor of k. 2. j > k. 3. Not possible
1 2 3 4 A
39. Which of the following values of K will have the least leave-one-out cross-validation accuracy?
1NN
3NN
4NN
All have same leave one out error A
40. Suppose we have a dataset which can be trained with 100% accuracy with the help of a
decision tree of depth 6. Now consider the points below and choose the option based
on these points. Note: All other hyperparameters are the same and other factors are not affected.
1. Depth 4 will have high bias and low variance 2. Depth 4 will have low bias and low
variance
Only 1
Only 2
Both 1 and 2
None of the above A
41. Which of the following options can be used to get global minima in the k-Means algorithm? 1. Try to run
the algorithm for different centroid initializations 2. Adjust the number of iterations 3. Find out the optimal number
of clusters
2 and 3
1 and 3
1 and 2
All of above D
42. For which of the following hyperparameters is a higher value better for the decision tree algorithm?
1. Number of samples used for split 2. Depth of tree 3. Samples for leaf
1 and 2
2 and 3
1 and 3
Can’t say D
43. What is the dimension of the output feature map when you are using the given parameters?
28 width, 28 height and 8 depth
13 width, 28 height and 8 depth A
44. What are the dimensions of the output feature map when you are using the following parameters?
28 width, 28 height and 8 depth
13 width, 28 height and 8 depth B
45. The k-NN algorithm does more computation at test time than at train time.
TRUE FALSE A
46. Which of the following option is true about k-NN algorithm?
It can be used for classification
It can be used for regression
It can be used in both classification and regression
none of these C
47. Which of the following statements is true about the k-NN algorithm? 1. k-NN performs much better if all
of the data have the same scale 2. k-NN works well with a small number of input variables (p), but
struggles when the number of inputs is very large 3. k-NN makes no assumptions about the functional form
of the problem being solved
1 and 2
1 and 3
Only 1
All of the above D
48. Which of the following machine learning algorithms can be used for imputing missing values of both
categorical and continuous variables?
K-NN
Linear Regression
Logistic Regression A
49. Which of the following is true about Manhattan distance?
It can be used for continuous variables
It can be used for categorical variables
It can be used for categorical as well as continuous
None of these A
50. Which of the following distance measures do we use in the case of categorical variables in k-NN?
1. Hamming Distance 2. Euclidean Distance 3. Manhattan Distance
1 2 3 4 A
51. Which of the following is the Euclidean distance between the two data points A(1,3) and B(2,3)?
1 2 4 8 A
52. Which of the following is the Manhattan distance between the two data points A(1,3) and B(2,3)?
1 2 4 8 A
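Both distances in questions 51 and 52 can be verified directly. A minimal sketch; the helper names euclidean and manhattan are illustrative.

```python
import math


def euclidean(p, q):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


A = (1, 3)
B = (2, 3)
print(euclidean(A, B), manhattan(A, B))
```

The two points differ in only one coordinate, so both measures reduce to that single difference.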
53. Which of the following is true about k in k-NN in terms of bias?
When you increase k, the bias increases
When you decrease k, the bias increases
Can’t say
None of these A
54. Which of the following is true about k in k-NN in terms of variance?
When you increase k, the variance increases
When you decrease k, the variance increases
Can’t say
None of these B
55. When you find noise in the data, which of the following options would you consider in k-NN?
I will increase the value of k
I will decrease the value of k
Noise cannot be dependent on value of k
None of these A
56. In k-NN it is very likely to overfit due to the curse of dimensionality. Which of the following options
would you consider to handle such a problem? 1. Dimensionality Reduction 2. Feature selection
1
2
1 and 2
None of these C
57. Which of the following is/are true about Random Forest and Gradient Boosting ensemble methods?
1. Both methods can be used for classification tasks 2. Random Forest is used for classification whereas
Gradient Boosting is used for regression tasks 3. Random Forest is used for regression whereas Gradient
Boosting is used for classification tasks 4. Both methods can be used for regression tasks
1
2
4
1 and 4 D
58. Which of the following algorithms is not an example of an ensemble learning algorithm?
Random Forest
Decision Trees
Extra Trees
Gradient Boosting B
59. Suppose you are using a bagging-based algorithm, say a Random Forest, in model building. Which of
the following can be true? 1. The number of trees should be as large as possible 2. You will have
interpretability after using Random Forest
1
2
1 and 2
None of these A
60. A _________ is a decision support tool that uses a tree-like graph or model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility.
Decision tree
Graphs
Trees
Neural Networks A
61. Which of the following are advantages of Decision Trees?
Possible scenarios can be added
Use a white-box model, if a given result is provided by a model
Worst, best and expected values can be determined for different scenarios
All of the mentioned D
62. Decision Trees can be used for Classification Tasks.
TRUE FALSE A
63. Which is true for neural networks?
It has a set of nodes and connections
Each node computes its weighted input
Node could be in excited state or non-excited state
All of the mentioned D
64. Which of the following is true for neural networks? (i) The training time depends on the size of the
network. (ii) Neural networks can be simulated on a conventional computer. (iii) Artificial neurons are
identical in operation to biological ones.
All of the mentioned
(ii) is true
(i) and (ii) are true
None of the mentioned C
65. Which algorithm is used for solving temporal probabilistic reasoning?
Hill-climbing search
Hidden Markov model
Depth-first search
Breadth-first search B
66. How is the state of the process described in an HMM?
Literal
Single random variable
Single discrete random variable
None of the mentioned C
67. Where are the additional variables added in an HMM?
Temporal model
Reality model
Probability model
All of the mentioned A
68. Which of the following is a representation learning algorithm?
Neural network
Random Forest
k-Nearest neighbor
None of the above A
69. Increase in size of a convolutional kernel would necessarily increase the performance of a
convolutional neural network.
TRUE FALSE B
70. Which of the following categories would be suitable for this type of problem?
Fine tune only the last couple of layers and change the last layer (classification layer) to
regression layer
Freeze all the layers except the last, re-train the last layer
Re-train the model for the new dataset
None of these A
71. Suppose you have 5 convolutional kernels of size 7 x 7 with zero padding and stride 1 in the first layer
of a convolutional neural network. You pass an input of dimension 224 x 224 x 3 through this layer. What
are the dimensions of the data which the next layer will receive?
217 x 217 x 3
217 x 217 x 8
218 x 218 x 5
220 x 220 x 7 C
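The arithmetic behind question 71, as a minimal sketch. It assumes "zero padding" means a padding of 0 (no padding), which is the reading consistent with the answer key; conv_output_size is a hypothetical helper name.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


side = conv_output_size(224, 7, padding=0, stride=1)
num_kernels = 5
output_shape = (side, side, num_kernels)  # depth equals the number of kernels
print(output_shape)
```

The input depth of 3 does not appear in the output shape, because each kernel spans all input channels and produces one output channel.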
72. Suppose we have a neural network with the ReLU activation function. Let’s say we replace
the ReLU activations by linear activations. Would this new neural network be able to approximate an XNOR
function? Note: The neural network was able to approximate the XNOR function with the ReLU activation
function.
YES NO B
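One way to see the answer to question 72: a stack of linear layers collapses to a single affine map w1*x1 + w2*x2 + b, and every such map satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0). XNOR violates this identity. A minimal sketch with hypothetical weight settings:

```python
def affine(x1, x2, w1, w2, b):
    """What any network with only linear activations reduces to for two inputs."""
    return w1 * x1 + w2 * x2 + b


xnor = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}

# Hypothetical (w1, w2, b) settings; the identity below holds for any choice.
candidates = [(1, 1, 0), (-2, 3, 1), (5, -4, -1)]
identity_holds = all(
    affine(0, 0, *c) + affine(1, 1, *c) == affine(0, 1, *c) + affine(1, 0, *c)
    for c in candidates
)

xnor_diagonal_sum = xnor[(0, 0)] + xnor[(1, 1)]      # 2
xnor_off_diagonal_sum = xnor[(0, 1)] + xnor[(1, 0)]  # 0
print(identity_holds, xnor_diagonal_sum, xnor_off_diagonal_sum)
```

Since the two XNOR sums differ, no choice of weights in a purely linear network can reproduce the XNOR truth table.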
73. Which of the following is a data augmentation technique used in image recognition tasks?
1. Horizontal flipping 2. Random cropping 3. Random scaling 4. Color jittering 5. Random
translation 6. Random shearing
1, 2, 4
2, 3, 4, 5, 6
1, 3, 5, 6
All of these D
74. Given an n-character word, we want to predict which character would be the (n+1)th character in the
sequence. For example, our input is “predictio” (which is a 9-character word) and we have to predict what
would be the 10th character. Which neural network architecture would be suitable to complete this task?
Fully-Connected Neural Network
Convolutional Neural Network
Recurrent Neural Network
Restricted Boltzmann Machine C
75. What is generally the sequence followed when building a neural network architecture for semantic
segmentation of images?
Convolutional network on input and deconvolutional network on output
Deconvolutional network on input and convolutional network on output A
76. What is the technical difference between vanilla back propagation algorithm and back propagation
through time (BPTT) algorithm?
Unlike backprop, in BPTT we sum up gradients for corresponding weight for each time step
Unlike backprop, in BPTT we subtract gradients for corresponding weight for each time step A
77. The exploding gradient problem is an issue in training deep networks where the gradient gets so large
that the loss goes to an infinitely high value and then explodes. What is the probable approach when
dealing with the “Exploding Gradient” problem in RNNs?
Use modified architectures like LSTM and GRUs
Gradient clipping
Dropout
None of these B
78. Which of the following is not a direct prediction technique for NLP tasks?
Skip-gram model
PCA
Convolutional neural network C
79. Back propagation works by first calculating the gradient of ___ and then propagating it backwards.
Sum of squared error with respect to inputs
Sum of squared error with respect to weights
Sum of squared error with respect to outputs
None of the above C
80. A recurrent neural network can be unfolded into a fully connected neural network with infinite length.
TRUE FALSE A
81. It is generally recommended to replace pooling layers in generator part of convolutional generative
adversarial nets with ________ ?
Affine layer
Strided convolutional layer
Fractional strided convolutional layer
ReLU layer C
82. In a neural network, knowing the weight and bias of each neuron is the most important step. If you
can somehow get the correct value of weight and bias for each neuron, you can approximate any function.
What would be the best way to approach this?
Assign random values and pray to God they are correct
Search every possible combination of weights and biases till you get the best value
Iteratively check, after assigning a value, how far you are from the best values,
and slightly change the assigned values to make them better
None of these C
83. What are the steps for using a gradient descent algorithm? 1. Calculate the error between the actual value
and the predicted value 2. Reiterate until you find the best weights of the network 3. Pass an input through the
network and get values from the output layer 4. Initialize random weights and biases 5. Go to each neuron
which contributes to the error and change its respective values to reduce the error
1, 2, 3, 4, 5
5, 4, 3, 2, 1
3, 2, 1, 5, 4
4, 3, 1, 5, 2 D
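The ordering in question 83 can be traced on the smallest possible "network", a single linear unit y = w*x + b. This is a sketch under stated assumptions: the data, the learning rate and the iteration count are made up for illustration.

```python
import random

# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Step 4: initialize random weight and bias (seeded so the run is repeatable).
random.seed(0)
w = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)

learning_rate = 0.05  # assumed value
for _ in range(2000):  # Step 2: reiterate until the weights stop improving
    # Step 3: pass the inputs through the model and read the outputs.
    predictions = [w * x + b for x in xs]
    # Step 1: error between predicted and actual values.
    errors = [p - y for p, y in zip(predictions, ys)]
    # Step 5: adjust each parameter in the direction that reduces the mean squared error.
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2.0 * sum(errors) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 4), round(b, 4))
```

The loop body runs steps 3, 1 and 5 in that order after the one-off initialization, and the loop itself is step 2.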
84. “Convolutional Neural Networks can perform various types of transformation (rotations or scaling) on
an input”. Is the statement correct, True or False?
TRUE FALSE B
85 Which of the following techniques perform similar operations as dropout in a neural network?
Bagging
Boosting
Stacking
None of these A
86 Which of the following gives non-linearity to a neural network?
Stochastic Gradient Descent
Rectified Linear Unit
Convolution function
None of the above B
87. What is the sequence of the following tasks in a perceptron? 1. Initialize the weights of the perceptron
randomly. 2. Go to the next batch of the dataset. 3. If the prediction does not match the output, change the
weights. 4. For a sample input, compute an output.
1, 2, 3, 4
4, 3, 2, 1
3, 1, 2, 4
1, 4, 3, 2 D
88 Can a neural network model the function (y=1/x)?
YES NO A
89. In which neural net architecture does weight sharing occur?
Convolutional Neural Network
Recurrent Neural Network
Fully Connected Neural Network
Both A and B D
90. The number of neurons in the output layer should match the number of classes (Where the number of
classes is greater than 2) in a supervised learning task. True or False?
TRUE FALSE B
91.In a neural network, which of the following techniques is used to deal with overfitting?
Dropout
Regularization
Batch Normalization
All of these D
92. Y = ax^2 + bx + c (polynomial equation of degree 2). Can this equation be represented by a neural
network with a single hidden layer and a linear threshold?
YES NO B
93. What is a dead unit in a neural network?
A unit which doesn’t update during training by any of its neighbour
A unit which does not respond completely to any of the training patterns
The unit which produces the biggest sum-squared error
None of these A
94. Which of the following statement is the best description of early stopping?
Train the network until a local minimum in the error function is reached
Simulate the network on a test dataset after every epoch of training. Stop training when the
generalization error starts to increase
Add a momentum term to the weight update in the Generalized Delta Rule, so that training
converges more quickly
A faster version of backpropagation, such as the ‘Quickprop’ algorithm B
95. What if we use a learning rate that’s too large?
Network will converge
Network will not converge
Both
Can’t Say B
96. Suppose a convolutional neural network is trained on ImageNet dataset (Object recognition dataset).
This trained model is then given a completely white image as an input. The output probabilities for this
input would be equal for all classes. True or False?
TRUE FALSE B
97. When a pooling layer is added in a convolutional neural network, translation invariance is preserved.
True or False?
TRUE FALSE A
98. Which gradient technique is more advantageous when the data is too big to handle in
RAM simultaneously?
Full Batch Gradient Descent
Stochastic Gradient Descent B
99. For a classification task, instead of random weight initializations in a neural network, we set all the
weights to zero. Which of the following statements is true?
There will not be any problem and the neural network will train properly
The neural network will train but all the neurons will end up recognizing the same
thing
The neural network will not train as there is no net gradient change
None of these B
100 For an image recognition problem (recognizing a cat in a photo), which architecture of
neural network would be better suited to solve the problem?
Multi- Layer Perceptron
Convolutional Neural Network
Recurrent Neural network
Perceptron B
101. What are the factors to select the depth of a neural network? 1. Type of neural network (e.g. MLP,
CNN etc.) 2. Input data 3. Computation power, i.e. hardware capabilities and software capabilities
4. Learning Rate 5. The output function to map
1, 2, 4, 5
2, 3, 4, 5
1, 3, 4, 5
All of these D
102. Consider the scenario. The problem you are trying to solve has a small amount of data.
Fortunately, you have a pre-trained neural network that was trained on a similar problem. Which of the
following methodologies would you choose to make use of this pre-trained network?
Re-train the model for the new dataset
Assess on every layer how the model performs and only select a few of them
Fine tune the last couple of layers only
Freeze all the layers except the last, re-train the last layer D
103 Increase in size of a convolutional kernel would necessarily increase the performance of a
convolutional network
TRUE FALSE B
104 Which of the following are universal approximators?
Kernel SVM
Neural Networks
Boosted Decision Trees
All of the above D
105 In which of the following applications can we use deep learning to solve the problem?
Protein structure prediction
Prediction of chemical reactions
Detection of exotic particles
All of these D
106. Which of the following statements is true when you use 1×1 convolutions in a CNN?
It can help in dimensionality reduction
It can be used for feature pooling
It suffers less overfitting due to small kernel size
All of the above D
107. Which of the statements given below is true?
Statement 1: It is possible to train a network well by initializing all the weights as 0
Statement 2: It is possible to train a network well by initializing biases as 0
Statement 1 is true while Statement 2 is false
Statement 2 is true while statement 1 is false
Both statements are true
Both statements are false B
108 The number of nodes in the input layer is 10 and the hidden layer is 5.
The maximum number of connections from the input layer to the hidden layer are
50
Less than 50
More than 50
It is an arbitrary value A
109 The input image has been converted into a matrix of size 28 X 28 and a kernel/filter
of size 7 X 7 with a stride of 1. What will be the size of the convoluted matrix?
22 X 22
21 X 21
28 X 28
7X7 A
110. Consider a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden
layer and 1 neuron in the output layer. What is the size of the weight matrices between the
hidden-output layers and the input-hidden layers?
[1 X 5] , [5 X 8]
[8 X 5] , [ 1 X 5]
[8 X 5] , [5 X 1]
[5 x 1] , [8 X 5] D
111. Which of the following functions can be used as an activation function in the output layer
if we wish to predict the probabilities of n classes (p1, p2, ..., pn) such that the sum of p over all n classes equals 1?
Softmax
ReLu
Sigmoid
Tanh A
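The property asked for in question 111 can be checked on a small example. A minimal sketch; the input scores are hypothetical, and subtracting the maximum is a common numerical-stability trick rather than part of the definition.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that are positive and sum to 1."""
    largest = max(scores)  # subtracted only for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical output-layer scores
print(probs, sum(probs))
```

Sigmoid, tanh and ReLU act on each output independently, so nothing forces their outputs to sum to 1.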
112. Assume a simple MLP model with 3 neurons and inputs = 1, 2, 3. The weights to the input
neurons are 4, 5 and 6 respectively. Assume the activation function is linear with a constant
factor of 3. What will be the output?
32 643 96 48 C
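The arithmetic for question 112, as a minimal sketch. It assumes the activation multiplies the weighted sum by 3, that is f(z) = 3z, which is the reading consistent with the answer key.

```python
inputs = [1, 2, 3]
weights = [4, 5, 6]

# Weighted sum of the inputs: 1*4 + 2*5 + 3*6.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))


def activation(z, factor=3):
    """Linear activation f(z) = factor * z (assumed interpretation)."""
    return factor * z


output = activation(weighted_sum)
print(weighted_sum, output)
```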
113. Which of the following activation functions can’t be used at the output layer to classify an image?
Sigmoid
Tanh
ReLU
If(x>5,1,0) C
114. In a neural network, every parameter can have its own learning rate.
TRUE FALSE A
115 Dropout can be applied at visible layer of Neural Network model?
TRUE FALSE A
116 Which of the following neural network training challenge can be solved using batch
normalization?
Overfitting
Restrict activations from becoming too high or too low
Training is too slow
Both B and C D
117. Which of the following would have a constant input in each epoch of training a Deep Learning
model?
Weight between input and hidden layer
Weight between hidden and output layer
Biases of all hidden layer neurons
Activation function of output layer A
118 Changing Sigmoid activation to ReLu will help to get over the vanishing gradient issue?
TRUE FALSE A
119 In CNN, having max pooling always decrease the parameters?
TRUE FALSE B
120. Backpropagation cannot be applied when using pooling layers
TRUE FALSE B
121 Suppose there is an issue while training a neural network. The training loss/validation loss
remains constant. What could be the possible reason?
Architecture is not defined correctly
Data given to the model is noisy
Both of these
NONE C
122. Which of the following statements is true regarding dropout?
1: Dropout gives a way to approximate by combining many different architectures
2: Dropout demands high learning rate 3: Dropout can help in preventing overfitting
Both 1 and 2
Both 1 and 3
Both 2 and 3
All 1, 2 and 3 B
123 Gated Recurrent units can help prevent vanishing gradient problem in RNN.
TRUE FALSE A
124 What steps can we take to prevent overfitting in a Neural Network?
Data Augmentation
Weight Sharing
Early Stopping
All of the above D
125 What do you mean by generalization error in terms of the SVM?
How far the hyperplane is from the support vectors
How accurately the SVM can predict outcomes for unseen data
The threshold amount of error in an SVM B
126 When the C parameter is set to infinite, which of the following holds true?
The optimal hyperplane if exists, will be the one that completely separates the data
The soft-margin classifier will separate the data
None of the above A
127 What do you mean by a hard margin?
The SVM allows very low error in classification
The SVM allows high amount of error in classification
None of the above A
128. The minimum time complexity for training an SVM is O(n^2). According to this fact,
what sizes of datasets are not best suited for SVMs?
Large datasets
Small datasets
Medium sized datasets
Size does not matter A
129 The effectiveness of an SVM depends upon:
Selection of Kernel
Kernel Parameters
Soft Margin Parameter C
All of the above D

130 Support vectors are the data points that lie closest to the decision surface.
TRUE FALSE A
131. SVMs are less effective when:
The data is linearly separable
The data is clean and ready to use
The data is noisy and contains overlapping points C
132 Suppose you are using RBF kernel in SVM with high Gamma value. What does this signify?
The model would consider even far away points from hyperplane for modeling
The model would consider only the points close to the hyperplane for modeling
The model would not be affected by distance of points from hyperplane for modeling
None of the above B
133 Which of the following are real world applications of the SVM?
Text and Hypertext Categorization
Image Classification
Clustering of News Articles
All of the above D
134. Which algorithm is a lazy algorithm?
KNN
K Means
Support Vector Machines
Random Forest A
135. Which of the following is not included among the different learning methods?
Memorization
Analogy
Deduction
Introduction D
136. Which of the following is an example of a deterministic algorithm?
PCA
K-Means
Support Vector Machines
KNN A
137. Another name for an output attribute is:
Predictor variable
Independent variable
Response variable
Dependent variable A
138 Which of the following can be used to impute data sets based only on information in the training
set. ?
postProcess
preProcess
process
All of the Mentioned B
139 Which of the following is a categorical outcome?
RMSE
RSquared
Accuracy
All of the Mentioned C
140 Which of the following function provides unsupervised prediction ?
cl_forecast
cl_nowcast
cl_precast
None of the Mentioned D
141. Which algorithm is used for small and large data sets?
SVM
RF
NAÏVE BAYES
DECISION TREES A
142. Which algorithm forms a blend of trees?
RF
SVM
Decision Trees
KNN A
143. Which type of algorithm is used for statistical analysis?
Classification
Clustering
Regression
Association C
144. Which type of learning deals with an environment?
Supervised
Unsupervised
Reinforcement
None C
145. In which type of learning are the input and the predicted output given in the training data?
Supervised
Unsupervised
Reinforcement
None A
146. In which type of algorithm is the distance between two points used for identifying neighbours?
RF
SVM
Decision Trees
KNN D
147. Which algorithm is referred to as CART?
RF
SVM
Decision Trees
KNN C
148. Which algorithm involves posterior and prior probabilities?
RF
SVM
Decision Trees
Naïve Bayes D
149. Which algorithm has kernel-based features?
RF
SVM
Decision Trees
Naïve Bayes B
150. Which algorithm is subject to the overfitting problem?
RF
SVM
Decision Trees
Naïve Bayes C
