Major 2020
Q1. Calculate the cosine, correlation, Jaccard, and Extended Jaccard similarity/distance for the
vectors x = (1, 1, 0, 1, 0, 1) and y = (1, 1, 1, 0, 0, 1). [2]
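As a way of checking hand calculations for Q1, the four measures can be sketched in a few lines of Python. This is a minimal sketch, not a model answer: the function names are illustrative, the Jaccard coefficient is taken in its binary form f11 / (f01 + f10 + f11), and Extended Jaccard is assumed to mean the Tanimoto coefficient x·y / (|x|² + |y|² − x·y).

```python
import math

x = [1, 1, 0, 1, 0, 1]
y = [1, 1, 1, 0, 0, 1]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def cosine(a, b):
    """Cosine similarity: a.b / (|a| |b|)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def correlation(a, b):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    centred_a = [ai - mean_a for ai in a]
    centred_b = [bi - mean_b for bi in b]
    return cosine(centred_a, centred_b)

def jaccard(a, b):
    """Binary Jaccard: f11 / (f01 + f10 + f11); 0-0 matches are ignored."""
    f11 = sum(1 for ai, bi in zip(a, b) if ai == 1 and bi == 1)
    mismatches = sum(1 for ai, bi in zip(a, b) if ai != bi)
    return f11 / (f11 + mismatches)

def extended_jaccard(a, b):
    """Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))

cos_xy = cosine(x, y)
corr_xy = correlation(x, y)
jac_xy = jaccard(x, y)
ext_jac_xy = extended_jaccard(x, y)
```

For these two vectors the similarities come out as 0.75 (cosine), 0.25 (correlation), 0.6 (Jaccard) and 0.6 (Extended Jaccard). If a distance is wanted, one common convention is 1 minus the similarity.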
Q2. (a) How does an ordinal feature differ from a nominal feature? Explain briefly. [1]
Q3. We will use the dataset below to learn a decision tree which predicts if people pass machine
learning (Yes or No), based on their previous GPA (High, Medium, or Low) and whether or not they
studied.
Q4. You are given the transaction data shown in the Table below. There are 9 distinct transactions
(order:1 – order:9) and each transaction involves between 2 and 4 items.
a. Apply the Apriori algorithm to the dataset of transactions and identify all frequent
k-itemsets. Show all of your work.
b. Find all strong association rules of the form X ∧ Y → Z and note their confidence values.
c. Construct the FP-tree corresponding to the set of transactions. Show all steps involved.
d. Mine the FP-tree according to the FP-growth algorithm. Show all steps. The results should
include the set of frequent patterns generated through the different steps in the analysis.
Q5. Assume that you have to explore a large data set of high dimensionality. You know nothing
about the distribution of the data. In text of no more than one page, discuss the following.
(a) How can k-means and DBSCAN be used to find the number of clusters in that data? [1]
(b) Explain how PCA can help find the dimensions where clusters separate. [1]
(c) Explain why PCA might neglect cluster separation in some dimensions. [1]
(d) Can k-means or DBSCAN be applied in a way that would help you find the dimensions in
which the clusters separate? [1]
Q6. (a) State one advantage and one disadvantage of the DBSCAN clustering algorithm, with a brief
explanation of each. [1]
(b) Suppose you are already given an optimal value for ‘Minimum Points’ in the DBSCAN clustering
algorithm. Explain how it can be used to find an optimal value for the distance ‘Eps’ used by the
algorithm. [1]
(c) Describe the unsupervised cluster evaluation method that uses correlation and the similarity matrix,
as discussed in class. How can that method be used to validate the clusters obtained? What is
the intuition behind this evaluation measure? [1]
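For Q6 (b), one commonly taught heuristic is the sorted k-distance plot: compute each point's distance to its k-th nearest neighbour, sort those distances, and read Eps off the point where the curve bends sharply. The sketch below assumes k is set equal to the given Minimum Points value (some treatments use Minimum Points − 1); the function name `k_distances` and the toy points are hypothetical.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point (p itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical toy data: four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
```

Here the first four k-distances equal 1 and the last jumps to about 13.45, so a value of Eps just above 1 would keep the square together and leave the distant point as noise. On real data the bend is usually located by plotting the sorted values.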
Q7. Given the correlation matrix $R = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}$: [1.5+1+1.5]
(a) Compute the eigenvalues λ1 and λ2 of R and the corresponding eigenvectors γ1 and γ2 of R.
(b) Compute the weights of the principal components C1 and C2 that set the scales of the
components and ensure that they are orthogonal.
(c) What proportion of the total variance in the data does the first principal component account
for?
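The arithmetic in Q7 can be checked numerically with the closed-form eigen-decomposition of a symmetric 2×2 matrix. This is a minimal sketch that assumes R has unit diagonal and off-diagonal entry 0.9; the helper name `sym2x2_eigen` is illustrative, and the unit-length eigenvectors are only one possible choice of scaling for the component weights.

```python
import math

def sym2x2_eigen(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], assuming b != 0.

    Returns [(lambda1, v1), (lambda2, v2)] with lambda1 >= lambda2 and
    unit-length eigenvectors.
    """
    mid = (a + d) / 2
    half_gap = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    pairs = []
    for lam in (mid + half_gap, mid - half_gap):
        v = (b, lam - a)  # satisfies (a - lam) * v[0] + b * v[1] = 0
        norm = math.hypot(*v)
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs

(lam1, v1), (lam2, v2) = sym2x2_eigen(1.0, 0.9, 1.0)

# Share of total variance carried by the first principal component.
proportion_first = lam1 / (lam1 + lam2)
```

With these inputs the eigenvalues are 1.9 and 0.1, the eigenvectors are proportional to (1, 1) and (1, −1), and the first component accounts for 95% of the total variance.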
Q8. (a) Discuss the differences between dimensionality reduction based on aggregation and dimensionality
reduction based on PCA. [1]
(b) Many statistical tests for outliers were developed in an environment in which a few
hundred observations was a large data set. We explore the limitations of such
approaches.
(1) For a set of 1,000,000 values, how likely are we to have outliers according to the
test that says a value is an outlier if it is more than three standard deviations from
the average? (Assume a normal distribution.) [1.5]
(2) Does the approach that states an outlier is an object of unusually low probability
need to be adjusted when dealing with large data sets? If so, how? [1.5]
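For Q8 (b)(1), the two-sided tail probability of a normal distribution beyond three standard deviations can be evaluated with the complementary error function. The sketch below assumes independent draws from a normal distribution and treats the mean and standard deviation as known; the variable names are illustrative.

```python
import math

n = 1_000_000

# P(|Z| > 3) for a standard normal Z, via erfc(3 / sqrt(2)).
p_tail = math.erfc(3 / math.sqrt(2))

# Expected number of values flagged by the three-sigma rule.
expected_flagged = n * p_tail

# Probability that at least one of the n values is flagged.
p_at_least_one = 1 - (1 - p_tail) ** n
```

The tail probability is about 0.0027, so roughly 2,700 of the 1,000,000 values would be expected to exceed three standard deviations, and seeing at least one such value is practically certain.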