- Categorical Hinge Loss. This loss function is inspired by a variant of the hinge loss proposed for multi-class classification (Dogan et al., 2016; Moore & DeNero, 2011) and is used only for the discrete choice setting. It upper bounds the categorical 0/1-loss and is defined as: $d_{CH}(y, s) = \max\left(1 + \max_{(i,j \in I):\, y_j = 1,\, y_i = 0} (s_i - s_j),\; 0\right)$. The loss takes the maximum difference between the score $s_i$ of any object $i \in I \setminus \{j\}$ in Q that was not chosen and the score $s_j$ of the chosen object ($y_j = 1$). So, if the score of any object that is not chosen is greater than the score of the chosen object ($s_i > s_j$), the loss is high, as shown in Figure 5; it is zero only when the chosen object's score exceeds every other score by at least 1. We use this loss function instead of categorical cross-entropy because it not only penalizes a low predicted score for the chosen object but also accounts for the margin to the scores of the other objects in the given choice task Q.
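The categorical hinge loss can be illustrated with a minimal pure-Python sketch. The function name `categorical_hinge` and the example scores are hypothetical, and the sketch assumes exactly one chosen object per choice task, as in the discrete choice setting.

```python
def categorical_hinge(y, s):
    """Hinge loss max(1 + max_{i: y_i = 0}(s_i - s_j), 0) for the chosen object j."""
    chosen = y.index(1)  # assumption: exactly one object is chosen
    # Largest amount by which a non-chosen score exceeds the chosen score.
    worst_margin = max(s_i - s[chosen] for i, s_i in enumerate(s) if i != chosen)
    return max(1.0 + worst_margin, 0.0)


# Chosen object (index 1) beats every other score by at least 1: no loss.
well_separated = categorical_hinge([0, 1, 0], [0.0, 2.0, 0.5])
# A non-chosen object scores 1.0 higher than the chosen one: loss of 2.0.
violated = categorical_hinge([0, 1, 0], [1.5, 0.5, 0.0])
```

The second call shows the margin behavior: the loss grows linearly with the amount by which the best non-chosen score exceeds the chosen score.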
- Comparison approaches. In order to compare our proposed neural network based choice models FATE-NET and FETA-NET to an independent latent scoring model, we adapt the ranking algorithm RANKNET, which was proposed for solving the task of object ranking using the underlying pairwise preferences (Burges et al., 2005; Tesauro, 1989). Footnote 3: https://github.com/kiudee/cs-ranking
- Figure 4b shows the attraction effect. In this case, another, asymmetrically dominated object C is added to the existing set of objects {A, B}, where B slightly dominates C; as a result, the relative utility share of object B increases relative to A. The primary psychological reason is that consumers prefer the dominating products out of a set (Huber & Puto, 1983). Overall, the consumer choice might change from A to B when another alternative is added to the set.
- For the choice setting, the metric is calculated by comparing the ground-truth choice set c(Q), in binary vector form $y$, for the given choice task $Q = \{x_1, \ldots, x_n\}$ with the predicted choice set ĉ(Q), in binary vector form $\hat{y}$; the metrics are defined in the form $d(y, \hat{y})$ with $|Q| = |y| = n$. To define the metrics, we first need four quantities analogous to the entries of the confusion matrix in binary classification, i.e., true positives, true negatives, false positives, and false negatives (Koyejo et al., 2015). Formally they are defined as: $d_{TP}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \llbracket y_i = 1, \hat{y}_i = 1 \rrbracket$, $d_{TN}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \llbracket y_i = 0, \hat{y}_i = 0 \rrbracket$, $d_{FP}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \llbracket y_i = 0, \hat{y}_i = 1 \rrbracket$, $d_{FN}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \llbracket y_i = 1, \hat{y}_i = 0 \rrbracket$.
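The four confusion-matrix quantities can be computed as in the following sketch. It assumes the usual convention that a false positive is an object predicted as chosen but absent from the ground-truth choice set, and a false negative is a chosen object that the prediction misses; the function name `confusion_quantities` and the example vectors are hypothetical.

```python
def confusion_quantities(y, y_hat):
    """Return the normalized TP, TN, FP, FN fractions for one choice task."""
    n = len(y)
    pairs = list(zip(y, y_hat))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1) / n,
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0) / n,
        # Assumed convention: predicted as chosen, but not chosen in the ground truth.
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1) / n,
        # Assumed convention: chosen in the ground truth, but not predicted.
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0) / n,
    }


y = [1, 0, 0, 0]      # ground-truth choice set in binary form
y_hat = [1, 1, 1, 0]  # predicted choice set in binary form
quantities = confusion_quantities(y, y_hat)
```

Because every object falls into exactly one of the four cases, the four fractions sum to 1 for any choice task.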
- Hyperparameters & Inference. For all neural network models, we make use of the following techniques:
  • We use either rectified linear unit (ReLU) nonlinearities with batch normalization (BN) (Ioffe & Szegedy, 2015) or scaled exponential linear unit (SELU) nonlinearities, which are self-normalizing (Klambauer et al., 2017), for each hidden layer.
  • Regularization: L2 penalties are applied and the corresponding regularization strength is tuned.
- The important difference between multi-label classification and the choice function setting is that there are no fixed labels: the objects differ from one choice task to the next, so per-label statistics cannot be aggregated. That is why we can only use micro-averaging to compute the F1-measure across different objects and instances (Koyejo et al., 2015).
- The most common GEV models used for conjoint analysis studies in the field of market research are NESTEDLOGIT and GENNESTEDLOGIT, which account for the similarity context effect (Ben-Akiva et al., 1985; Tversky, 1972). These models allocate the objects in the given choice task Q to different sets called nests, $B = \{B_1, \ldots, B_K\}$, and learn correlations between the objects inside each nest (Wen & Koppelman, 2001; Train, 2009). GENNESTEDLOGIT is the most general model of this class: it allows the fractional allocation of each object in Q to each nest and learns the correlation between them (Wen & Koppelman, 2001). Another model proposed for solving the task of discrete choice is PAIRWISESVM, which makes use of the underlying pairwise preferences to fit a linear model.
- The similarity effect is another phenomenon, according to which the presence of one or more similar objects reduces their overall probability of being chosen, as it divides the loyalty of potential consumers (Huber & Puto, 1983). In Figure 4c, B and C are two similar objects. Consumers who prefer high quality will be divided between the two objects, resulting in a decrease of the relative utility share of object B. In the original set these consumers always choose B, whereas adding another object C that is similar to B can change the overall choice to A.
  [Figure 4: objects A, B, and C plotted by Quality and Price; panels (a) Compromise, (b) Attraction, (c) ...]
- The step-decay function drops the learning rate by a constant factor every few epochs (Duchi et al., 2011). The intuition is to first move quickly toward a good region of the parameter space with a larger learning rate and then reduce the learning rate so that the optimizer can settle into narrower parts of the loss surface. Formally it is defined as: $lr = lr_0 \cdot d_r^{\lfloor e / e_{drop} \rfloor}$, where $lr_0$ is the initial learning rate, $0 < d_r < 1$ is the factor by which the learning rate is reduced, $e$ is the current epoch, and $e_{drop}$ is the number of epochs after which the learning rate is decreased.
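A minimal sketch of the step-decay schedule follows. It assumes the exponent is the integer part of e / edrop, so the rate stays constant within each block of edrop epochs; the function name `step_decay` and the example values (initial rate 0.1, factor 0.5, drop every 10 epochs) are hypothetical.

```python
def step_decay(epoch, lr0, drop_rate, epochs_drop):
    """Learning rate lr0 * drop_rate ** floor(epoch / epochs_drop)."""
    # Integer division implements the floor for non-negative epochs.
    return lr0 * drop_rate ** (epoch // epochs_drop)


# The rate is halved once per completed block of 10 epochs.
schedule = [step_decay(e, lr0=0.1, drop_rate=0.5, epochs_drop=10) for e in (0, 9, 10, 25)]
```

A function of this shape can be passed to a scheduler hook of a training framework; which hook applies depends on the framework and is not specified here.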
- Top-k Categorical Accuracy. The top-k categorical accuracy is defined as the fraction of times in which the set of objects in the top k positions, according to the predicted scores, contains the ground-truth chosen object (Chollet et al., 2017; Ben-Akiva et al., 1985). Let $r^{\downarrow} := \operatorname{arg\,sort}_{i \in \{1, \ldots, |Q|\}} s_i$ denote the indexes of the score vector $s$ when sorted in decreasing order. Then the top-k categorical accuracy is defined as $d_{topK}(c(Q), s) = \left\llbracket c(Q) \subset \bigcup_{i=1}^{k} \{x_{r^{\downarrow}_i}\} \right\rrbracket$.
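The top-k categorical accuracy can be sketched as follows. The chosen object is represented by its index in Q, ties are broken by original position, and the function names and example scores are hypothetical.

```python
def top_k_accuracy(chosen_index, scores, k):
    """1.0 if the chosen object is among the k highest-scored objects, else 0.0."""
    # Indexes of the objects sorted by decreasing score.
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if chosen_index in ranking[:k] else 0.0


def mean_top_k_accuracy(chosen_indices, score_vectors, k):
    """Fraction of choice tasks in which the chosen object is in the top k."""
    hits = [top_k_accuracy(c, s, k) for c, s in zip(chosen_indices, score_vectors)]
    return sum(hits) / len(hits)


scores = [0.2, 0.7, 0.1, 0.5]
top1 = top_k_accuracy(3, scores, k=1)  # object 3 has only the second-highest score
top2 = top_k_accuracy(3, scores, k=2)
mean_top1 = mean_top_k_accuracy([3, 1], [scores, scores], k=1)
```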
- Precision. Precision denotes the proportion of predicted positive labels that are correct (Powers, 2011). For the choice setting, this can be defined as the fraction of objects from the predicted choice set ĉ(Q) that are actually chosen by the decision maker, i.e., that are present in the ground-truth choice set c(Q). Formally it is defined as: $d_{PR} = \frac{d_{TP}}{d_{TP} + d_{FP}}$
- F1-measure. The traditional F1-measure is defined as the harmonic mean of precision and recall: $d_{F1}(y, \hat{y}) = \frac{2\, d_{PR}\, d_{RE}}{d_{PR} + d_{RE}}$. It can also be written in terms of the confusion matrix quantities as follows (Koyejo et al., 2015): $d_{F1}(y, \hat{y}) = \frac{2\, d_{TP}}{2\, d_{TP} + d_{FN} + d_{FP}}$.
  Learning Choice Functions
  A.5. Discrete Choice Function Metrics
  We evaluate the DCMs based on top-k categorical accuracy, while the models are compared on discrete choice tasks of different sizes based on the normalized accuracy. In the discrete choice setting, the metric is calculated by comparing the ground-truth choice set/discrete choice c(Q) for the given discrete choice task $Q = \{x_1, \ldots, x_n\}$ with the vector $s = (s_1, \ldots, s_n)$ of predicted scores for each object in Q, and the metrics are defined in the form $d(c(Q), s)$.
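The two forms of the F1-measure given above can be checked against each other with a short sketch. It works with raw counts, since the common 1/n factor cancels in every ratio, and it returns 0.0 when a denominator is zero; that zero-division convention, the function name, and the example vectors are assumptions made for illustration.

```python
def precision_recall_f1(y, y_hat):
    """Return precision, recall, and F1 in both the harmonic-mean and the count form."""
    pairs = list(zip(y, y_hat))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted but not chosen
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # chosen but not predicted
    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    harmonic = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    count_form = 2 * tp / (2 * tp + fn + fp) if tp + fn + fp else 0.0
    return precision, recall, harmonic, count_form


precision, recall, f1_harmonic, f1_counts = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```

In this example two of the three predicted objects are correct and two of the three chosen objects are recovered, so both F1 forms evaluate to 2/3.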
- Subset 0/1 Accuracy. Subset 0/1 accuracy measures how often the ground-truth choice set c(Q) and the predicted choice set ĉ(Q) are exactly the same, i.e., how often the algorithm's predictions match the complete choice set. Formally it is defined as: $d_{SUBSET} = \llbracket y = \hat{y} \rrbracket$.
  Recall. Recall is defined as the proportion of real positive cases that are correctly predicted positive (Powers, 2011). In the field of information retrieval, it is the fraction of the relevant documents that are successfully retrieved. For the choice setting, this can be defined as the fraction of objects from the ground-truth choice set c(Q) that are chosen successfully, i.e., that are present in the predicted choice set ĉ(Q). Formally it is defined as: $d_{RE} = \frac{d_{TP}}{d_{TP} + d_{FN}}$
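Subset 0/1 accuracy is an all-or-nothing comparison per choice task, which the following sketch makes explicit. Averaging over several tasks, possibly of different sizes, is an assumption about how the per-task values are aggregated; the function names and example data are hypothetical.

```python
def subset_accuracy(y, y_hat):
    """1.0 if the predicted choice set equals the ground truth exactly, else 0.0."""
    return 1.0 if list(y) == list(y_hat) else 0.0


def mean_subset_accuracy(truths, predictions):
    """Average the per-task indicator over a collection of choice tasks."""
    values = [subset_accuracy(y, y_hat) for y, y_hat in zip(truths, predictions)]
    return sum(values) / len(values)


truths = [[1, 0, 1], [0, 1, 0, 0]]
predictions = [[1, 0, 1], [0, 1, 1, 0]]  # the second task has one extra object
average = mean_subset_accuracy(truths, predictions)
```

A single wrongly included or missed object sets the value for that task to zero, which is why this metric is stricter than precision, recall, or F1.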
- $\sum_{i=1} y_i \log s_i$, The loss increases as the predicted score $s_i$ of the chosen object ($y_i = 1$, $y_i \in y$) diverges from 1 (Murphy, 2012). So, predicting a score of 0.012 for the chosen object $i \in I$ would result in a high loss value, while a perfect model would have a log loss of 0, as shown in Figure 5.
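The behavior described above can be reproduced with a small sketch. It assumes the standard categorical cross-entropy form, the negated sum of y_i log s_i over all objects, and scores that are normalized probabilities in (0, 1]; the function name and the example score vectors are hypothetical.

```python
import math


def categorical_cross_entropy(y, s):
    """Assumed standard form: -sum_i y_i * log(s_i) over the objects of one task."""
    # Only chosen objects (y_i = 1) contribute, which also avoids log(0) elsewhere.
    return -sum(y_i * math.log(s_i) for y_i, s_i in zip(y, s) if y_i)


# A score of 0.012 for the chosen object gives a loss of roughly 4.42.
high_loss = categorical_cross_entropy([0, 1, 0], [0.5, 0.012, 0.488])
# A score of 1 for the chosen object gives a loss of 0.
perfect = categorical_cross_entropy([0, 1, 0], [0.0, 1.0, 0.0])
```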
