Abstract
Inefficient interaction such as long and/or repetitive questionnaires can be detrimental to user experience, which leads us to investigate the computation of an intelligent questionnaire for a prediction task. Given time and budget constraints (maximum q questions asked), this questionnaire will select adaptively the question sequence based on answers already given. Several use-cases with increased user and customer experience are given.
The problem is framed as a Markov Decision Process and solved numerically with approximate dynamic programming, exploiting the hierarchical and episodic structure of the problem. The approach, evaluated on toy models and classic supervised learning datasets, outperforms two baselines: a decision tree with budget constraint and a model with q best features systematically asked. The online problem, quite critical for deployment seems to pose no particular issue, under the right exploration strategy.
This setting is quite flexible and can incorporate easily initial available data and grouped questions.
1 Introduction
In user interaction, less is often more. When asking questions, it is desirable to ask as few questions as possible or given a budget of questions asking the most interesting ones. We study the case of intelligent questionnaires in which the questions asked may depend on the previous answers. More precisely, we consider a set of p questions in a prediction context. Given a budget of q questions, we design an algorithm choosing sequentially the next question to be asked so that the predictive power is maximized after having q answers. We assume that we have observed the whole set of answers on a first dataset and that no prior knowledge is available.
This setting is quite general and comprises for example:
Patient follow-up. Consider a patient who is hospitalized at home and fills in a daily checkup questionnaire asking for leg pain, a hurting chest or physical discomfort. The aim of the questionnaire being to check the status of the patient, we would like to ask the most pertinent questions as soon as possible and personalize the questions to their status.
Prospective calls e. g. telemarketing. Instead of bluntly unrolling the same list of questions, we could adapt our series of questions in order to know as fast as possible if the person called would be interested or not in our product.
Cold start issue with new customers. When subscribing to a new service, it is not uncommon to get asked some questions in order to personalize the service e. g. Netflix, web service provider. Assuming we already have a customer clustering at hand, we would like to find the new customer’s cluster with as seamlessly as possible. One way to do so is to ask very few questions.
Balancing acquisition cost with available information. Assuming the data is paid for, e. g. personal data sold, we would choose which information to pay for each people.
Storing less data to make as good predictions. Considering that data storage has non-negligible cost, whether it be financial, facility-wise or environmental, we wish to only keep data which is essential to the prediction. One way to do so is to store a sparse matrix.
Such a sequence of questions depending on the previous answers has a tree-like structure. A classical CART algorithm [6] with a tree of depth q provides a solution but is optimized in a top-down manner whereas we propose a bottom-up optimization. Furthermore, we allow much more flexibility than a single coordinate thresholding to choose the next question, or than association rules.
We formulate this problem as a sequential decision-making problem and represent it by a Markov Decision Process [16] where the state is the information currently available, actions are the questions we ask and as final rewards the prediction performance based on the final partial information. Take a look at Figure 1 to have an idea of the expected result. Such a modeling has been used for instance in [7] to tackle the game of 20 questions relying on a smart matrix factorization. This setting has also been used in active learning [15] and more recently in a health diagnosis problem using a reinforcement learning approach [5].
In our setting, we assume we have the full set of answers in a first dataset, hence we do not have an exploration issue and can thus focus on a planning approach [11]. We show how to use approximate dynamic programming [3], [4] to propose this adaptive sequence of questions so that it outperforms a fixed subset of q questions or a depth q CART decision tree. Several toy models and benchmark datasets will be used as means to validate our approach.
We start by presenting the methodology in section 2, followed by experimental results on toy models and benchmark datasets in section 3. In section 4 we investigate the possibility of bringing this methodology online, by testing it on one of the toy models. A general conclusion as well as ideas for improvements are presented in section 5.
2 Methodology
2.1 Setting
Consider
We aim at designing an Intelligent Questionnaire algorithm
where the function
We propose the following score function:
where
2.2 Markov Decision Process
The questionnaire process can be modeled by a Markov Decision Process (MDP, [16])
Finally, R is the reward function defined for any
For feature vectors with less than q elements, the reward function equals 0.
The MDP starts with initial state
In one of our experiments (Coronary Heart Disease dataset), we extend the setting to the case where initial information is available, allowing us to personalize the initial question. We also consider that to a given action, multiple features may be revealed. Other extensions are discussed in the last section.
Please note that it is straightforward to consider the case where initial information is known, and we still would like to reveal q new informations. Also, one might consider a specific action-question interaction, for instance: one might consider that some action reveal groups of features, because those are somewhat related and packed together during the interaction. To do so, one would only need to specify a binary matrix indicating the relationship between an action and the feature revealed: 1 if it is, 0 otherwise. We applied this formalism to one of the benchmark datasets, see Figure 2, for which we considered that gender information was available at the beginning and we considered that action 6 consisted in a short heart rate monitoring, hence providing us directly with both systolic and diastolic blood pressure levels.
2.3 Proposed Solution
We assume that we have at our disposal the set
The problem being episodic (q steps) and hierarchical, we can use approximate dynamic programming [4] to learn the value functions in a backward fashion as presented in Figure 1.
Based on the calibrated neural networks
The objective at the end is to get something resembling the decision-making procedure presented in Figure 1.
2.4 Online Setting
Let us now assume that we are going to collect questionnaires in a smart way i. e. starting with no data and with the constraint on the number of questions authorized. Specifically, we assume that this process starts with a run-in period where
Let us now propose candidate strategies for
with control parameter β,
Another policy is the Upper-Confidence Bound (UCB) strategy which suggests taking:
where
3 Experiments
To evaluate the methodology presented above, we used three toy models we built as well as three standard supervised learning benchmark datasets. The toy models were built in order to ensure that the methodology achieves proper performances against baselines and therefore validate quantitatively its interest. Amongst the three datasets we considered, there is the Boston Housing, the AMES and the Coronary Heart Disease (CHD) dataset. The first two, although not practically realistic for the intended use, serve as quantitative evaluation of the methodology performance on real-life data. The CHD dataset however is quite close to the motivation of this paper.
Amongst the six problems, four of them were regression problems, evaluated with Root Mean Squared Error (RMSE) metric. The two others were binary classification problems, evaluated with Area Under the Curve (AUC) metric.
For each of those problems, we built an Intelligent Questionnaire algorithm with a budget of
3.1 Toy Models
Three toy models were considered in order to test our approach. In each case we simulated in total 6000 samples, 67 % of which are used for training and the rest for testing. As predictor functions, we used random forests with 100 trees, as implemented in the R package randomForest. For models #2 and #3, we will write ε a standard Gaussian noise generated independently of features X. For model #1 we considered the inaccuracy score function and for models #2 and #3 we used the squared prediction error.
3.1.1 Model #1, Set of Rules with Binary Features
We consider
Let
We define
3.1.2 Model #2, Set of Rules with Binary and Continuous Features
We consider
From there
with
In this toy model we add some stochasticity in the target and we consider mixed-type features.
3.1.3 Model #3: Regression with Continuous Features
We consider
From there,
In this model, the target is a linear regression of the covariates, whose parameters depend on whether the first coordinate is strictly positive or not.
3.2 Benchmark Datasets
Three benchmark datasets where considered: the Boston Housing dataset [10], the more recent AMES dataset [8] and the Coronary Heart Disease dataset [2], [1]. The first two contain house prices and characteristics jointly. Because of the relatively low sample sizes we relied on linear regression models as prediction functions, rather than non-parametric models. For the Coronary Heart Disease problem, having relatively large sample size we used extreme gradient boosting with validation split for early stopping, relying on R package xgboost.
3.2.1 Boston Housing Dataset
This dataset contains 506 observations (one per suburb) and 13 variables, amongst which: the median value of owner-occupied homes (the target), crime rate, average number of rooms, pupil-teacher ratio.
3.2.2 House Prices Dataset, AMES
This dataset consists of 2930 observations of house value (log-scaled) and 81 characteristics such as overall quality, year of construction, surface information. The set of features was brought down to the following ten variables: OverallQual, GrLivArea, YearBuilt, GarageCars, TotalBsmtSF, GarageArea, X1stFlrSF, FullBath, YearRemodAdd, LotArea.
3.2.3 Coronary Heart Disease, CHD
This dataset contains 4238 observations of patients: socio-demographic information (e. g. gender), medical information (e. g. diabetes), medical examination (e. g. glucose) and finally whether the patient developed Coronary Heart Disease during the following ten year period. For this problem, we assumed gender as an already-known feature and one-to-one relationship between actions and features except for one: action 6 reveals diastolic and systolic blood pressure simultaneously, see Figure 2.
Problem | Metric | Bound | Oracle | Intelligent Questionnaire | Best q subset | CART, maxdepth = q |
Toy models | ||||||
#1 | AUC | 1 | 1 (0) | 0.87 (0.01) | 0.75 (0.02) | 0.81 (0.01) |
#2 | RMSE | 0.2 | 0.3 (0.01) | 0.37 (0.01) | 0.46 (0.01) | 0.41 (0.01) |
#3 | RMSE |
|
1.56 (0.02) | 1.57 (0.03) | 1.83 (0.03) | 1.84 (0.02) |
Benchmark datasets | ||||||
Boston Housing | RMSE | 4.99 (0.57) | 4.92 (0.54) | 5.33 (0.54) | 5.16 (0.64) | |
AMES Housing | RMSE | 4.3 (0.22) | 4.56 (0.14) | 4.67 (0.19) | 5.62 (0.15) | |
CHD | AUC | 69.73 (1.92) | 62.38 (3.2) | 61.04 (4.35) | 60 (3.41) |
3.3 Results
For each problem, we replicate the train/test split at random 10 times, calibrate our Intelligent Questionnaire as well as three baselines on the training set and evaluate on the test set. Toy model #1 and CHD being classification problems we used the AUC metric for comparison, whilst we used RMSE for all other problems.
In Table 1 we report the test set performance average and its standard deviation in parenthesis. The oracle corresponds to the model using all p features and best q subset relies on the fixed subset of features which performs best on the training set. CART algorithm with maximum depth q is a decision tree calibrated using R package Rpart. On all problems, whether it be toy models or benchmark datasets, the Intelligent Questionnaire outperforms both best q subset and the decision tree with maximum depth. For toy models, optimal performance bounds are provided as element of comparison.
In Figure 3 are represented the question sequences asked by the Intelligent Questionnaire on a test set of the AMES problem. The thickness of the arrows indicate the proportions of cases making the transitions. The best q subset was found to be (OverallQual, GrLiveArea, YearBuilt). Those variables still matter a lot in the Intelligent Questionnaire, but it seems to be more interesting, depending on OverallQual observed to ask for GarageArea or YearBuilt. This questionnaire manages which information is more important to get depending on previously recorded values, as expected. Note that without initial information, the first feature asked for is systematically the same. In the CHD problem however, we witnessed that depending on the gender, known initially, the first question asked varied.
4 Reinforcement Learning on Toy Model
In practice, it is interesting to be able to update sequentially our model whilst using it, as presented in section 3. In this section we investigate our ability to quickly find a proper questionnaire on one of the toy models introduced earlier using a Reinforcement Learning approach [17]. The associated exploration-exploitation dilemma is the following: choosing questions enhancing our knowledge of
4.1 Specifics
We considered toy model #3 presented in preceding section: consider
The target variable is defined by the simple linear model
where
We start with 32 samples for each q-question combination and then at each iteration we sample a total of
We considered the softmax and UCB exploration policies, presented in section 2.4, setting, quite arbitrarily two different exploration parameters. For all but one of the softmax approaches, we used the square root absolute prediction error instead of the squared error, because the latter has a highly-skewed distribution towards zero, which seems to be causing issues in the neural network training. It however may be that some parameters in the training on the neural network should be changed for the approach to perform properly.
The template neural network used for computing
In order to obtain our results, we repeated each scenario i. e. exploration policy in 10 experiments with set seed values.
4.2 Results
In order to analyze and compare the algorithms performances we represent here the Root Mean Squared Error (RMSE) of prediction observed on test and acquired samples on figures 4 and 5 respectively.
Examining Figure 4, which represents the RMSE distribution over time steps and for the different policies, we can make several observations: (a) UCB methods take a longer time to reduce RMSE in contrast with softmax exploration, (b) UCB methods also present overall more variability than softmax experiment do, (c) the reward distribution matters, as the softmax policy relying on original squared errors is the only on which seems to stuck to a floor level, two times larger than the lower bound
One can note that, because of the particular parameters, in the case of softmax the less-exploratory option (
We followed the analysis by checking which action was selected first in the learnt questionnaire and we observed that the softmax approach allows to converge quickly to picking question 3 first for almost all experiments around step 15, whilst this proportion is around 75 % for UCB(100) and 50 % for UCB (50).
Let us now analyze how the different exploration policies behaved on the acquired samples based on Figure 5. It appears that the RMSE on acquired samples drops only for the softmax policies, indicating that the samples collected were close to optimal selections. This finding was confirmed by checking the proportion of triplets sampled (1,2,3) and (3,4,5).
4.3 Suggestions
In this experiment, it appears that the softmax exploration policy seems to be more easy to deal with compared to UCB from an exploration perspective: it allows for much more exploration, without waiting for next steps; as such the learning happens more quickly and when we finally differentiate in terms of Q-values it clearly stands out, whereas in UCB it is not so clear, performs poorly in comparison in terms of cumulative regret as well as final regret. Also it seems more intuitive to compare Q-values amongst themselves rather than with an exploration bonus of a completely different nature. Also, it seems that reward distribution (at the last layer of the questionnaire) does matter, comparing the case where the squared error was taken instead of the root absolute error. Something we thoughtfully omitted, is that the reward function changes throughout iterations, since it is defined based on our prediction
Finally, we could consider a more permissive version of the game, where we can actually ask more than three questions, but are penalized for it. That way, we could potentially learn much more, and it would open the way to more diverse exploration strategies.
5 Conclusion
In this work we built an adaptive predictive-questionnaire under constraint over the number of questions, motivated by the important balance between data acquisition and user experience. As this is a sequential decision-making problem we used a Markov Decision Process to model it. Furthermore, its episodic and hierarchical structure allowed us to apply an approximate dynamic programming approach to learn the best adaptive questionnaire based on available data on couple (X, Y), which is the standard setting in supervised learning. Regarding the evaluation of this approach, we have shown on three toy models as well as three classic benchmark datasets that our approach outperforms (a) a decision tree submitted to the same budget constraint of questions (b) classic models based on the most informative subset of questions. Option (b) being non adaptive and option (a) being limited to one-dimensional splits of data, our approach allows for much more flexibility. The application on the third dataset, which is the closest to our target application, showed that this approach can integrate easily some initially-known features and how actions unveil features. Finally, we showed on a toy model that our approach can be pursued in an online setting, which is interesting for practical applications. As a continuation of this work, we have a few ideas for further research.
Scaling with ( p , q )
In our approach we relied on an approximation of state-value functions based on neural networks, without considering much on the dimension parameters. It is apparent that the higher
Stopping Criterion
In the algorithm constructed, we assumed a budget constraint over the features requested. This approach makes sense in some applications as it reduces globally the number of requests from us to the user and it simplifies some computational aspects. We could also consider the case where we would stop asking questions as soon as we believe we have gathered enough information on X in order to predict Y to a satisfying level.
Performance Criterion
As we defined it, the performance of our algorithm is quantified through predictive power. From the user experience perspective, it might be worth taking into account the cost of each question/action: in the Coronary Heart Disease problem, some medical exams and checks might have higher costs than others and as such be favored differently. Bringing this work a step closer to Human-Computer Interaction, we could even consider choosing the appropriate question format e. g. radio-button choice versus numeric input. Overall this is about balancing predictive power and information retrieval cost, keeping in mind that such cost is related to user experience.
Truthful Responses
Throughout this work we assumed that when an element of X is asked for it is revealed exactly. As such, the same element is never asked for twice and we completely trust the response given. In some applications, the order and overall position of a question in a survey affects the answer given, which can sometimes have very important consequences, see [12]. Therefore, the answer given should be treated as a noisy version of the truth, where the noise actually depends on the order followed and previous answers that were given. This would lead us to model the interaction through a Partially Observable MDP [11].
Towards Human Computer Interaction, Online Evaluation
The framework we worked on suffers from the same issues one might find in the supervised learning setting, which assumes that independent and identically distributed pairs of random variables
About the authors
Trained in statistics, machine learning and programming, I am currently pursuing a PhD at the Centre de Mathématiques APpliquées (CMAP) of Ecole Polytechnique. My current research interest is in the application of Reinforcement Learning (RL) techniques to real-world problems. I am also interested in bandit problems and generative methods for data augmentation in RL.
I have been an associate professor in the Department of Applied Mathematics at École Polytechnique since September 2013. My research is carried out at the Centre de Mathématiques APpliquées (CMAP). I work there on data problems in signal processing, statistics and learning with a taste for applications. I am co-directing with Emmanuel Gobet the SIMPAS team (Signal IMage Probabibilités numériques et Apprentissage Statistique). I am also a member of the Inria Xpop team. I am in charge of many training programs offered by the school: PA of Data Science, MScT Data Science for Business and Data Science Starter Program of Polytechnique Executive Education. I also contributed to the creation of the M2 Datascience at the Institut Polytechnique de Paris.
I’m an International Expert in Data Science and Digital who had the good fortune to believe, at the beginning of the 21st century, that the advances in Machine Learning and Artificial Intelligence could be of great help in solving scientific problems for industrial applications. It is now a truism that the whole world will be impacted in profound ways by the revolution of Artificial Intelligence. Lucky enough to have jumped early into this exciting field, I completed my PhD in Machine Learning and Data Mining in 2006, and have ever since been addressing various data science challenges both in academic and industrial settings.
References
[1] Framingham Heart study dataset. https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset. Accessed: 2020-05-01.10.1007/978-1-4614-6439-6_802-3Search in Google Scholar
[2] Framingham Heart Study, Three Generations of Dedication. https://framinghamheartstudy.org. Accessed: 2020-05-01.Search in Google Scholar
[3] Bellman, R. Dynamic Programming, 1 ed. Princeton University Press, Princeton, NJ, USA, 1957.Search in Google Scholar
[4] Bertsekas, D. P., and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.Search in Google Scholar
[5] Besson, R., Pennec, E. L., Allassonniere, S., Stirnemann, J., Spaggiari, E., and Neuraz, A. A model-based reinforcement learning approach for a rare disease diagnostic task. arXiv preprint arXiv:1811.10112 (2018).Search in Google Scholar
[6] Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. Classification and regression trees. CRC press, 1984.Search in Google Scholar
[7] Chen, Y., Chen, B., Duan, X., Lou, J.-G., Wang, Y., Zhu, W., and Cao, Y. Learning-to-ask: Knowledge acquisition via 20 questions. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), pp. 1216–1225.10.1145/3219819.3220047Search in Google Scholar
[8] De Cock, D. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education 19, 3 (2011).10.1080/10691898.2011.11889627Search in Google Scholar
[9] Dunlop, M. D. Ontology-Driven, Adaptive, Medical Questionnaires for Patients with Mild Learning Disabilities. In Artificial Intelligence XXXVI: 39th SGAI International Conference on Artificial Intelligence, AI 2019, Cambridge, UK, December 17–19, 2019, Proceedings (2019), Springer, p. 107.10.1007/978-3-030-34885-4_8Search in Google Scholar
[10] Harrison Jr, D., and Rubinfeld, D. L. Hedonic housing prices and the demand for clean air.Search in Google Scholar
[11] Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial intelligence 101, 1-2 (1998), 99–134.10.1016/S0004-3702(98)00023-XSearch in Google Scholar
[12] Magelssen, M., Supphellen, M., Nortvedt, P., and Materstvedt, L. J. Attitudes towards assisted dying are influenced by question wording and order: a survey experiment. BMC medical ethics 17, 1 (2016), 24.10.1186/s12910-016-0107-3Search in Google Scholar PubMed PubMed Central
[13] Mwamikazi, E., Fournier-Viger, P., Moghrabi, C., Barhoumi, A., and Baudouin, R. An adaptive questionnaire for automatic identification of learning styles. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (2014), Springer, pp. 399–409.10.1007/978-3-319-07455-9_42Search in Google Scholar
[14] Nokelainen, P., Niemivirta, M., Kurhila, J., Miettinen, M., Silander, T., and Tirri, H. Implementation of an adaptive questionnaire. In Proceedings of the ED-MEDIA Conference (2001), pp. 1412–1413.Search in Google Scholar
[15] Provost, F., Melville, P., and Saar-Tsechansky, M. Data acquisition and cost-effective predictive modeling: targeting offers for electronic commerce. In Proceedings of the ninth international conference on Electronic commerce (2007), pp. 389–398.10.1145/1282100.1282172Search in Google Scholar
[16] Puterman, M. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 2014.Search in Google Scholar
[17] Sutton, R. S., and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.Search in Google Scholar
© 2020 Walter de Gruyter GmbH, Berlin/Boston