2014 IEEE Conference on Computer Vision and Pattern Recognition

An Automated Estimator of Image Visual Realism Based on Human Cognition

Shaojing Fan1,2, Tian-Tsong Ng2, Jonathan S. Herberg4, Bryan L. Koenig3, Cheston Y.-C. Tan2, and Rangding Wang∗1

1 Ningbo University   2 Institute for Infocomm Research   3 Lindenwood University   4 Institute of High Performance Computing

Abstract

Assessing the visual realism of images is increasingly becoming an essential aspect of fields ranging from computer graphics (CG) rendering to photo manipulation. In this paper we systematically evaluate factors underlying human perception of visual realism and use that information to create an automated assessment of visual realism. We make the following unique contributions. First, we established a benchmark dataset of images with empirically determined visual realism scores. Second, we identified attributes potentially related to image realism, and used correlational techniques to determine that realism was most related to image naturalness, familiarity, aesthetics, and semantics. Third, we created an attributes-motivated, automated computational model that estimated image visual realism quantitatively. Using human assessment as a benchmark, the model was below human performance, but outperformed other state-of-the-art algorithms.

Figure 1. Image type may not indicate visual realism – photos may appear unrealistic whereas CG images can appear very real. Above are images of different realism levels from our visual realism dataset. Half of the images in each row are CGs, half are photos. The number in parentheses represents the realism score (the proportion of participants who rated the image as a photo rather than as CG).

1. Introduction

Visual realism is defined as the degree to which an image appears to people to be a photo rather than computer generated. Predicting image visual realism is a challenging yet important task for the visualization and CG communities. For instance, image realism could be used as a metric for CG image quality evaluation or during manipulation of the realism level of computer games. Image realism could also be integrated into content-based image retrieval and image forensics.

Over the last decade, some noteworthy research has provided a base for understanding visual realism. In the CG community, scholars have analyzed the impact of rendering parameters like illumination and shadow on how similar a CG image is to reality, i.e., its CG fidelity [17, 21]. In the computer vision field, much research has been devoted to detecting and improving the realism of composite images [14, 28]. However, we are unaware of any research that has systematically analyzed the perceptual factors relevant to the visual realism of images of general scenes, or how these perceptual factors could be turned into a quantitative realism estimation problem. Current datasets in related fields only contain labels of image type, with no ground truth on realism score; they are therefore not suitable for quantified realism assessment. Moreover, there is no set of unified evaluation criteria for such a quantitative estimation.

Our research differs from previous work in computer vision on image-type classification. Our method is realism-centric, focusing on estimating the realism level of individual images regardless of their type (Fig. 1). In this paper, we

∗ Corresponding author. E-mail: wangrangding@nbu.edu.cn

1063-6919/14 $31.00 © 2014 IEEE 4197


DOI 10.1109/CVPR.2014.535
Authorized licensed use limited to: UNIVERSIDAD DE LAS AMERICAS PUEBLA. Downloaded on August 29,2023 at 00:06:02 UTC from IEEE Xplore. Restrictions apply.
develop a computational approach to realism estimation that incorporates factors empirically related to human realism assessment. The paper has three goals. First, to construct a unified benchmark dataset for quantitative realism estimation (Sec. 2). Second, to explore the high-level attributes related to the visual realism of images (Sec. 3). Third, to develop a model rooted in machine learning for automatically inferring the realism of images from their visual content, and to assess model performance in terms of the degree to which it matches human performance (Sec. 4).

Figure 2. Sample CG images (top) and photos (bottom) from our dataset, distributed based on degree of realism. The numbers on the bar represent the realism score (the proportion of participants who rated the image as a photo rather than as CG).

1.1. Related work

CG fidelity: Since the early 1980s, research has explored CG fidelity [17, 21]. A common approach has been controlled experiments in which participants judge between a real scene and its CG replica generated with different parameter settings. Recent work [7] showed that realism perception of face images is related to intrinsic image components such as shading and reflectance, as well as cognitive factors such as viewers' expertise and ethnicity. However, these studies were conducted using datasets limited to specific scenes and small sample sizes. The current work includes additional visual factors and other image attributes important to visual realism, and is based on a large-scale dataset with a variety of scenes.

Image type classification: Computer vision researchers tend to focus on how to classify images as photos or CG based on various image characteristics, such as higher-order wavelet statistics [16], physics-motivated geometry features [19], and physical noise rendered by cameras [5]. However, these methods are not directly rooted in human perception, which is an essential contrast to our approach. Although such algorithms, developed apart from considerations of human perception, can often reach high classification accuracy, the features used are usually sensitive to image manipulations such as compression and post-processing. Our work on visual realism differs fundamentally from previous photo-vs-CG classifiers in three ways. First, visual realism is perceptual; image type is not. Second, realism scores range from 0 to 1, whereas photo-vs-CG is a binary distinction; image type does not necessarily indicate image realism level, and vice versa (Fig. 1). Third, our study included matte paintings, which are hybrid images naturally characterized by visual realism but not by image type as photo or CG.

Image composites evaluation: Some studies have focused on understanding and assessing the realism of composite images [14, 28]. Evaluation of various image statistical measures has indicated that the most important factors for the realism of composite images are illumination, color, and saturation.

2. Visual realism benchmark dataset

We established a benchmark dataset based on quantitative measures of the visual realism of each image. Visual realism scores were collected from a large-scale psychophysics study on Amazon Mechanical Turk (MTurk). The following section describes the assembly of the dataset and the study.

2.1. Dataset construction

The dataset consists of 2520 images, half photos and half CG images. Sample images and dataset statistics are shown in Figs. 1, 2 and 3.

A good dataset should reflect the types of images we encounter in daily life. However, digital technology has advanced sufficiently that hybrid images whose image type is difficult to determine have become common. For instance, digital matte painting (MP) images are now common in movies. Digital MP images are often composed of a CG image superimposed on a base plate (a photo or moving footage; Fig. 4). Our dataset differs from others in related fields in that it includes digital MP images; we selected those for which over 1/3 of the image area is CG.

We excluded CG images with unrealistic content, like spaceships flying in a city. Obviously unrealistic CG images like cartoons were also excluded. All images were scaled and cropped about their centers to 256×256 pixels.
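As a concrete reference, the realism score used throughout (the proportion of participants who judged an image to be a photo; see Fig. 2 and Sec. 2.2) is straightforward to compute. The image ids and judgment lists below are hypothetical, not from the released dataset:

```python
def realism_scores(judgments):
    """Map image id -> list of binary judgments (1 = 'photo', 0 = 'CG')
    to image id -> realism score in [0, 1]."""
    return {img: sum(votes) / len(votes) for img, votes in judgments.items()}

# Hypothetical judgments (Sec. 2.2 used a mean of 31 raters per image).
judgments = {
    "cg_castle": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],  # a fairly convincing CG image
    "photo_dog": [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],  # a photo one rater took for CG
}
scores = realism_scores(judgments)
print(scores)  # {'cg_castle': 0.7, 'photo_dog': 0.9}
```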

Figure 3. A depiction of the variation across our visual realism dataset. Each bar is labeled by variety category (leftmost labels). Within each category, different specific features, labeled above the sub-bars, apply to CG images versus photos (except the last two rows). Sub-bar lengths represent proportions. In the first category, high realism, medium realism, and low realism indicate realism scores in the ranges (.67, 1], (.33, .67], and [0, .33], respectively.

Figure 4. An example of digital matte painting. Left: Final matte painting. Right: Original image before applying matte painting. Courtesy of Matte World Digital, CA.

2.2. Psychophysics study I: perceptual realism

Study design: We had workers on MTurk view a sequence of images and judge each as "CG" or "photo". We defined CG images as entirely or in part created using computer software. To estimate how diverse our participants were regarding prior familiarity with CG images and photography, we asked participants to select one or more options that best fit their background from "have jobs related to graphic design", "keen computer game players", "photographers or photography enthusiasts", and "laypersons". We paid workers $1.00 for completing the task, and to encourage participants to try their best we paid a $0.20 bonus to workers whose accuracy exceeded 90%.

Empirical realism score: We performed a pilot study to determine how many participants are necessary to provide sufficient reliability for visual realism assessment. Split-half correlations and root mean square error analysis suggested that 30 judgments per image is enough (for details see supplementary material [1]). Based on the pilot study, we recruited 1292 participants from MTurk (for all studies we required workers to have a > 95% approval rating in Amazon's system). Each image was judged by a mean of 31 participants. We calculated a realism score (ranging from 0 to 1) for each image as the number of judgments indicating that the image is a photo over the total number of judgments for that image. The distribution of realism scores and sample images of different realism levels are shown in Figs. 3 and 2, respectively.

2.3. Dataset statistics

We wanted our findings to generalize to the various types of images people often see. We also hoped the computational model built on the dataset would generalize in terms of image type, realism level, and image content, so significant diversity in images is important. Fig. 3 summarizes the statistics of our dataset. For more detailed information, readers may refer to our project website [1].

3. Measuring attributes and visual realism

In order to assess images' visual realism by constructing a computational model similar to human perception, we first investigated image attributes relevant to people's visual realism perception and modeled visual realism empirically.

3.1. Psychophysics study II: attributes annotation

We recruited a new group of 3794 MTurk participants to annotate the images (Table 1; for the complete questionnaire see supplementary material [1]). On average, 10 participants annotated each image. We also had images labeled via LabelMe [23], an online annotation tool (a31−32 in Table 1). Due to budget constraints, these tasks were done for half of the entire image set. These 1260 images, which we refer to as the annotated subset, were selected so that their realism scores were distributed as uniformly as possible for both photos and CG images over the entire realism score range.

3.2. Correlation of attributes and visual realism

We measured the relationships between image attributes and visual realism using the realism scores obtained in Study I (Sec. 2.2) as ground truth. We used Spearman's rank-order correlation (ρ) and one-way ANOVA [22] to assess these relations (see Table 1 and Fig. 5).

Realism ratings: We asked participants to rate the degree to which images appeared to be a photograph versus computer generated (a1) on a five-point scale (1 = computer generated, 5 = photograph). These ratings strongly correlated with the human realism scores from Study I (ρ = .80). The participants for the two tasks were different, so this demonstrates the stability of human perception of visual realism across both measurements.

Familiarity: Familiarity attributes (a2, a4−5) correlated substantially with realism (ρs = .23, −.33, −.36, respectively). This might be because people obtain greater capacity for assessing image realism from prior exposure to similar scenes. Consistent with this, previous research suggests that people have specific memories of common entities such as

Table 1. Image attributes (Attr), related survey item, attribute category, and their Spearman's rank correlations (ρ) with ground-truth image realism scores (from Study I). Meaningful and statistically significant correlations (|ρ| > .15, p < .05) are highlighted in bold. Numbers in parentheses are participants' mean ratings for each attribute, standardized to a scale of 0 to 1.

Attr  Survey item                            Category         ρ
a1    Appears to be a photograph? (.68)      Realism           .80*
a2    Familiar with the scene? (.60)         Familiarity       .23*
a3    Familiar with the objects? (.76)       Familiarity       .15*
a4    Unusual or strange? (.28)              Familiarity      -.33*
a5    Mysterious? (.32)                      Familiarity      -.36*
a6    Lighting effect natural? (.74)         Illumination      .49*
a7    Shadows in the image? (.60)            Illumination     -.15*
a8    How sharp are the shadows? (.37)       Illumination     -.07*
a9    Color appearance natural? (.82)        Color             .47*
a10   Colors go well together? (.88)         Color             .15*
a11   Colorful? (.53)                        Color             .05
a12   Image quality (.69)                    Quality           .04
a13   Image sharpness (.72)                  Quality           .10*
a14   Expert photography? (.57)              Aesthetics        .33*
a15   Attractive to you? (.69)               Aesthetics        .03
a16   Close-range or distant-view? (.63)     Layout            .04
a17   Have objects of focus? (.71)           Layout            .00
a18   Neat space? (.70)                      Layout            .10*
a19   Empty space? (.48)                     Layout            .01
a20   Perspective natural? (.75)             Layout            .33*
a21   Clean scene and objects? (.83)         Layout            .07*
a22   Makes you happy? (.60)                 Emotions          .08
a23   Makes you sad? (.08)                   Emotions         -.10
a24   Exciting? (.56)                        Emotions         -.16*
a25   Contain fine details? (.58)            Texture          -.03
a26   Dynamic scene? (.33)                   Semantics        -.15*
a27   Is there a storyline? (.43)            Semantics        -.25*
a28   Contain living objects? (.36)          Semantics         .06
a29   Naturalness of objects? (.77)          Semantics         .36*
a30   Object combinations natural? (.76)     Semantics         .20*
a31   Number of unique objects (.60)         Semantics        -.09*
a32   Total number of objects (.72)          Semantics        -.06
a33   Number of people (.49)                 Human semantics  -.08
a34   Face visible? (.18)                    Human semantics   .24*
a35   Is the person attractive? (.35)        Human semantics  -.12
a36   Making eye contact with viewer? (.12)  Human semantics   .13
a37   Posing for the image? (.22)            Human semantics  -.10
a38   Human activities (.48)                 Human semantics   .01
a39   Human expressions (.40)                Human semantics   .03
a40   Expression genuine? (.43)              Human semantics   .38*
* p < .05.
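The per-attribute correlations in Table 1 are plain Spearman rank correlations between mean attribute ratings and the Study I realism scores. A minimal sketch with synthetic ratings (the data below are illustrative, not the paper's), using `scipy.stats.spearmanr`:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical per-image values: ground-truth realism scores (Study I) and
# mean ratings for one attribute, e.g. a6 "Lighting effect natural?".
realism = rng.uniform(0.0, 1.0, size=200)
lighting = realism + rng.normal(0.0, 0.35, size=200)  # noisy, positively related

rho, p = spearmanr(lighting, realism)
print(f"rho = {rho:.2f}, p = {p:.3g}")
```

With real ratings, a moderate positive ρ here would correspond to entries such as a6 (.49) in Table 1.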

Figure 5. Distribution of realism R of images with respect to lighting naturalness L (a), color naturalness C (b), degree of expert photography E (c), unusualness U (d), mysteriousness M (e), and degree of having a storyline S (f). Also shown are example images that demonstrate these correlations (e.g., the left image in (f) does not seem to have a storyline, yet is more realistic). In the left graph of each set, the black line is y = x (first row) or y = 1 − x (second row); the red line is the linear regression over all image points.
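Sec. 3.2 also reports one-way ANOVAs for the effect of scene and object type on realism. The test amounts to comparing realism scores grouped by category; a sketch on hypothetical groups (category names and values are illustrative), using `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Hypothetical realism scores grouped by scene category (names illustrative).
groups = {
    "landscape": rng.normal(0.70, 0.15, size=80).clip(0, 1),
    "indoor":    rng.normal(0.55, 0.15, size=80).clip(0, 1),
    "person":    rng.normal(0.45, 0.15, size=80).clip(0, 1),
}
F, p = f_oneway(*groups.values())
df1 = len(groups) - 1
df2 = sum(len(g) for g in groups.values()) - len(groups)
print(f"F({df1}, {df2}) = {F:.2f}, p = {p:.3g}")
```

A significant F, as in the paper's Fs(12, 2507) > 4.81, would indicate that mean realism differs across categories.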

the sky or skin. Therefore an image may look more natural or realistic if the coloring of image entities coheres with memory representations [14, 3].

Color: Color naturalness (a9) moderately correlated with realism (ρ = .47), which is consistent with previous findings on image composites [14, 28]. However, there was no significant correlation between colorfulness (a11) and realism, which contrasts with [3], who found colorfulness to be a key attribute of image naturalness. This may imply that image naturalness and image visual realism are not based on entirely the same perceptual processes. In previous studies, naturalness was defined as the degree of correspondence between an image presented on an imaging device and memories of real-life scenes [3], whereas we define visual realism as the degree to which an image appears to be a photograph versus computer generated. As these distinct definitions suggest, naturalness and visual realism have intrinsic differences in evaluation criteria and perceptual process.

Illumination: The naturalness of lighting (a6) correlated moderately with realism (ρ = .49), suggesting the importance of illumination for realism. This accords with previous research suggesting that image properties like illumination, shadow, and surface roughness are important factors for CG fidelity [21, 7]. However, we did not observe a meaningful correlation between shadow characteristics (a7−8) and realism. This contrasts with prior research suggesting that shadow softness is an important factor for CG fidelity [21].

This difference might be because [21] used images of simple objects, while our images consisted of varied scenes entailing more complex and varied shadowing effects. Alternatively, whereas [21] used a fixed viewing environment and rendering parameters, ours were uncontrolled.

Aesthetics: The degree to which an image appeared to be a work of expert photography (a14), an aesthetics attribute, moderately correlated with realism (ρ = .33). Interestingly, this correlation was negative (ρ = −.23) for images with realism scores greater than 0.8, which might suggest that greater aesthetic quality in a highly realistic image can lower its perceived realism. This is consistent with prior research on human skin rendering [8], which suggests that maximal attractiveness and extreme realism are opposing perceptions. Despite the somewhat non-linear relationship between aesthetics and realism, we still used linear regression for simplicity; modeling the non-linear relationship is left to future work.

Spatial layout: We found that the naturalness of perspective (a20) influences realism (ρ = .33). This has been noted in the CG community by [6], who investigated the impact of viewpoint on the apparent realism of virtual crowds.

Semantics: The naturalness of object appearances (a29) and of object combinations (a30) both correlated moderately with realism (ρ = .36 and .20, respectively). However, object statistics (a31−32) did not appear to influence realism. This accords with [21], who showed that the number and diversity of objects have minimal influence on realism. The amount of semantic information an image conveys (a27) negatively correlated with realism (ρ = −.25), suggesting that explicitly dramatic scenes appear less realistic. We performed one-way ANOVAs to investigate the effect of scene and object type on visual realism (for detailed categories see Fig. 3). Results suggested a significant effect of scene and object types on realism, Fs(12, 2507) > 4.81, ps < .05.

3.3. Empirical visual realism model

We used feature selection and multiple regression to determine which factors most influenced visual realism. Image visual realism was then modeled by the major factors based on the psychophysical data.

Feature selection: We used the attributes as features for training a support vector regressor (SVR) [2] to predict image realism, using grid search to select the cost, RBF kernel parameter γ, and ε hyperparameters. We split the 1260 images of the annotated subset into 80% for training and 20% for testing, and performed greedy feature selection. Prediction performance was evaluated using Spearman's rank coefficient between predicted realism scores and human realism scores (from Study I). As shown in Fig. 6, performance improved with more attributes, but improved little beyond 10 attributes. Therefore we selected the top 10 attributes for modeling visual realism. Some attributes with small correlations with realism individually had stronger correlations jointly (Fig. 6), such as attractiveness (a15), image quality (a12), and presence of living objects (a28).

Figure 6. Feature selection results. Left: Spearman's rank correlation between predicted realism scores and human realism scores as a function of the number of predictor attributes. Right: Independent prediction performance of the top 10 attributes.

Principal component factor analysis: Several attributes from feature selection were correlated, such as mysteriousness and strangeness. We performed a principal component (PC) factor analysis with varimax rotation [22] to remove the high inter-correlations and identify a compact set of attributes related to realism. The 10 attributes from feature selection were grouped into 4 major PCs, which most strongly correlated with naturalness, aesthetics, familiarity, and semantics, respectively (Table 2). The "Cumulative variability" row of Table 2 shows that the 4 PCs accounted for nearly 65% of the variability in the 10 attributes.

Multiple regression: Finally, PC scores were computed as a weighted average of the 10 attributes (with factor loadings as weights). We predicted realism scores from these PC scores using multiple regression, adjusted R² = .44, p < .001. As seen in Table 3, naturalness strongly predicted realism, while aesthetics, familiarity, and semantics weakly but significantly predicted realism. The relative predictive ability of this statistical model is consistent with the computational performance of each component presented in Sec. 4.2 (Table 4).

Table 3. Principal components, their standardized coefficients (β), t values, and significance (p) in multiple regression with realism.
Component     β     t       p
Naturalness   .63   29.98   .000
Aesthetics    .14    6.53   .000
Familiarity  -.11   -5.19   .000
Semantics    -.12   -5.54   .000

4. Computational visual realism

We designed features motivated by the image attributes relevant to visual realism and built a computational model for quantitative realism assessment based on these features. We also compared our model with state-of-the-art algorithms, using human realism scores as a benchmark.

4.1. Image features for visual realism

Based on our psychophysics studies, visual realism correlated strongest with naturalness, aesthetics, familiarity, and

Table 2. The loadings of the 10 selected attributes on the 4 major principal components (PCs). Bold numbers are the strongest loading of each attribute on one of the PCs. The "Cumulative variability" row shows how each PC cumulatively explains the variability of the 10 attributes in the presented sequence.

Attribute                               1 (Naturalness)  2 (Aesthetics)  3 (Familiarity)  4 (Semantics)
Naturalness of color appearance (a9)      .88              .13             -.19              .03
Naturalness of lighting effect (a6)       .87              .20             -.17              .02
Image quality (a12)                       .06              .79             -.03             -.05
Attractiveness (a15)                      .10              .74             -.20              .18
Expert photography (a14)                  .44              .67             -.02             -.10
Unusualness/strangeness (a4)             -.07             -.07              .71              .03
Mysteriousness (a5)                      -.10             -.03              .71              .19
Objects familiarity (a3)                  .27              .17             -.64              .21
Containing living objects (a28)           .10              .00             -.16              .78
Having storyline (a27)                   -.08              .03              .31              .71
Cumulative variability explained (%)    18.41            35.53            52.03            64.36
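The Table 2 / Table 3 pipeline, factor analysis with varimax rotation on the 10 selected attributes followed by regressing realism on the factor scores, can be sketched as follows. This is a minimal sketch on synthetic data, assuming scikit-learn's `FactorAnalysis(rotation="varimax")` as a stand-in for the original analysis; all numbers below are fabricated for illustration, with regression weights merely echoing Table 3:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 500

# Synthetic stand-ins: 4 latent components generate 10 attribute ratings.
latent = rng.normal(size=(n, 4))              # naturalness, aesthetics, familiarity, semantics
loadings = rng.normal(scale=0.8, size=(4, 10))
attributes = latent @ loadings + rng.normal(scale=0.5, size=(n, 10))
realism = (0.63 * latent[:, 0] + 0.14 * latent[:, 1]
           - 0.11 * latent[:, 2] - 0.12 * latent[:, 3]
           + rng.normal(scale=0.5, size=n))   # coefficients echo Table 3

fa = FactorAnalysis(n_components=4, rotation="varimax")
pc_scores = fa.fit_transform(attributes)      # per-image scores on the 4 rotated factors

reg = LinearRegression().fit(pc_scores, realism)
r2 = reg.score(pc_scores, realism)
print(f"R^2 = {r2:.2f}")
```

Inspecting `fa.components_` would give a loadings matrix analogous to Table 2, and the fitted coefficients play the role of the β values in Table 3.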

semantics. We identified automated methods to compute feature values corresponding to these attributes. Instead of simple concatenation, we applied a kernel sum to fuse the features for support vector regression.

Naturalness: We modeled image naturalness in three ways. First, [7] suggested that shading and reflectance affect visual realism differently, which inspired us to model naturalness using intrinsic image components. We first decomposed each image into intrinsic components by extending the Retinex algorithm into RGB space [9]. We then computed three 256-bin histograms per image, representing the shading and reflectance components as well as the original image, and further calculated the histogram difference between the intrinsic components and the original image. Second, based on [25], we calculated image naturalness statistics derived from local patch (3×3) structures and the image power spectrum. Finally, unnaturalness was modeled using the method of [10], which identified simple and uniform colors and strong edges as characteristics of CG.

Aesthetics: We applied Ke's method [11] for extracting aesthetics features, which considers image properties like edge distributions, blur, and contrast. We also used local self-similarity geometric patterns (SSIM [24]) to represent content symmetry, which is often regarded as a measure of aesthetics. We densely sampled the SSIM descriptors with a grid spacing of 4 and learned a dictionary of size 100, then used 2-level spatial pyramid pooling on the descriptors.

Familiarity: First, we defined a measure for semantic familiarity using the content-based similarity measure commonly used in image retrieval. We used 10,000 images from the SIMPLIcity dataset [27] as a pre-determined anchor database of images with common scenes and objects. We then computed image similarity using color, illumination, and texture information [13], and performed robust content-based matching against the anchor database. Although primarily meant for image retrieval applications, we used this measure here to quantify familiarity; the familiarity measure was given by the distances of the top 50 matches. Second, [14, 3] suggested that an image may look more realistic if its coloring coheres with memory representations. We included color compatibility [14] as a measure of color familiarity. We also included color name features learned from real-world images [26] to better represent daily color compositions; we densely sampled the feature with a grid spacing of 4, learned a dictionary of size 256, and applied 2-level spatial pyramid pooling to obtain the color descriptors.

Semantics: We applied GIST [20] to model scene structure using 4×4 image blocks. We used automatic Object Bank (OB) [15] to model the presence of a pre-defined set of objects. In OB, an image is represented as a collection of response maps of a large number of pre-trained generic object detectors; we used max pooling on the OB features.

4.2. Results and evaluation

Evaluation methods: We evaluated our method by its ability to predict realism scores. As a simple application, our model was also evaluated on classifying images as photos or CG. For prediction, we used the human realism scores from Study I as ground truth and Spearman's rank correlation to evaluate the prediction performance of the SVR. For classification, image-type labels were used as ground truth and the area under the ROC curve as the evaluation measure (realism scores from the SVR were treated as image-type probabilities: high realism scores correspond to photo and low to CG). The SVR settings are as described in Sec. 3.3.

Evaluation results: In Table 4 and Fig. 7, we compare the performance of various computational methods, as well as human judgment and the attributes annotation that motivated our features. For human judgment, we treated the human realism scores from Study I (Sec. 2.2) as image-type probabilities in

Table 4. Experimental results of realism prediction and image classi- which include high-order correlations of wavelet coefficients
fication. ρ1 and A1 are respectively the Spearman’s rank correlation, [16], physics-motivated geometry structure [19], camera
and area under ROC curve on annotated subset, ρ2 and A2 are those noise [5], and color compatibility for evaluating the real-
on whole dataset1 . The best result from computational features on ism of image composites [14]. We further tested some well
each evaluation metric is highlighted in bold. known object and scene features like SIFT, GIST, HOG2x2,
Prediction Classification and LBP, computed from an open-source library [12]. Fi-
Category Feature type
ρ1 ρ2 A1 A2 nally, we investigated unsupervised feature learning. We
Human Human .652 n.a. .79 .88 adopted the unsupervised feature learning framework with
Naturalness .52 n.a. .62 n.a. a single-layer triangular K-means encoding [4] on image
Aesthetics .39 n.a. .64 n.a. patches preprocessed by local intensity and contrast normal-
Attributes
Familiarity .39 n.a. .57 n.a. ization, as well as whitening. During test, we scan an image
annotation
Semantics .30 n.a. .61 n.a. with 16-by-16 pixel receptive field and 1 pixel stride, before
All combined .66 n.a. .67 n.a. mapping the preprocessed image patches to 256-dimensional
Naturalness .38 .45 .66 .74 feature vectors. The details on feature computation can be
Aesthetics .34 .42 .65 .73
Our found in our project website [1].
Familiarity .33 .42 .64 .74
method Our results suggest the following three things:
Semantics .28 .37 .61 .67
All combined .41 .51 .68 .77 First, both attributes annotation and our features predicted
Signal feature:
    Wavelet [16]                .16   .20   .56   .63
    Geometry feature [19]       .31   .47   .64   .74
    Camera noise [5]            .04   .06   .53   .50
    Color compatibility [14]    .20   .23   .57   .61
Object & scene feature:
    SIFT [12]                   .28   .34   .61   .66
    GIST [12]                   .16   .23   .58   .61
    HOG2x2 [12]                 .28   .33   .58   .66
    LBP [12]                    .25   .30   .59   .64
Feature learning:
    K-means encoding [4]        .28   .37   .63   .71

1 Results are consistently better on the whole dataset than on the annotated subset. This might be because the images in the subset were purposefully selected to produce a uniform distribution of realism scores; such images are intrinsically harder to distinguish.
2 This result is the split-half consistency among participants for Study II.

Figure 7. ROC curve of binary image classification on the whole dataset. Our method outperforms other computer algorithms, yet is still far below human performance.

image realism moderately well (ρs > .28; Table 4). Among the four factors, naturalness predicted best, which is consistent with our regression model (Table 3), indicating that naturalness is the most important of the 4 components.

Second, although the performance of our method was lower than that of attributes annotation on the prediction task, our method slightly outperformed attributes annotation on the classification task (Table 4). This suggests that our attributes-motivated features represent human annotation to a certain degree.

Third, our combined features outperformed the other computer algorithms on all evaluation metrics, suggesting not only that our method is the most similar to human perception, but also that understanding human perception helps create better computational models. The low performance of the camera noise feature might be due to its sensitivity to image compression and post-processing. Unsupervised learning features were among the best, but humans performed best on both tasks.

Limitation: As seen in Fig. 8, our method overpredicted realism for images with unusual scenes (including CG persons), whereas it underpredicted realism for images of common scenes with unusual illumination or image quality. This suggests that one limitation of our method lies in scene understanding. Investigating scene semantics might therefore be fruitful; for example, we could more fully utilize the data collected from LabelMe or explore image context.

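The two evaluation protocols used above (Spearman rank correlation for the realism-prediction task, ROC analysis for the photo-vs-CG classification task) can be sketched in a few lines. The sketch below is dependency-free and uses made-up scores and labels purely for illustration; it is not the paper's actual data or code.

```python
def rank(values):
    # Assign ranks 1..n (this toy data has no ties).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for position, i in enumerate(order, start=1):
        r[i] = float(position)
    return r

def spearman_rho(a, b):
    # Classic formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    ra, rb = rank(a), rank(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def roc_auc(labels, scores):
    # AUC equals the fraction of (positive, negative) pairs ranked correctly.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical human realism scores (fraction of raters judging "photo")
# and hypothetical predictor outputs for the same six images.
human = [0.10, 0.35, 0.50, 0.72, 0.90, 0.25]
model = [0.20, 0.30, 0.55, 0.60, 0.85, 0.40]
labels = [0, 0, 1, 1, 1, 0]  # 1 = photo, 0 = CG (hypothetical)

rho = spearman_rho(human, model)  # prediction task: rank agreement
auc = roc_auc(labels, model)      # classification task: photo vs. CG
print(round(rho, 2), auc)         # -> 0.94 1.0
```

With real data, the human scores would be the proportion of participants rating each image as a photo, and both metrics would be computed over the full benchmark rather than six toy images.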
classification tasks. For attributes annotation, we grouped the 10 selected attributes in Sec. 3.3 into 4 components (defined in Table 2) and used them as training features for SVR. The attributes annotation was available only on the annotated subset, so for comparison we tested all computer methods on both the whole dataset and the annotated subset. We also compared our method with the signal-processing features commonly used in CG and photo classification,

5. Conclusion

In this paper we have shown that predicting image visual realism is a task that can be addressed with current computer vision techniques. We constructed an image realism benchmark dataset and designed a realism predictor motivated by human-annotated image attributes. To the best of our knowledge, this work is the first realism-centric study that

Figure 8. Samples of poorly predicted images by our method. The number on the left under each image is the ground-truth realism score evaluated by humans (H); the number on the right is the realism score predicted by our method (C).

attempted to quantify visual realism of individual images. We have shown a simple application of our realism predictor on image classification. For future work, we will incorporate our realism predictor into perception-based image retrieval and computer graphics rendering. We also plan to develop a web service for image realism prediction [18].

Acknowledgements

We thank Karianto Leman, Miao Jie and Zhang Fan for their help in this research. This work is partially supported by the Open-end Fund in Information and Communication Engineering, Zhejiang, China (No. XKXL1313).

References

[1] Visual realism project. http://www1.i2r.a-star.edu.sg/~ttng/VisualRealism/index.html.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2011.
[3] S. Y. Choi, M. Luo, M. Pointer, and P. Rhodes. Investigation of large display color image appearance-III: Modeling image naturalness. JIST, 2009.
[4] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Conference on Artificial Intelligence and Statistics, 2011.
[5] E. Dirik, S. Bayram, H. Sencar, and N. Memon. New features to identify computer generated images. In ICIP, 2007.
[6] C. Ennis, C. Peters, and C. O'Sullivan. Perceptual effects of scene context and viewpoint for virtual pedestrian crowds. ACM Transactions on Applied Perception (TAP), 2011.
[7] S. Fan, T.-T. Ng, J. Herberg, B. Koenig, and S. Xin. Real or fake?: Human judgments about photographs and computer-generated images of faces. In Technical Briefs, ACM SIGGRAPH Asia, 2012.
[8] F. Giard and M. J. Guitton. Beauty or realism: The dimensions of skin from cognitive sciences to computer graphics. Computers in Human Behavior, 2010.
[9] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In ICCV, 2009.
[10] T. I. Ianeva, A. P. de Vries, and H. Rohrig. Detecting cartoons: A case study in automatic video-genre classification. In International Conference on Multimedia and Expo, 2003.
[11] Y. Ke, X. Tang, and F. Jing. The design of high-level features for photo quality assessment. In CVPR, 2006.
[12] A. Khosla, J. Xiao, A. Torralba, and A. Oliva. Memorability of image regions. In NIPS, 2012.
[13] Kirk. Content based image retrieval. https://github.com/kirk86/ImageRetrieval, 2013.
[14] J. Lalonde and A. Efros. Using color compatibility for assessing image realism. In ICCV, 2007.
[15] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In NIPS, 2010.
[16] S. Lyu and H. Farid. How realistic is photorealistic? IEEE Transactions on Signal Processing, 2005.
[17] G. Meyer, H. Rushmeier, M. Cohen, D. Greenberg, and K. Torrance. An experimental evaluation of computer graphics imagery. ACM Transactions on Graphics, 1986.
[18] T.-T. Ng and S.-F. Chang. An online system for classifying computer graphics images from natural photographs. In Electronic Imaging, 2006.
[19] T.-T. Ng and S.-F. Chang. Discrimination of computer synthesized or recaptured images from real images. In Digital Image Forensics, 2013.
[20] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 2001.
[21] P. Rademacher, J. Lengyel, E. Cutrell, and T. Whitted. Measuring the perception of visual realism in images. In Rendering Techniques, 2001.
[22] J. A. Rice. Mathematical statistics and data analysis. Cengage Learning, 2007.
[23] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 2008.
[24] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In CVPR, 2007.
[25] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S.-C. Zhu. On advances in statistical modeling of natural images. Journal of Mathematical Imaging and Vision, 18(1):17–33, 2003.
[26] J. Van De Weijer, C. Schmid, and J. Verbeek. Learning color names from real-world images. In CVPR, 2007.
[27] J. Z. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. TPAMI, 2001.
[28] S. Xue, A. Agarwala, J. Dorsey, and H. Rushmeier. Understanding and improving the realism of image composites. ACM Transactions on Graphics, 2012.
