Image and Vision Computing 137 (2023) 104771

Contents lists available at ScienceDirect

Image and Vision Computing


journal homepage: www.elsevier.com/locate/imavis

Qualitative failures of image generation models and their application in detecting deepfakes

Ali Borji
Quintic AI, United States

A R T I C L E  I N F O

Keywords:
Generative models
Image and video generation
Qualitative failures
Deepfakes
Image forensics
Object and scene recognition
Neural networks
Deep learning

A B S T R A C T

The remarkable advancement of image and video generation models has led to the creation of exceptionally realistic content, posing challenges in differentiating between genuine and fabricated instances in numerous scenarios. However, despite this progress, a gap remains between the quality of generated images and those found in the real world. To address this, we have reviewed a vast body of literature from both academic publications and social media to identify qualitative shortcomings in image generation models, which we have classified into five categories. By understanding these failures, we can identify areas where these models need improvement, as well as develop strategies for detecting generated images and deepfakes. The prevalence of deepfakes in today's society is a serious concern, and our findings can help mitigate their negative impact. In order to support research in this field, a collection of instances where models have failed is made available here.

1. Introduction

Generated images, also known as synthetic images, are created by machine learning algorithms or other software programs, while real images are captured by cameras or other imaging devices. Generated images are not real-world representations of a scene or object, but rather computer-generated approximations. As such, they lack the authenticity and realism of real images. Deepfakes refer to fabricated media content that has undergone digital alterations to effectively substitute the appearance of one individual with that of another, creating a highly convincing outcome. This paper investigates the indicators that can be utilized for identifying artificially generated images, with a specific focus on detecting deepfakes.

Despite the abundance of anecdotal evidence shared on social media regarding the weaknesses of image generation models, there has yet to be a comprehensive and systematic analysis of these failures. Often, the examples shared by people are selectively chosen to showcase instances in which the models perform well, which may lead to a biased perception of their capabilities, and an overestimation of their effectiveness. While there have been quantitative studies aimed at evaluating and comparing generative models [4,6], such as the use of metrics like FID [15], these measures can be difficult to interpret and are usually calculated over large datasets, making them unsuitable for determining the authenticity of individual images. Quantitative measures for detecting deepfakes do exist [24], but they are not as easily accessible to the general public as qualitative measures, which are simpler to carry out.

As the quality of generated images continues to improve, it is crucial to conduct more in-depth and precise analyses. Thus far, people have been amazed by the ability of synthesized images to approximate natural scenes. When Photoshop was introduced, significant efforts were made to identify manipulated images, and a similar approach is needed for generated images today. It would be beneficial to compile a set of indicators and other resources to aid in detecting generated images and deepfakes.

We present a collection of indicators that can be examined in a single image to determine whether it is genuine or generated. Overall, we offer five classes of these indicators including 1) Human and Animal Body Parts, 2) Geometry, 3) Physics, 4) Semantics and Logic, as well as 5) Text, Noise, and Details, for both portraits and natural landscapes. The advantage of utilizing qualitative cues is that they are easily accessible and can be utilized by anyone, potentially serving as the initial step in detecting deepfakes.

Generated images can appear realistic when viewed from a distance or at high resolutions, making it difficult to discern them from actual photographs. However, at lower resolutions, nearly all generated images lack distinguishable characteristics that set them apart from real photographs. To illustrate, refer to Fig. 1, which depicts a painting by Camille Pissarro featuring intricate details.

E-mail address: aliborji@gmail.com.

https://doi.org/10.1016/j.imavis.2023.104771
Received 27 May 2023; Received in revised form 8 July 2023; Accepted 13 July 2023
Available online 26 July 2023
0262-8856/© 2023 Elsevier B.V. All rights reserved.

Fig. 1. The Fishmarket, Dieppe, 1902 - Camille Pissarro. When observed more closely, it becomes apparent that the faces in the image lack clarity and numerous
details are either incorrect or absent, similar to fake images. Although such images may appear authentic at first glance, scrutinizing them thoroughly is crucial to
avoid overlooking errors. It is advisable to conduct a detailed examination of each object within the image by zooming in and analyzing its shape, features, location,
and interaction with other objects. This approach allows for a more accurate assessment of the image's authenticity and its freedom from errors.

While the overall image may seem satisfactory, closer inspection reveals several missing details such as distorted facial features.

This study has a dual purpose. Firstly, it aims to explore the differences between generated images and real-world images. Therefore, this research complements studies that propose quantitative approaches for evaluating generative models. Secondly, it aims to examine qualitative methods that can be employed to identify deepfakes and train individuals to become proficient in this task, with the added benefit of systematically organizing this knowledge.

2. Related work

2.1. Quantitative and qualitative approaches to evaluate generative models

Quantitative approaches have emerged as a vital tool to evaluate the performance of generative models. These methods rely on quantitative measures to assess how well a model is able to generate realistic data. One commonly used metric is the Inception Score [31], which evaluates the diversity and quality of generated images based on the classification accuracy of a pre-trained classifier. Another popular approach is the Fréchet Inception Distance [15], which uses feature statistics to compare the distribution of generated data with that of real data. Moreover, other metrics such as precision and recall [30] can be used to evaluate the quality of generated samples in specific domains such as vision, text and audio. Some studies have proposed methods to assess the visual realism of generated images (e.g. [11]). These quantitative approaches provide a rigorous and objective way to measure the effectiveness of generative models, helping researchers to improve their models and develop more advanced generative techniques.
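To make the Fréchet Inception Distance concrete, the following is a minimal sketch of the distance computation between two sets of backbone features. It assumes the features have already been extracted (e.g., from an Inception network) and uses NumPy/SciPy, so it illustrates the formula rather than a full evaluation pipeline.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between two Gaussians fitted to (N, D) feature arrays."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can introduce tiny imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```

Lower values indicate that the two feature distributions (and hence the image sets) are statistically closer.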
Recently, two metrics have gained popularity, namely the CLIP score and the CLIP directional similarity (e.g. [25,26]). The CLIP score evaluates the coherence of image and caption pairs by measuring their compatibility. A higher CLIP score indicates a greater degree of compatibility, which can also be interpreted as the semantic similarity between the image and the caption. Moreover, studies have shown that the CLIP score has a strong correlation with human judgement. On the other hand, the CLIP directional similarity is used for generating images based on text prompts while being conditioned on an input image. It assesses the consistency between the differences in the two images (in CLIP space) and the differences in their respective captions.
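As an illustration, the sketch below computes a CLIP score as the rescaled cosine similarity between image and caption embeddings. It assumes the Hugging Face transformers implementation of CLIP and a Pillow image; the checkpoint name is just one publicly available option, and the rescale-and-clip-at-zero convention follows common practice rather than a prescription from this paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between image and caption in CLIP space, scaled to [0, 100]."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((100.0 * (img_emb * txt_emb).sum()).clamp(min=0))
```

The directional variant compares the difference between two image embeddings with the difference between their caption embeddings in the same space.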
To obtain a thorough analysis of quantitative metrics for evaluating generative models, please refer to the following references [4,6,33,36].

Qualitative assessment of generated images entails a human evaluation. The quality of these images is evaluated on various criteria, such as compositionality, image-text alignment, and spatial relations. DrawBench and PartiPrompts are prompt datasets used for qualitative benchmarking that were introduced by Imagen [29] and Parti [35], respectively. These benchmarks allow for side-by-side human evaluation of different image generation models.

PartiPrompts is a rich set of over 1600 prompts in English. It can be used to measure model capabilities across various categories and challenge aspects such as “Basic”, “Complex”, “Writing & Symbols”, etc.


Table 1
Description and examples of the 11 categories in DrawBench, compiled from [29].

Colors: Ability to generate objects with specified colors. Examples: "A blue colored dog."; "A black apple and a green backpack."
Counting: Ability to generate a specified number of objects. Examples: "Three cats and one dog sitting on the grass."; "Five cars on the street."
Conflicting: Ability to generate conflicting interactions b/w objects. Examples: "A horse riding an astronaut."; "A panda making latte art."
DALL-E [27]: Subset of challenging prompts from [27]. Examples: "A triangular purple flower pot."; "A cross-section view of a brain."
Description: Ability to understand complex and long text prompts describing objects. Examples: "A small vessel propelled on water by oars, sails, or an engine."; "A mechanical or electrical device for measuring time."
Marcus et al. [21]: Set of challenging prompts from [21]. Examples: "A pear cut into seven pieces arranged in a ring."; "Paying for a quarter-sized pizza with a pizza-sized quarter."
Misspellings: Ability to understand misspelled prompts. Examples: "Rbefraigerator."; "Tcennis rpacket."
Positional: Ability to generate objects with specified spatial positioning. Examples: "A car on the left of a bus."; "A stop sign on the right of a refrigerator."
Rare Words: Ability to understand rare words (see https://www.merriam-webster.com/topics/obscure-words). Examples: "Artophagous."; "Octothorpe."
Reddit: Set of challenging prompts from the DALLE-2 Reddit (https://www.reddit.com/r/dalle2/). Examples: "A yellow and black bus cruising through the rainforest."; "A medieval painting of the wifi not working."
Text: Ability to generate quoted text. Examples: "A storefront with 'Deep Learning' written on it."; "A sign that says 'Text to Image'."

DrawBench is comprised of a collection of 200 prompts that are divided into 11 categories (Table 1), which aim to assess various capabilities of models. These prompts test a model's ability to accurately render different attributes, such as colors, object counts, spatial relationships, text in the scene, and unusual object interactions. Additionally, the categories include complex prompts that incorporate lengthy, intricate textual descriptions, as well as uncommon words and misspelled prompts. DrawBench was used to directly compare different models, where human evaluators were presented with two sets of images, each consisting of eight samples, one from Model A and the other from Model B. Evaluators were then asked to compare Model A and Model B based on sample fidelity and image-text alignment.

Large-scale datasets have also been used in studies that focus on the qualitative evaluation of generated images (e.g. [2]).

The assessment of models through qualitative methods can be susceptible to errors, potentially leading to an incorrect decision. Conversely, quantitative metrics may not always align with image quality. Therefore, the use of both qualitative and quantitative evaluations is typically recommended to obtain a more robust indication when selecting one model over another.
Fig. 2. Examples of poorly generated faces.


Fig. 3. Fake images can be exposed through background cues.

Fig. 4. Here are some instances of eyes that were generated poorly. The eye in the bottom right corner is an actual photograph of a patient who has an irregularly
shaped pupil. You can refer to this link for more details. This case represents a unique manifestation of a condition known as “cat’s eye Adie-like pupil,” which is
considered a warning sign for ICE syndrome.

Fig. 5. Here are some examples of images where the gaze direction is problematic. In these images, one eye appears to be looking in a different direction compared to
the other, similar to a medical condition called Strabismus in the real world. You can check out https://en.wikipedia.org/wiki/Strabismus for additional information
on this topic.


Fig. 6. Some samples of generated eyeglasses with poor quality.

Fig. 7. Examples of poorly generated teeth.

2.2. Deepfake detection methods

Detection of deepfakes has become an essential area of research due to the increasing sophistication of deep learning algorithms that can generate highly realistic fake images, videos, and audio. As a result, numerous deepfake detection methods have been proposed in recent years, ranging from traditional image and video forensic techniques to advanced deep learning-based approaches. These methods can be broadly categorized into two groups: static and dynamic analysis.

Static analysis methods use handcrafted features to distinguish between real and fake images. Examples of static analysis methods include reverse image search, which compares the content of an image to a large database of known images (e.g. [8]), and error level analysis, which detects inconsistencies in the compression levels of an image [17]. Another method is the use of noise patterns and artifacts, which are common in images and videos captured by digital cameras and can be used to identify forgeries. For instance, the sensor pattern noise in images captured by digital cameras can be used to authenticate images and detect tampering attempts [20]. In addition, traditional forensic techniques such as shadow analysis, lighting analysis, and perspective analysis can also be used to identify inconsistencies in the shadows, lighting, and perspectives of images.
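For illustration, the following is a minimal sketch of the error level analysis idea mentioned above: the image is re-saved as JPEG at a known quality and the per-pixel difference is inspected, since manipulated or synthesized regions often recompress differently from the rest of the picture. The temporary file path and quality setting are arbitrary choices for the example, and the output is only a visualization aid, not a decision rule.

```python
import numpy as np
from PIL import Image, ImageChops

def error_level_analysis(path: str, quality: int = 90) -> np.ndarray:
    """Return a normalized per-pixel error-level map for an image."""
    original = Image.open(path).convert("RGB")
    resaved_path = "resaved_tmp.jpg"  # hypothetical temporary file
    original.save(resaved_path, "JPEG", quality=quality)
    resaved = Image.open(resaved_path).convert("RGB")
    # Regions that were pasted in or generated tend to stand out in this map.
    diff = np.asarray(ImageChops.difference(original, resaved), dtype=np.float32)
    return diff / max(float(diff.max()), 1e-8)
```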
On the other hand, dynamic analysis methods rely on deep neural networks to analyze the temporal features of video and audio data to detect deepfakes. These methods aim to exploit the fact that deepfakes lack the natural temporal variations and correlations that are present in real videos and audios. For instance, the use of convolutional neural networks (CNNs) has been proposed to detect deepfakes by analyzing the spatial features of images and videos (e.g. [1,22,23,10]). Similarly, recurrent neural networks (RNNs) have been proposed to analyze the temporal features of video and audio data to detect deepfakes [14]. Moreover, Generative Adversarial Networks (GANs) [13] have been used to generate fake images and videos, but can also be used to detect them by identifying inconsistencies in the generator's output [19].

Overall, deepfake detection is a challenging problem due to the rapid evolution of deep learning algorithms that can generate more realistic fake content [9]. Thus, a combination of static and dynamic analysis approaches is necessary to achieve effective detection of deepfakes. Additionally, extensive evaluation and comparison of deepfake detection methods are essential to identify their effectiveness and limitations and to guide future research in this area. To read more about this subject, you may want to consult [12,28,24,32], which offer comprehensive reviews on the topic.

Fig. 8. Clues that can reveal fake ears, here through earrings.

3. Qualitative failures of image generation models

We compiled a list of qualitative failures by examining images from various sources, including social media websites such as Twitter, LinkedIn, Discord, and Reddit,1 as well as images from the DiffusionDB dataset [34].2 These images have been generated by notable generative models such as DALL-E 2, Midjourney, StableDiffusion, and Bing Image Creator. Additionally, we analyzed images from websites such as thisxdoesnotexist.com, whichfaceisreal.com, the Adobe Stock library, and openart.ai. We made sure that the text prompts used to generate images were not intentionally seeking peculiar images. Finally, we manually reviewed the images and filtered out the ones without problems.

1 A few of the images used in this work were obtained with the consent of a Reddit user named Kronzky.
2 This dataset includes prompts that were used to generate images.


Fig. 9. Examples of poorly generated hair.

Fig. 10. Examples of poorly generated skin: unnaturally perfect skin with no pores.

3.1. Human and animal body parts

Faces. Since the initial triumphs of GANs, the generation of fake faces has been the most extensively scrutinized category for deep generative models [5]. Faces are comparatively simpler to generate than complex scenes because they are easier to calibrate. In the past, the first generated faces were effortlessly recognizable by humans. However, with the advancement of technology such as StyleGAN [16], the latest examples of generated faces are more challenging to distinguish. Fig. 2 illustrates a few faces that were generated with issues. You can evaluate your ability to distinguish between real and computer-generated faces by taking a quiz at whichfaceisreal.com.

Image Background. When creating generated images and deepfakes, issues with the background of the images may arise, particularly in cases where the face is in focus while the surrounding clues are incorrect. The neural network used to generate the images focuses mainly on the face and may not pay as much attention to the surrounding details. This can lead to strange companions or chaotic forms in the background. Additionally, the objects or people next to the primary person in the image may appear unnatural or “mutant”. Fig. 3 displays several instances of failures as examples.

Eyes and Gaze. Deep generative models have largely overcome issues with early fake images such as cross-eyed, uncentered or different sized pupils, different colored irises, and non-round pupils, as shown in examples in Fig. 4.


Fig. 11. Examples of images with poorly generated limbs and distorted body.

Fig. 12. Issues with AI-generated fingers.

Early GANs used to produce pupils that were not circular or elliptical like those found in real human eyes, which can be a clue that an image is fake. Reflections in the eyes can also be used to identify fake images. Other clues include irregularities in pupil shape, although this is not always indicative of a fake image since some diseases can cause such irregularities. See the example shown in the bottom-right panel in Fig. 4.

Unnatural gaze direction or unrealistic eye movements may be observed in deepfakes, which can indicate that a machine learning algorithm generated or manipulated the image. Please see Fig. 5.

Eyeglasses. Algorithms can struggle to create realistic eyeglasses, with frame structures often differing between the left and right sides, or with one side having an ornament and the other not. Sometimes the frame can appear crooked or jagged. The glasses may partially disappear or blend with the head, and they can be asymmetrical. The view through the lens may also be heavily distorted or illogical, and nose pads may be missing or distorted. Please see Fig. 6 for some examples.

Teeth. Rendering teeth is a difficult task for AI, which often results in odd or asymmetric teeth. When someone's teeth appear unusual or crooked, there's a good chance that the image was generated by AI.


Fig. 13. Generating realistic clothing is a challenge for generative models.

Fig. 14. Examples of lines, edges, and surfaces that are generated poorly by AI.

Semi-regular repeating details like teeth are difficult for models to generate, causing misaligned or distorted teeth. This problem has also been observed in other domains, such as texture synthesis with bricks. Occasionally, an image may display an excessive number of teeth or teeth with abnormal shapes and colors, and in some instances, there may be an insufficient number of incisors. Please see Fig. 7 for some examples.

Ear and Earrings. Ears in AI-generated images may exhibit discrepancies such as differences in size, one ear appearing higher or bigger than the other, or missing or partially missing earrings. Additionally, earrings may be randomly shaped or not match visually. If earrings are asymmetrical or have different features such as one having an attached earlobe while the other doesn't or one being longer than the other, it's likely that the image has been generated by AI. Examples of poorly generated ears and earrings are shown in Fig. 8.

Hair and Whiskers. The style of hair can differ greatly, which also means there is a lot of intricate detail to capture. This makes it one of the most challenging aspects for a model to render accurately. The generated images may contain stray strands of hair in unusual places, or the hair may appear too straight or streaked. Occasionally, the image may resemble acrylic smudges from a palette knife or brush. Another issue may be a strange glow or halo around the hair. In some cases, the model may bunch hair in clumps or create random wisps around the shoulders, while also including thick stray hairs on the forehead. Please see Fig. 9.

Skin. Deepfakes can be deficient in delicate details and subtleties found in genuine images, like skin texture, pores, or fine lines on someone's face.


Fig. 15. Examples of generated images that exhibit issues with perspective.

The skin tone in deepfakes may appear unnatural or inconsistent, such as a person's face appearing too pale or too red. Additionally, deepfakes may lack the presence of noise or grain which exists in real images, giving a sense of texture and realism. Without the presence of noise or grain, deepfake images may seem excessively clean or artificial. Some example failures are shown in Fig. 10.

Limbs, Hands, and Fingers. The models used for generating deepfakes often fall short when it comes to accurately depicting the intricate details of human extremities. For instance, hands may randomly duplicate, fingers can merge together or there may be too many or too few of them, and third legs may unexpectedly appear while existing limbs may disappear without a trace. Furthermore, limbs may be positioned in unrealistic or impossible poses, or there may be an excess number of them. As a result, deepfakes may exhibit unnatural body language, such as unrealistic gestures or postures that are out of place. See Figs. 11 and 12.

Clothing. Generative models may produce distorted clothing with various issues, such as asymmetrical, peculiar, or illogical textures or components such as zippers or collars merging with the skin, and textures abruptly changing or ending. Please refer to Fig. 13 for some of such failures.

3.2. Geometry

Generated images may exhibit anomalous or atypical image geometry, with objects appearing to be of an unusual shape or size, in comparison to their expected proportions.

Straight Lines and Edges. AI-generated images may lack the straight lines, seams, and connections found in real-world objects, resulting in wavy, misaligned, and jumpy renderings (e.g. in tiles). Generated images can also exhibit inconsistent or unnatural image edges, which refer to the boundaries between different parts of the image. Further, surfaces, which are typically straight, may look somewhat uneven in generated images. Some sample failures are shown in Fig. 14.

Perspective. Models lack the ability to understand the 3D world, which results in physically impossible situations when objects cross different planes in a scene. These errors are difficult to detect as our brain often auto-corrects them, requiring a conscious investigation of each angle of the object to identify inconsistencies. Generated images can display an unnatural or distorted perspective, where a person's body appears stretched or compressed unrealistically. They may also have inconsistent or unrealistic camera angles, where a person's face appears to be viewed from an impossible angle or perspective. Some example failures are shown in Fig. 15.

Symmetry. Due to difficulty managing long-distance dependencies in images, symmetry (reflection, radial, translation, etc.) can be challenging for models. For instance, in generated images, eyes may appear heterochromatic and cross-eyed, unlike in real life where they tend to point in the same direction and have the same color. Additionally, asymmetry may appear in facial hair, eyeglasses, and the types of collar or fabric used on the left and right sides of clothing. Models may face challenges in maintaining symmetry not only in faces but also in other objects and scenes. For instance, two shoes in a pair or wings in an airplane might not be exactly the same. This is a type of reasoning glitch where the model cannot understand that certain elements should be symmetrical. Some example failures are shown in Figs. 16 and 17.

Relative Size. Relative size is a visual perceptual cue that helps us understand the size of objects in relation to one another. It is a powerful cue because it allows us to estimate the size of objects even when we do not have any absolute size reference in the scene. Models, however, fall short in synthesizing objects with sizes proportional to their sizes in the real world. Some example failures are shown in Fig. 18.

Other Geometry. Generated images exhibit various geometrical anomalies that may reveal their artificiality.


Fig. 16. Examples of generated images that display inconsistent symmetry.

For instance, their depth cues can be inconsistent or unnatural, causing the foreground or background to seem blurry or devoid of detail. Moreover, they often lack parallax, which is the apparent displacement of objects when viewed from different perspectives, resulting in a flat or two-dimensional appearance. Additionally, incorrect or inconsistent motion blur may suggest that certain parts of the image have been manipulated. The absence of occlusion, i.e., the overlapping of objects in the scene, is another telltale sign of generated images, as it can make the image look flat or unrealistic. Lastly, generated images may display improper image alignment, with objects seeming misaligned or out of place.


Fig. 17. Additional examples of generated images that exhibit inconsistent symmetry.

Fig. 18. Examples of images where there is a violation of relative size.

3.3. Physics

Generated images that violate physics rules exhibit various cues that can give them away as unrealistic or physically impossible. These cues include objects appearing to float in mid-air without support, shadows that are inconsistent with the light source, reflections or refractions that break the laws of optics, objects passing through each other without interaction, and incorrect physics-based simulations such as fluids or cloth that behave in impossible ways. By identifying these cues, it is possible to identify and distinguish realistic images from those that violate the rules of physics.

Reflection. An effective technique for detecting generated images is to examine the lighting and how it interacts with the elements within the image, and how it causes reflections and shadows. Generated images can exhibit artificial reflections that are inconsistent with the natural lighting and environment, such as those in glasses, mirrors, or pupils. The root cause of this issue is that deep generative models lack a proper understanding of reflections. While these models may recognize that an image contains a reflection and typically involves two people (one facing the camera and the other with their back turned), they do not comprehend that the two individuals are, in fact, the same person. Generated images may display other lighting effects that do not match real-world environments, such as lens flares, lens distortion, chromatic aberration and unnatural specular highlights. These effects are frequently observed in genuine photographs due to the physical properties of camera lenses and the way light is refracted through them. Some example failures are shown in Fig. 19.

Shadow. Generated images might not include shadows, which are typically found in real images and contribute to the impression of depth and authenticity. It is important to observe objects without shadows and those with highlights that appear to originate from a different direction than the rest of the image. Additionally, if the photo was taken outdoors in natural light during the afternoon, the setting sun will produce longer shadows than it would at midday, which can be easily identified by scrutinizing the shadow's length. However, this method may not be as precise in artificial lighting conditions. Finally, if there are multiple objects or people within the scene, their shadows should be consistent with each other. Some generated images with inconsistent shadows are shown in Fig. 20.
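One way to operationalize the last point is sketched below: given rough 2-D directions from each object's base to its shadow tip (annotated by hand, since the paper does not prescribe any automation), a single distant light source should produce roughly similar directions across the scene. The function and its interpretation are illustrative assumptions; it ignores the perspective convergence of shadows, so it is only a coarse screening heuristic.

```python
import numpy as np

def shadow_direction_spread(shadow_vectors):
    """Return a 0..1 score of how scattered the annotated shadow directions are.

    shadow_vectors: list of (dx, dy) pairs, one per object, pointing from the
    object's base toward its shadow tip. Values near 0 mean roughly consistent
    directions; values near 1 hint at shadows cast in conflicting directions."""
    angles = np.array([np.arctan2(dy, dx) for dx, dy in shadow_vectors])
    # Length of the mean unit vector (circular statistics): 1 = parallel, 0 = scattered.
    resultant = np.hypot(np.cos(angles).mean(), np.sin(angles).mean())
    return float(1.0 - resultant)

# Example: the third object's shadow points the opposite way, so the score is high.
print(shadow_direction_spread([(1.0, 0.2), (0.9, 0.25), (-0.8, 0.1)]))
```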


Fig. 19. Generated images with inconsistent reflections.

Fig. 20. Generated images with inconsistent shadows.

Objects without Support. When an object or material appears to be floating in mid-air without any visible means of support, it gives the impression that the object is defying gravity or the laws of physics. In reality, all objects are subject to the force of gravity unless they are held up by some other force. When an object appears to be floating, it could be a result of an incorrect rendering or an error in the physics simulation that fails to account for the gravitational force. This type of inconsistency can cause a generated image to look unrealistic or implausible. Some example failures are shown in Fig. 21.


Fig. 21. Generated images where some objects lack visible physical support. Some objects are suspended in mid-air without any explanation or justification. This
lack of physical support could result from a failure to properly simulate or model the forces acting on the objects in the scene.

Fig. 22. Samples of spatial reasoning from [21]. Images are generated by DALL-E 2 for the following text prompts for columns from left to right: “a red basketball with flowers on it, in front of blue one with a similar pattern”, “a red ball on top of a blue pyramid with the pyramid behind a car that is above a toaster”, “a pear cut into seven pieces arranged in a ring”, “In late afternoon in January in New England, a man stands in the shadow of a maple tree”, and “An old man is talking to his parents”.

3.4. Semantics and logic

Images produced by generative models may lack the semantic meaning or contextual relationships present in authentic images. These models tend to focus on the nouns in a given prompt and construct a plausible scene based on them, potentially failing to capture the true relationships between objects. It is crucial to bear in mind that AI lacks an inherent understanding of the world and can only process information in terms of shapes and colors. Complex concepts, such as logical connections and three-dimensional space, are beyond its grasp, resulting in potential difficulties in these areas. For example, when tasked with generating an image of the solar system drawn to scale, a generative model may struggle to maintain the correct planetary order, as demonstrated here.

Spatial Reasoning. Natural scenes are complex and contain a wide range of spatial relationships among objects, such as occlusions, relative distances, and orientations. Capturing these relationships requires the model to have a nuanced understanding of the scene and the objects within it, which can be difficult to achieve without more explicit guidance.


Fig. 23. Generated images with problems with context and scene composition.

Fig. 24. Additional generated images that exhibit semantic issues.

Fig. 25. Generated images that exhibit issues or inconsistencies with the text.


Fig. 26. Top row: problems with color and noise in generated images. Bottom row: fluorescent colors sometimes bleed in from background onto the hair or face.

Fig. 27. Some generated images that look cartoonish or look like paintings.

Furthermore, some image generation models rely solely on pixel-level reconstruction, without explicitly modeling the underlying semantics or spatial relationships. In these cases, the model may generate images that are visually realistic but lack coherent semantic meaning or accurate spatial relationships among objects. Please see Fig. 22 for some examples.

Context and Scene Composition. Generated images can be detected through various inconsistencies such as the background or surroundings not matching the real-world environment, cardinality/counting, missing contextual details, unnatural object placement, and inconsistent image composition. These irregularities may include inconsistencies in order of objects, missing objects or features, objects appearing in the wrong location or orientation, or unnatural arrangement and placement of objects in the image. Please see Fig. 23.

Other Semantics. Fig. 24 depicts several additional generated images that exhibit semantic issues. For instance, one image features a person with his head and feet pointing in opposite directions, while another displays a fragmented pizza that does not cohere into a single entity. In yet another image, a blank painting hangs on the wall, creating a confusing and nonsensical composition.

3.5. Text, noise, and details

Text. Generating text and logos in images requires the generative model to understand the relationships between the text and the visual content of the image. This can be challenging because the text and image data have different structures and are not directly aligned with each other. Additionally, text can appear in various locations and orientations within an image, and the context of the text may change depending on the surrounding visual content. Furthermore, generating text that accurately describes the visual content of an image requires a deep understanding of the semantics and context of both the text and the image. While some progress has been made in recent years with the development of methods such as image captioning, it is still an active area of research to develop generative models that can effectively generate text in images. Fig. 25 displays instances where the text is incomprehensible. In such cases, the letters appear scrambled or duplicated, and the words are spelled incorrectly.

Noise, Color, and Blur Artifacts. Digital distortion in the form of pixelation or imperfect coloring can be present in generated images, particularly around the image edges. Monochrome areas may display semi-regular noise with horizontal or vertical banding, potentially due to the network attempting to replicate cloth textures. Older GANs tend to produce a more noticeable checkerboard noise pattern. Other telltale signs of generated images include inconsistencies in color or tone, oversaturation or undersaturation of colors, and unnatural image noise patterns. See the top row in Fig. 26.
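As a rough illustration of how such periodic noise can be surfaced, the sketch below inspects the image's 2-D frequency spectrum, where checkerboard-style upsampling artifacts from older generators tend to show up as excess energy away from the spectrum's center. The radius cutoff and the returned ratio are illustrative choices, not calibrated thresholds from this paper.

```python
import numpy as np
from PIL import Image

def high_frequency_ratio(path: str, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy beyond a normalized radius from the spectrum center."""
    gray = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray)))
    h, w = spectrum.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2.0, xx - w / 2.0) / (min(h, w) / 2.0)
    # Periodic checkerboard artifacts appear as bright off-center peaks.
    return float(spectrum[radius > cutoff].sum() / spectrum.sum())
```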


Fig. 28. Generated images with flawed details.

Fig. 29. Example failures of generated complex scenes. Achieving accurate and detailed rendering in these types of images is particularly difficult due to the large
number of objects and the intricate relationships between them.


Fig. 30. Generated crowd scenes with issues.

Fluorescent bleed, where bright colors bleed onto the hair or face of a person in the image from the background, is also a potential indicator of a generated image (the bottom row in Fig. 26). The human attention system is naturally adept at quickly recognizing these patterns, making them useful tools for identifying generated images.

Images with Cartoonish Look. AI-generated images may look cartoonish or may look like a painting. This could be due to several reasons such as inconsistent or unnatural image texture, lack of depth, or focus. Some examples are shown in Fig. 27.

Fine-grained Details. AI-generated images may contain technical details that are either incorrect or appear as random shapes. For example, furniture legs can be particularly challenging for AI to accurately render, resulting in incorrect numbers of legs or physically impossible configurations. These issues can be attributed to the inherent difficulty of modeling complex objects and the limitations of the AI's understanding of the real world. Some example failures are shown in Fig. 28.

Accurately rendering all details in complex scenes or crowd scenes, such as those depicted in Figs. 29 and 30, can be particularly challenging for AI. The complexity of these scenes makes it difficult for the AI to accurately model every detail and can lead to errors in object placement, lighting, perspective, and other features. Despite the challenges, AI technology continues to improve, and advancements are being made in the generation of more realistic and believable large and crowd scenes.

4. Discussion

4.1. Other cues

In addition to the cues discussed above, there are several other indicators that can be used to identify generated images and deepfakes. One such method involves examining the metadata of an image or conducting a reverse Google search to verify its authenticity. Additionally, common sense can be applied to detect images that are likely to be generated, such as a shark swimming down a street or aliens eating sushi in a Chinese restaurant. Other indications of generated images and deepfakes include lack of motion blur, unnatural bokeh, all objects appearing in focus, and repeated patterns in the image.
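A hedged example of the metadata check mentioned above: real photographs usually carry camera EXIF tags (make, model, exposure), whereas generated images typically do not, although metadata is easily stripped, so its absence is only a weak hint. The sketch assumes Pillow is available, and the file name is a placeholder.

```python
from PIL import Image
from PIL.ExifTags import TAGS

def camera_metadata(path: str) -> dict:
    """Return human-readable EXIF tags for a quick plausibility check."""
    exif = Image.open(path).getexif()
    return {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

# Usage sketch: an empty dict (or no camera make/model) is a hint, not proof,
# that the file did not come straight from a camera.
print(camera_metadata("suspect.jpg"))  # "suspect.jpg" is a hypothetical path
```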
4.2. Some challenging objects

Generative models face particular challenges when it comes to generating images of objects such as clocks, Lego houses, chessboards, carpets, circuit boards, basketballs, glasses of water, dice, diagrams and tables, keyboards, and computer screens. One of the reasons for this is that these types of images contain many repeated patterns, which can be difficult for the model to accurately capture. Several examples of failed attempts to generate these objects can be seen in Figs. 31 and 32. This list of challenging objects can be used to assess and compare the performance of different image generation models.

4.3. Memorization and copyright

As previously mentioned, a method for identifying whether an image is generated or not is through reverse image search. Generative models may memorize images partially or in their entirety, as seen in the examples presented in Fig. 33. This phenomenon has raised concerns regarding copyright infringement, as generated images may include watermarks from the original images. For more information on this issue, please refer to this link. Please also see Fig. 34.
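To give a concrete flavor of how memorized or near-duplicate outputs can be caught, here is a minimal average-hash sketch; comparing a generated image's hash against hashes of known images is the basic idea behind reverse image search. The hash size and the distance threshold mentioned in the comment are illustrative, and production systems use far more robust descriptors.

```python
import numpy as np
from PIL import Image

def average_hash(path: str, size: int = 8) -> int:
    """64-bit perceptual hash: downsample, then threshold at the mean brightness."""
    gray = np.asarray(Image.open(path).convert("L").resize((size, size)),
                      dtype=np.float32)
    bits = (gray > gray.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Usage sketch: a Hamming distance of only a few bits between a generated image
# and a known artwork suggests partial memorization of that source.
```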
4.4. Failure modes from other studies

Reports on certain image generation techniques may include failure modes to provide readers with a more comprehensive understanding of their models' limitations. For instance, the creators of the Parti image generator [35] have presented some examples of such failure cases, which are illustrated in Fig. 35. These failure cases can be categorized into the errors discussed earlier. It is recommended that researchers in this field consider including a discussion of their models' failure modes as a best practice.

5. Conclusion and future work

This paper lists several qualitative indicators for identifying generated images and deepfakes. These indicators not only enable us to address the issue of fake images but also underscore the differences between generated and real-world content [7]. Furthermore, they serve as a checklist for evaluating image generation models.

It should be noted that as algorithms improve, some of these clues may become obsolete over time. However, this does not mean that these models will not make any of these mistakes in generating images. It may be necessary to use a combination of these indicators to identify generated images, as there is no one-size-fits-all solution.

Image generation models are becoming increasingly widespread and accessible. However, in the wrong hands, these algorithms can be used to create propaganda and other forms of fake media.


Fig. 31. Some objects that are difficult for models to generate.

In a world rife with fake news [18], we have learned not to believe everything we read. Now, we must also exercise caution when it comes to visual media. The blurring of lines between reality and fiction could transform our cultural landscape from one primarily based on truth to one characterized by artificiality and deception. As we have demonstrated with the set of cues presented here, it is possible to identify fake images. In fact, in an informal investigation, we were able to use some of these indicators to detect fake faces with high accuracy in the quiz available on whichfaceisreal.com. Subsequent research can assess the extent to which these cues contribute to the detection of generated images and deepfakes by conducting behavioral experiments involving human participants.


Fig. 32. Additional challenging objects for models to generate.

Fig. 33. The images on the left side of each pair are generated by StableDiffusion. One pair shows an oil painting of American Gothic by Hieronymus Bosch, while the
other pair depicts The Ghosts of Hokusai.

Although visual inspection can be useful in identifying generated images, it may not be comprehensive enough to detect all types of generated content. Thus, integrating alternative approaches such as machine learning algorithms or forensic analysis can provide a more comprehensive strategy. Moreover, it is vital to stay informed about the latest advancements and techniques in this field, as it is continuously evolving.


Fig. 34. Images that violate copyright generated by StableDiffusion.

Fig. 35. Sample failures of the Parti image generation model. Please refer here to see high resolution images.

In this study, we focused on still images. However, for videos, additional indicators beyond those outlined here, such as motion and optical flow, as well as the synchronization of lip, face, and head movements over time, can also be significant factors [3]. One can undertake comparable initiatives to investigate indicators for identifying counterfeit audio. Educating individuals on the cues outlined in this paper may aid in combating deepfake proliferation. It would be worthwhile to investigate whether individuals can be effectively trained to become experts in this area.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

[1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, Isao Echizen, Mesonet: a compact facial video forgery detection network, in: 2018 IEEE international workshop on information forensics and security (WIFS), IEEE, 2018, pp. 1–7.
[2] Yannick Assogba, Adam Pearce, Madison Elliott, Large scale qualitative evaluation of generative image model outputs, arXiv preprint arXiv:2301.04518, 2023.


[3] Matyáš Boháček, Hany Farid, Protecting world leaders against deep fakes using facial, gestural, and vocal mannerisms, Proc. Nat. Acad. Sci. 119 (48) (2022), e2216035119.
[4] Ali Borji, Pros and cons of gan evaluation measures, Comput. Vis. Image Underst. 179 (2019) 41–65.
[5] Ali Borji, Generated faces in the wild: Quantitative comparison of stable diffusion, midjourney and dall-e 2, arXiv preprint arXiv:2210.00586, 2022.
[6] Ali Borji, Pros and cons of gan evaluation measures: New developments, Comput. Vis. Image Underst. 215 (2022), 103329.
[7] Ali Borji, A categorical archive of chatgpt failures, arXiv preprint arXiv:2302.03494, 2023.
[8] Yixin Chen, Vassil Roussev, G. Richard, Yun Gao, Content-based image retrieval for digital forensics, in: Advances in Digital Forensics: IFIP International Conference on Digital Forensics, National Center for Forensic Science, Orlando, Florida, February 13–16, 2005 1, Springer, 2005, pp. 271–282.
[9] Bobby Chesney, Danielle Citron, Deep fakes: A looming challenge for privacy, democracy, and national security, Calif. L. Rev. 107 (2019) 1753.
[10] Davide Cozzolino, Justus Thies, Andreas Rössler, Christian Riess, Matthias Nießner, Luisa Verdoliva, Forensictransfer: Weakly-supervised domain adaptation for forgery detection, arXiv preprint arXiv:1812.02510, 2018.
[11] Luka Dragar, Peter Peer, Vitomir Štruc, Borut Batagelj, Beyond detection: Visual realism assessment of deepfakes, arXiv preprint arXiv:2306.05985, 2023.
[12] Jessica Fridrich, Digital image forensics, IEEE Signal Process. Mag. 26 (2) (2009) 26–37.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative adversarial networks, Commun. ACM 63 (11) (2020) 139–144.
[14] David Güera, Edward J. Delp, Deepfake video detection using recurrent neural networks, in: 2018 15th IEEE international conference on advanced video and signal based surveillance (AVSS), IEEE, 2018, pp. 1–6.
[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter, Gans trained by a two time-scale update rule converge to a local nash equilibrium, Adv. Neural Inf. Process. Syst. 30 (2017).
[16] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila, Analyzing and improving the image quality of stylegan, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119.
[17] Eric Kee, Micah K. Johnson, Hany Farid, Digital image authentication from jpeg headers, IEEE Trans. Inf. Forensics Secur. 6 (3) (2011) 1066–1075.
[18] David M.J. Lazer, Matthew A. Baum, Yochai Benkler, Adam J. Berinsky, Kelly M. Greenhill, Filippo Menczer, Miriam J. Metzger, Brendan Nyhan, Gordon Pennycook, David Rothschild, et al., The science of fake news, Science 359 (6380) (2018) 1094–1096.
[19] Yuezun Li, Siwei Lyu, Exposing deepfake videos by detecting face warping artifacts, arXiv preprint arXiv:1811.00656, 2018.
[20] Jan Lukas, Jessica Fridrich, Miroslav Goljan, Digital camera identification from sensor pattern noise, IEEE Trans. Inf. Forensics Secur. 1 (2) (2006) 205–214.
[21] Gary Marcus, Ernest Davis, Scott Aaronson, A very preliminary analysis of dall-e 2, arXiv preprint arXiv:2204.13807, 2022.
[22] Huaxiao Mo, Bolin Chen, Weiqi Luo, Fake faces identification via convolutional neural network, in: Proceedings of the 6th ACM workshop on information hiding and multimedia security, 2018, pp. 43–47.
[23] Lakshmanan Nataraj, Tajuddin Manhar Mohammed, Shivkumar Chandrasekaran, Arjuna Flenner, Jawadul H. Bappy, Amit K. Roy-Chowdhury, B.S. Manjunath, Detecting gan generated fake images using co-occurrence matrices, arXiv preprint arXiv:1903.06836, 2019.
[24] Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, Thien Huynh-The, Saeid Nahavandi, Thanh Tam Nguyen, Quoc-Viet Pham, Cuong M. Nguyen, Deep learning for deepfakes creation and detection: A survey, Comput. Vis. Image Underst. 223 (2022), 103525.
[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[26] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen, Hierarchical text-conditional image generation with clip latents, arXiv preprint arXiv:2204.06125, 2022.
[27] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever, Zero-shot text-to-image generation, in: International Conference on Machine Learning, PMLR, 2021, pp. 8821–8831.
[28] Judith A. Redi, Wiem Taktak, Jean-Luc Dugelay, Digital image forensics: a booklet for beginners, Multimed. Tools Appl. 51 (2011) 133–162.
[29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al., Photorealistic text-to-image diffusion models with deep language understanding, Adv. Neural Inf. Process. Syst. 35 (2022) 36479–36494.
[30] Mehdi S.M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, Sylvain Gelly, Assessing generative models via precision and recall, Adv. Neural Inf. Process. Syst. 31 (2018).
[31] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, Improved techniques for training gans, Adv. Neural Inf. Process. Syst. 29 (2016).
[32] Luisa Verdoliva, Media forensics and deepfakes: an overview, IEEE J. Select. Top. Signal Process. 14 (5) (2020) 910–932.
[33] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, Alexei A. Efros, Cnn-generated images are surprisingly easy to spot…for now, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8695–8704.
[34] Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, Duen Horng Chau, Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models, arXiv preprint arXiv:2210.14896, 2022.
[35] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al., Scaling autoregressive models for content-rich text-to-image generation, arXiv preprint arXiv:2206.10789, 2022.
[36] Yu Zeng, Huchuan Lu, Ali Borji, Statistics of deep generated images, arXiv preprint arXiv:1708.02688, 2017.

