
Automatically Detecting Criminal Identity Deception: An Adaptive Detection Algorithm

G. Alan Wang, Student Member, IEEE, Hsinchun Chen, Fellow, IEEE, Jennifer J. Xu, and Homa Atabakhsh

Manuscript received November 18, 2003; revised July 23, 2004. This work was supported by the National Science Foundation, Digital Government Program, "COPLINK Center: Social Network Analysis and Identity Deception Detection for Law Enforcement and Homeland Security," IIS-0429364, 2004–2006. G. A. Wang, H. Chen, and H. Atabakhsh are with the Department of Management Information Systems, University of Arizona, Tucson, AZ 85721 USA (e-mail: gang@eller.arizona.edu; hchen@eller.arizona.edu; homa@eller.arizona.edu). J. J. Xu is with the Department of Computer and Information Systems, Bentley College, Waltham, MA 02452 USA (e-mail: xu@bentley.edu). Digital Object Identifier 10.1109/TSMCA.2006.871799

Abstract—Identity deception, specifically identity concealment, is a serious problem encountered in the law enforcement and intelligence communities. In this paper, the authors discuss techniques that can automatically detect identity deception. Most of the existing techniques are experimental and cannot be easily applied to real applications because of problems such as missing values and large data size. The authors propose an adaptive detection algorithm that adapts well to incomplete identities with missing values and to large datasets containing millions of records. The authors describe three experiments to show that the algorithm is significantly more efficient than the existing record comparison algorithm with little loss in accuracy. It can identify deception having incomplete identities with high precision. In addition, it demonstrates excellent efficiency and scalability for large databases. A case study conducted in another law enforcement agency shows that the authors' algorithm is useful in detecting both intentional deception and unintentional data errors.

Index Terms—Efficiency, identity deception, missing value, scalability.

I. INTRODUCTION

IDENTITY deception occurs when someone intentionally conceals his/her original identity, impersonates another individual's identity, or uses forged identity documents. One of the problems that identity deception may cause is financial loss. For example, the U.K. reports financial losses of at least £1.3 billion each year due to identity deception [1]. More importantly, criminals or terrorists using false identities may cause casualties and property damage too large to be quantifiable. Thus, the identity deception problem has become a central issue in law enforcement and intelligence agencies.

A fabricated identity is difficult for law enforcement or intelligence agents to uncover. Police officers often rely on computer systems to search a suspect's identity against history records in police databases. Generally, computer systems search using exact-match queries. Even if the fabricated identity is similar to the original identity recorded in the law enforcement computer system, an exact-match query is unlikely to bring up that record. Techniques to perform inexact searches have been developed. They can be used to detect deceptive identities by finding records that are similar but not exactly the same. However, most of these techniques are ad hoc and cannot be easily applied to real deception detection applications because of problems such as missing values and large volumes of data. Because a police database usually contains millions of criminal identity records, the detection techniques need to be efficient and scalable enough to examine all deceptive identities. In addition, for any large dataset, it is "unlikely that complete information will be present in all cases" [23]. Missing values contained in past criminal records may greatly affect the accuracy of the detection techniques in finding deceptive identities because of the reduced information.

In this paper, we aim to develop an automated approach that looks for inexact matches for fabricated identities. Such a technique is expected to search through past criminal identity records that may contain missing values and to be efficient enough to handle large volumes of data. In Section II, we briefly discuss the identity deception problem and review some existing deception detection techniques. We also review techniques that handle the missing value problem and those that improve algorithm efficiency and scalability. We present our research questions in Section III. In Section IV, we propose an adaptive detection algorithm for identity deception problems. This algorithm is able to utilize records containing missing values and is scalable to large volumes of identity data. We describe our experimental design in Section V and report the results and discussions in Section VI. We conclude our findings and future directions in the last section.

II. RELATED WORK

A. Identity Deception

Identity is a set of characteristic elements that distinguish a person from others [12], [22]. There are three types of basic identity components, namely: 1) attributed identity; 2) biometric identity; and 3) biographical identity [1], [9]. Attributed identity is the information given to a person at birth, such as name and date and place of birth. Biometric identity contains biometric features that are unique to a person, such as fingerprints. Information that builds up over a life span comprises a person's biographical identity, examples of which are credit history and crime history. Among these three types of identity components, attributed and biographical identities are often subject to deception, whereas biometric features of a person are the most difficult to falsify.

Fig. 1. Taxonomy of identity deception. Each percentage number represents the proportion of cases that contain the particular type of deception.

Deception is "a sender's knowingly transmitting messages intended to foster a false belief or conclusion in the receiver" [7]. This definition originates from the interpersonal communication perspective and also applies to identity deception, which usually occurs in an interactive environment (e.g., during an interrogation). We categorize three types of identity deception based on the method of deception, namely: 1) identity concealment; 2) identity theft; and 3) identity forgery.

Identity concealment is deceiving by omitting or changing details of the true identity [11]. For example, a person may report his birth date with an altered month or day or provide a false first name along with his true last name. This type of deception is popular when a subject unexpectedly encounters a law enforcement officer [15]. Concealment could be more advantageous than using a completely fictitious identity to those who lie about their identities. Subjects may recall partially true information more easily than a completely fictitious identity when questioned repeatedly because the true part of the concealed information serves as recall cues, and cued recall may reconstruct memory better than recall without cues (i.e., free recall) [10]. Hence, the difficulty of recognizing such a deception (e.g., by law enforcement agents) is substantially increased.

Identity theft, also called impersonation, is the action of one person illegally using another person's identity information for fraudulent purposes. Credit card fraud is a good example of identity theft. Identity forgery is committed through the use of forged or faked identity documents such as birth certificates, social security cards, and passports. This is common for illegal aliens who need forged documents to stay unnoticed and, yet, make a living [37].

In this paper, we mainly focus on the problem of identity concealment. We believe a solution to this problem can greatly improve crime investigation by law enforcement and intelligence agencies. We also hope that the solution proposed will be of value in detecting identity theft as well as forgery.

We provided evidence for the existence of identity concealment in [39], in which a taxonomy of identity deception (Fig. 1) was built upon a case study of real criminal identity deception. We found that deception mostly occurs in specific attributes, namely, name, address, date of birth (DOB), and ID number [e.g., the Social Security Number (SSN)]. Name concealment, occurring in most deceptive cases, includes giving a false first name and a true last name or vice versa, changing the middle initial, giving a name pronounced similarly but spelled differently, etc. Concealment made on DOB can consist of, e.g., switching places between the month of birth and the day of birth. Similarly, ID deception is often made by changing a few digits of a social security number or by switching their places. In residency deception, criminals usually change only one portion of the address. For example, the case study found that in about 87% of cases, subjects provided a false street number along with the true street direction, name, and type.

Based on this case study, we observed that a concealed identity often partially matched its original identity. We studied whether a certain technique could utilize such a characteristic and automatically detect this type of identity deception. In the next section, we review techniques that can be used to detect identity deception.

B. Deception Detection Techniques

Detection techniques for general deception have been developed in the behavioral research fields, such as psychology, physiology, and communication. Techniques include the analysis of verbal cues (symptoms of verbal content that are used to determine truth and deception), observing nonverbal cues (indications conveyed through nonverbal communication channels such as facial expression), and measuring physiological reactions (e.g., the polygraph lie detector) [3], [14], [38]. However, detection results from these techniques are quite unreliable [11], [13], [24], [25]. Moreover, these techniques are not automated processes and require human operators.

Practical detection techniques for identity deception are developed in the law enforcement and intelligence communities. First, police officers often use techniques such as repeated questioning and detailed questioning to validate the truthfulness of a suspect's identity. During the questioning process, inconsistent answers may disclose a false identity.
However, those questioning methods are not reliable techniques, especially when dealing with good liars. Consequently, many deceptive records still exist in law enforcement databases. Second, after talking to the crime analysts of the Tucson Police Department (TPD), we find that professional crime analysts can sometimes detect deceptive identities using link analysis techniques. By examining associations among criminals, organizations, and vehicles, a crime analyst is able to build criminal networks. When information about a suspect's identity is incompatible with known relationships represented in the criminal networks, the identity will be flagged as a possible deception. This technique, however, requires great amounts of manual information processing and is very time-consuming. In fact, it often serves as a postinvestigative tool rather than a proactive investigation technique.

Some techniques that were initially designed for crime analysis can be used to detect identity deception. These techniques basically perform data association that links suspects to the crime being investigated, ordered from the most possible to the least possible. Brown and Hagen [5] proposed a similarity-based data association method for associating records of the same suspect or incidents having similar modus operandi (MO). It compares corresponding description attributes of two records and calculates a total similarity measure between the two records. Experiments showed that associations suggested by the algorithm agreed with those made by experts. Both techniques introduced above are automated processes and can be used to detect identity deception by associating a suspect's identity with past criminal records. However, these methods only define similarity measures for categorical (e.g., hair color) and quantitative (e.g., height) attributes, but not for textual noncategorical attributes such as name and address.

A record comparison algorithm specifically targeting the detection of identity deception was proposed in our previous paper [39]. This automated detection method makes use of string comparison techniques and searches for inexact matches of suspects' identities in police databases. This technique examines the attributes of name, address, DOB, and SSN for each identity. It computes a disagreement measure between values in each corresponding attribute of two identities and calculates an overall disagreement value between the two identities as an equally weighted sum of the attribute disagreement measures. The formula for the overall disagreement value is as follows:

d = \sqrt{\frac{d_{Name}^2 + d_{Addr}^2 + d_{SSN}^2 + d_{DOB}^2}{4}}    (1)

where d_{Name}, d_{Addr}, d_{SSN}, and d_{DOB} represent the disagreement measures in the fields of name, address, SSN, and DOB, respectively. Each field value is considered a string of characters. Disagreement between two field values is computed by a string comparator, namely, the Levenshtein edit distance [26], which calculates the minimum number of single-character insertions, deletions, and substitutions required to transform one string into the other. Dividing the edit distance by the length of the longer string normalizes each disagreement value between 0 and 1. If the overall disagreement value d between a suspect's identity and a past identity record is less than a threshold, which can be predetermined by a training process, the algorithm suggests that one identity is a deceptive form of the other. Experiments showed that this algorithm achieved high detection accuracy (94%). However, this method is quite inefficient for large-scale datasets. The computational time complexity of the algorithm is O(N^2) because it compares each pair of records in a dataset, so the computational time grows quadratically as the size of the dataset increases. Furthermore, this method is unable to deal with identities that do not have values in all four fields (i.e., identities containing missing values).
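To make the comparison step above concrete, the sketch below computes the normalized Levenshtein disagreement for each of the four fields and combines them according to Eq. (1). It is an illustration only, not the authors' original implementation; the class name, the sample records, and the 0.3 threshold in the example are assumptions made for this sketch.

```java
// Sketch of the pairwise disagreement measure in Eq. (1), assuming four
// string-valued attributes {name, address, SSN, DOB}. Names are illustrative.
public final class RecordComparison {

    /** Levenshtein edit distance: minimum single-character insertions,
     *  deletions, and substitutions needed to turn s into t. */
    static int editDistance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[s.length()][t.length()];
    }

    /** Field disagreement normalized to [0, 1] by the longer string length. */
    static double disagreement(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 0.0 : (double) editDistance(a, b) / longer;
    }

    /** Overall disagreement d of Eq. (1): equally weighted over the four fields. */
    static double overallDisagreement(String[] r1, String[] r2) {
        double sum = 0.0;
        for (int f = 0; f < 4; f++) {
            double df = disagreement(r1[f], r2[f]);
            sum += df * df;
        }
        return Math.sqrt(sum / 4.0);
    }

    public static void main(String[] args) {
        String[] a = {"John Smith", "123 N Main St", "123456789", "01/02/1970"};
        String[] b = {"Jon Smith",  "125 N Main St", "123456798", "02/01/1970"};
        double d = overallDisagreement(a, b);
        // A pair is flagged as a possible deceptive match when d falls below a
        // threshold learned from training data (the 0.3 value here is assumed).
        System.out.println("d = " + d + ", match = " + (d < 0.3));
    }
}
```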
The record comparison algorithm works better than data association algorithms for detecting identity deception because it specifically captures the concealment deception patterns defined in the taxonomy introduced in the previous section. However, the problems with the record comparison algorithm, namely, the inability to handle missing values and the inefficiency in processing large data volumes, prevent it from being used in real-world applications. In the next two sections, we review techniques that handle the missing value problem and methods that improve algorithm efficiency.

C. Missing Value Problem

Missing values are defined as values excluded from arithmetic calculations because they are missing [8]. In the statistical analysis and data mining fields, there are three major types of strategies that deal with the missing value problem, namely: 1) deletion; 2) imputation; and 3) adaptive data analysis.

Deletion (listwise or pairwise deletion) [6], [16], [18], [23] is the simplest technique to overcome the missing value problem and is easy to implement. Listwise deletion deletes or ignores those data records where missing values occur. Pairwise deletion only excludes records missing information on the variables under examination [17]. Both approaches may result in a great amount of information loss if the fraction of missing values is high [17], [40]. Also, deletion methods may lead to serious statistical biases if the missing values are not randomly distributed [35].

Another alternative is imputation, which fills in missing values with plausible estimates [2], [35]. Such a technique makes use of patterns or statistical associations found in complete records. These patterns are then applied to records with missing values, making estimates of the missing values in each record based on known attribute values. For example, mean imputation [33] replaces a missing value with the mean of the nonmissing values of the same attribute. Some imputation methods can be complex due to the process of finding statistical patterns [31]. However, imputation techniques can only make estimates on numeric or categorical attributes, upon which statistical patterns can be built. Textual attributes, such as names or addresses, can hardly be estimated. Another disadvantage of imputation methods is that they potentially bias datasets by treating artificially imputed values as real ones in subsequent data analysis [30].

In cases where imputation methods cannot reasonably estimate missing values, adaptive data analysis methods are usually developed to minimize their impact.
Timm and Klawonn [36] gave an example with the fuzzy c-means clustering algorithm, in which missing values were omitted and known values were taken into account in calculating the center of each cluster. Quinlan [32] developed an adaptive approach for missing values in decision tree problems. He reduced the information gain from testing an attribute A by the proportion of cases with missing values of A. Experiments showed that this approach performed better than dropping all incomplete cases (i.e., listwise deletion).

In conclusion, listwise or pairwise deletion is not always desirable because it leads to great information loss when there are many missing values. For the problem of identity deception, imputation methods are not appropriate because identity attributes such as names and addresses are textual attributes to which imputation techniques simply do not apply. Therefore, an adaptive data analysis method suitable for our scenario needs to be developed to fully utilize the known attribute values and minimize the impact of those that are unknown.

D. Algorithm Efficiency and Scalability

The efficiency and scalability problem impacts many algorithms that process large amounts of data, such as algorithms for finding duplicate records in large databases involving millions of records. To find all duplicate records in a database, the most reliable way is to compare every record with every other record [27]. Such a method is apparently the most inefficient, especially when it is applied to large databases, because of its O(N^2) time complexity.

Much database research has focused on data comparison efficiency. Hernandez and Stolfo [20] presented a sorted neighborhood method (SNM) for the so-called merge/purge problems, in which data were merged from multiple sources. The SNM has three steps, namely: 1) creating sorting keys; 2) sorting data; and 3) merging duplicates. A key is made by extracting a relevant attribute or a combination of relevant attributes. The selection of a key, determined mainly by domain-dependent knowledge, is critical for the final merging results [21]. The dataset is then sorted by the selected key in the sorting phase. During the merging phase, a window of a fixed size sequentially moves through the sorted dataset from the top. Every new record entering the window is compared with the previous records in the window in search of matching records. To maintain the fixed window size, the first record in the window is dropped when a new record enters a full window. The time complexity of the SNM is O(wN) (the time complexity of the merging phase) if w < log N, or else O(N log N) (the time complexity of the sorting phase), where w is the window size and N is the total number of records in the dataset. Experiments showed that the SNM could achieve high detection accuracy and greatly reduce running time. The SNM assumes that duplicate records sorted by an appropriate key are located close to each other, which is not always the case. One may increase the window size to find potential duplicates; however, this may increase the running time as well.

Monge [28], [29] proposed an adaptive duplicate detection algorithm that further improved the detection efficiency over the SNM. Like the SNM, this method also starts by creating a sorting key and sorting the dataset with the key. Whereas a window sequentially scans the sorted dataset, it does not compare each newly entering record with all existing records in the window. If there are duplicate records existing in the window, the newly entering record is compared with only one of them and the others are ignored. Therefore, the actual number of comparisons w' that a newly entering record makes within the window varies. The time complexity of this algorithm is O(w'N), where w' is usually less than the window size w. Consequently, this adaptive detection method is much more efficient than the SNM. Experiments showed that the detection accuracies of both methods were similar [28].
time as well. the problem of missing values. Our intention is to make use
Monge [28], [29] proposed an adaptive duplicate detection of as many known attribute values as possible and to ignore
algorithm that further improved the detection efficiency over missing values. Deletion methods discard not only attributes
the SNM. Like the SNM, this method also starts by creating that have missing values but also some attribute values that are
a sorting key and sorts the dataset with the key. Whereas not missing. Statistics-based imputation methods try to impute
a window sequentially scans the sorted dataset, it does not missing values based on the statistical relationship between
Authorized licensed use limited to: Somaiya University. Downloaded on September 29,2023 at 15:27:03 UTC from IEEE Xplore. Restrictions apply.
However, they require attributes to be either quantitative or categorical, so that a statistical relationship can be established. In our case, most of the attributes (e.g., name and address) are textual. Statistical relationships between these attributes do not make sense (e.g., it would be strange to conclude that people named "George" usually live on "Broadway Blvd.").

In the pairwise record comparison algorithm, identity records containing missing values are simply discarded (i.e., listwise deletion). In the proposed adaptive detection algorithm, only the missing attributes are ignored, whereas the other available attributes are used in comparing a pair of identities. Here, we assume that every two identities being compared have at least one nonmissing attribute. We also assume that two matching identities have similar values on all attributes. We modify the original formula given in the previous section as

d' = \sqrt{\frac{d_{Name}^2 + d_{Addr}^2 + d_{SSN}^2 + d_{DOB}^2}{a}}    (2)

where a is the number of attributes that are available in both identity records being compared. The disagreement measures on missing attributes are set to zero. The heuristic is similar to what police officers would do when they manually compare two identities. It is obvious that the higher the number of missing values, the less confident the overall disagreement is.
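A minimal sketch of Eq. (2) follows, assuming that the per-field disagreements have already been computed and that a null entry marks an attribute missing in either record; the class name and the sample values are illustrative only.

```java
// Sketch of the adaptive measure in Eq. (2): missing attributes contribute a
// zero disagreement and the sum is averaged over the a attributes present in
// both records. At least one shared non-missing attribute is assumed.
public final class AdaptiveDisagreement {

    /** fieldDisagreements[f] is the normalized disagreement for field f, or
     *  null when either record misses that field. */
    static double overallDisagreement(Double[] fieldDisagreements) {
        double sum = 0.0;
        int available = 0;                      // a in Eq. (2)
        for (Double df : fieldDisagreements) {
            if (df != null) {                   // ignore missing attributes
                sum += df * df;
                available++;
            }
        }
        if (available == 0) {
            throw new IllegalArgumentException("no attribute available in both records");
        }
        return Math.sqrt(sum / available);
    }

    public static void main(String[] args) {
        // Name and DOB compared; address and SSN missing in one of the records.
        Double[] d = {0.10, null, null, 0.20};
        System.out.println(overallDisagreement(d));   // sqrt((0.01 + 0.04) / 2) ≈ 0.158
    }
}
```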
We apply Monge's heuristics to our proposed algorithm to improve efficiency. The first step of Monge's algorithm is to sort the dataset according to a key attribute. Sorting on some attributes may lead to better results than sorting on others. The key attribute can be determined by a training process. However, no single key will be sufficient to catch all matching records in general [21]. Hernandez and Stolfo suggested a multipass approach that executes several independent runs of the algorithm, each time using a different key attribute. On the other hand, the multipass approach increases the computation time. In this study, we only consider the single-pass approach.

The procedure for the revised detection method is shown in Fig. 2. First, the whole dataset is sorted by a chosen key attribute. The window size w is set in step 2, which defines the range of nearby records being compared. The window is represented as a priority queue, which can contain at most w elements (i.e., clusters). The algorithm sequentially examines each record Ri in the sorted dataset starting from the top. In step 7, Ri is first compared with the representative record (the record that represents the cluster; we use the first record of each cluster to simplify the process) of each existing cluster Cj in a priority queue q. If a comparison suggests a match (i.e., the disagreement value of the two records is less than a given threshold) between Ri and Cj's representative, Ri is merged into Cj. If Ri fails to find a match, it continues to be compared with the nonrepresentative records (i.e., the records except the first one) of each Cj in q. If a match is found, Ri is merged into the cluster to which the matched record belongs. If Ri cannot be merged into any cluster in q (such as in the beginning, when no clusters exist in q), a singleton cluster is created for Ri in step 19 and is inserted into q in step 23. The lowest-priority cluster in q (i.e., the cluster first put in the queue) is dropped from q if a new cluster is inserted into an already full queue. If a dropped cluster contains more than one identity record, this indicates that deceptive identities have been found.

Fig. 2. Procedures of the adaptive detection algorithm.

An example makes this clustering process easier to understand. Suppose the dataset is sorted on name and the window size w (i.e., the capacity of the priority queue q) is set to 4. We start by looking at the first record R0 from the top of the sorted dataset. Because q is empty at the beginning, we do not have any clusters to compare against. Therefore, a new cluster C0 is created with R0 as its only record and is put in q. We then examine the next record R1. We first compare R1 with the representative record (R0) of the only cluster C0 in q (step 7). Suppose R1 matches R0 (i.e., the disagreement value of the two records is less than a given threshold); we then include R1 in C0 (step 8) and go back to step 4 to examine the next record R2. Similarly, R2 is first compared with R0, the representative record of cluster C0. If the two records do not match, R2 is compared with R1, the nonrepresentative record in C0 (step 14). If R2 and R1 match, R2 is included in C0. If they do not match, a new cluster C1 is created with R2 as its only record and becomes the second element in q. This procedure is repeated until all records are examined. The first cluster (e.g., C0) will be removed from q when q is full (i.e., when the number of clusters in q is equal to w). Therefore, a new record can only be compared with the records contained in q.
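The following sketch summarizes the loop described above and in Fig. 2, under stated assumptions: records sorted by the name key, a FIFO queue standing in for the priority queue of at most w clusters, the first record of each cluster acting as its representative, and match() standing in for the Eq. (2) comparison against a trained threshold. It illustrates the procedure; it is not the authors' implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Sketch of the adaptive detection loop (Fig. 2) with placeholder comparisons.
public final class AdaptiveDetection {

    static boolean match(String[] r1, String[] r2) {
        return r1[0].equalsIgnoreCase(r2[0]);   // placeholder for Eq. (2) vs. threshold
    }

    /** Returns clusters holding more than one record, i.e., suspected deceptive identities. */
    static List<List<String[]>> detect(List<String[]> records, int w) {
        List<String[]> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing((String[] r) -> r[0]));      // sort by the chosen key

        Deque<List<String[]>> queue = new ArrayDeque<>();             // window of at most w clusters
        List<List<String[]>> flagged = new ArrayList<>();

        for (String[] record : sorted) {
            List<String[]> home = null;
            for (List<String[]> cluster : queue) {                    // compare with representatives first
                if (match(cluster.get(0), record)) { home = cluster; break; }
            }
            if (home == null) {                                       // then with nonrepresentative records
                outer:
                for (List<String[]> cluster : queue) {
                    for (int i = 1; i < cluster.size(); i++) {
                        if (match(cluster.get(i), record)) { home = cluster; break outer; }
                    }
                }
            }
            if (home != null) {
                home.add(record);                                     // merge into the matched cluster
            } else {
                List<String[]> singleton = new ArrayList<>();         // otherwise create a singleton cluster
                singleton.add(record);
                queue.addLast(singleton);
                if (queue.size() > w) {
                    List<String[]> dropped = queue.removeFirst();     // evict the lowest-priority cluster
                    if (dropped.size() > 1) flagged.add(dropped);     // deceptive identities found
                }
            }
        }
        for (List<String[]> remaining : queue) {                      // flush clusters left in the window
            if (remaining.size() > 1) flagged.add(remaining);
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[]{"smith john"}, new String[]{"Smith John"}, new String[]{"doe jane"});
        for (List<String[]> cluster : detect(records, 4)) {
            System.out.println("suspect cluster of " + cluster.size() + " identity records");
        }
    }
}
```

Comparing an incoming record against cluster representatives before the remaining cluster members mirrors Monge's heuristic of avoiding most within-window comparisons.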
The time complexity of the proposed adaptive detection method becomes O(w'N) (the time complexity of the merging phase) if w < log N, or otherwise O(N log N) (the time complexity of the sorting phase), where w is the window size and N is the total number of records in the dataset. Compared to the pairwise comparison algorithm, the adaptive detection method is expected to be much more efficient.

V. EXPERIMENTS

In this section, we aim to test the effectiveness and the efficiency of the proposed adaptive detection algorithm. Experiments are conducted to answer the following questions.

1) Will the detection accuracy be maintained when employing the adaptive detection algorithm?
2) Can the adaptive detection algorithm detect deceptive identity records that contain missing values?
3) How does the adaptive detection algorithm perform with large datasets?

A. Performance Matrix

Algorithm performance is measured in terms of detection effectiveness and efficiency.

1) Detection Accuracy: We evaluate the algorithm's detection accuracy by using three kinds of measures, namely: 1) recall; 2) precision; and 3) F-measure. These measures are widely used in information retrieval [34]. Precision, in this scenario, is defined as the percentage of correctly detected deceptive identities among all deceptive identities suggested by the algorithm. Recall is the percentage of deceptive identities correctly identified. F-measure is a well-accepted single measure that combines recall and precision.

Suppose a set of identities D contains m unique individuals and each individual has at least one identity. Each individual may have a set of different identities denoted as D_i (1 ≤ i ≤ m and |D_i| ≥ 1). Let d_{ij} (1 ≤ i ≤ m, j ≥ 1) denote the jth identity of the ith individual. The detection algorithm groups all identities into n clusters based on identified identity deception. That is, deceptive identities that are considered as referring to the same individual by the detection algorithm are grouped into the same cluster. Each cluster identified by the algorithm is denoted as

C_k = \{ d_{ij} \mid d_{ij} \in D \text{ and } d_{ij} \text{ refers to the } k\text{th individual} \}    (3)

where k = 1, 2, ..., n. The clusters have the following properties:

C_k \cap C_{k'} = \emptyset \ (k \neq k'), \qquad \bigcup_{k} C_k = D.    (4)

Identities in the same cluster are considered to refer to the same person, whereas identities in different clusters are considered irrelevant. To make the performance measures of clustering results comparable to those of the pairwise comparison method, we convert the clustering results to a matrix like the one generated by the pairwise comparison method. For example, suppose person A has two different identities {A1, A2}, whereas person B has three identities {B1, B2, B3}. Suppose the adaptive detection algorithm identifies two clusters, namely: {A1, A2, B1} and {B2, B3}. A pairwise comparison matrix is constructed from the clusters as shown in Fig. 3. Each superdiagonal element in the matrix represents the comparison result between two identity records. It is labeled as one when the two identity records are grouped in the same cluster by the algorithm; otherwise, it is labeled as zero. This yields the four outcomes defined in Table I. In this example, we have TP = 2, FP = 2, TN = 4, and FN = 2.

Fig. 3. Pairwise comparison matrix constructed from the two clusters.

TABLE I. CLASSIFICATION OF ALGORITHM OUTCOMES

Based on the algorithm outcomes, we compute recall and precision as follows:

\text{Recall} = \frac{TP}{TP + FN}    (5)

\text{Precision} = \frac{TP}{TP + FP}.    (6)

F-measure is defined as

F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.    (7)
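The evaluation just described can be reproduced with a short routine that converts the predicted clusters into pairwise decisions and tallies the four outcomes of Table I. This is a hypothetical sketch, not the evaluation code used in the experiments; the example data encode the {A1, A2, B1} and {B2, B3} clusters above and reproduce TP = 2, FP = 2, TN = 4, and FN = 2.

```java
import java.util.Arrays;
import java.util.List;

// Pairwise evaluation per Eqs. (5)-(7) from predicted clusters and true person IDs.
public final class PairwiseEvaluation {

    static int clusterOf(int record, List<int[]> clusters) {
        for (int c = 0; c < clusters.size(); c++) {
            for (int r : clusters.get(c)) if (r == record) return c;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Identities 0..4 = A1, A2, B1, B2, B3; truth[i] is the real person.
        int[] truth = {0, 0, 1, 1, 1};
        // Clusters produced by the algorithm: {A1, A2, B1} and {B2, B3}.
        List<int[]> predicted = Arrays.asList(new int[]{0, 1, 2}, new int[]{3, 4});

        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < truth.length; i++) {
            for (int j = i + 1; j < truth.length; j++) {
                boolean samePerson  = truth[i] == truth[j];
                boolean sameCluster = clusterOf(i, predicted) == clusterOf(j, predicted);
                if (sameCluster && samePerson) tp++;
                else if (sameCluster) fp++;
                else if (samePerson) fn++;
                else tn++;
            }
        }
        double recall    = (double) tp / (tp + fn);                       // Eq. (5)
        double precision = (double) tp / (tp + fp);                       // Eq. (6)
        double fMeasure  = 2 * precision * recall / (precision + recall); // Eq. (7)
        System.out.printf("TP=%d FP=%d TN=%d FN=%d  R=%.2f P=%.2f F=%.2f%n",
                tp, fp, tn, fn, recall, precision, fMeasure);
    }
}
```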
2) Efficiency and Scalability: Efficiency is measured by the number of comparisons that the algorithm requires to detect all deceptive identities within a dataset. Algorithm completion time is a supplementary efficiency measure.

According to the Longman Web Dictionary, scalability of an algorithm can be defined as the degree to which the algorithm becomes more efficient as the data volume increases. We define scalability to be proportional to the number of identities processed per unit of time, i.e.,

\text{Scalability} \propto \frac{\text{Number of records in a dataset}}{\text{Completion time}}.    (8)

B. Experimental Design

In our experiments, we compared the performance of the proposed adaptive detection algorithm with that of the record comparison algorithm. We did not compare with the performance of other deception detection techniques because they are not directly comparable. We aim to examine how the algorithm's performance improves when incorporating techniques that handle the problems of missing values and large volumes of data. We expect that those techniques developed in the proposed algorithm will also apply to other computational deception detection techniques reviewed in Section II-B and will improve their performance.
The datasets of deceptive identities used in our experiments were manually extracted by our police detective expert, who has served in law enforcement for 30 years. The sampling method the expert used was convenience sampling, in which he looked through the list of all identity records and chose the deceptive identity records that he ran into. Because deceptive identities are sparsely distributed in the criminal database, convenience sampling is more feasible than random sampling for locating deceptive identity records for experimental purposes.

1) Test Bed: We chose criminal identity records stored in the TPD as our test bed. According to the U.S. Census Bureau, Tucson's population ranked 30th among U.S. cities with populations of 100 000 and greater. The Federal Bureau of Investigation also reported that Tucson's crime index ranked 20th highest among U.S. cities in 2001 and was higher than the national average. Therefore, data kept in the TPD are representative of those stored in other agencies in terms of variety and data volume.

The TPD maintains about 1.3 million person records in its database. Each record uniquely identifies a person by a set of identity attributes. In this experiment, we only focus on the four attributes in which identity deception usually occurs, namely: 1) name; 2) address; 3) DOB; and 4) SSN. The name attribute of each identity record is mandatory in the TPD and always has a value. We found a large number of missing values in the other three attributes; 76% of these records contain missing values in at least one attribute. Among these incomplete records, we found that 42% contain one missing attribute, 29% have two missing attributes, and 4% of the records were missing all attribute values except for name. The distribution of the different missing types is shown in Table II. Certain missing types, such as address-missing, DOB-missing, and address-DOB-missing, are rare in the TPD database. Inasmuch as all fields except name can be missing in the TPD database, we chose name as the sorting key for the adaptive detection algorithm in hypotheses testing.

TABLE II. DIFFERENT MISSING TYPES IN IDENTITY RECORDS OF THE TPD

2) Hypotheses Testing: We expect the proposed adaptive detection algorithm, as compared with the pairwise record comparison algorithm, to improve efficiency in detecting deceptive identities without losing detection accuracy. Although we do not expect detection accuracy to be maintained when a dataset has several missing attributes and a large percentage of missing values, we want to find out what circumstances could cause significantly lower accuracy rates for incomplete datasets. We also aim to find out whether the adaptive detection algorithm can find deceptive identities within an acceptable time (e.g., in minutes) when the dataset is large (e.g., on the order of millions). The hypotheses for testing the above objectives are discussed below.

a) Evaluating accuracy and efficiency: We compare the performance of the adaptive detection algorithm with that of the record comparison algorithm. Two hypotheses are proposed to compare the efficiency and the detection accuracy of the two algorithms. We use statistical t-tests in the comparisons to indicate the significance of any differences.

— Hypothesis 1 (H1): There is no significant difference in detection effectiveness between the adaptive detection algorithm and the record comparison algorithm.
— Hypothesis 2 (H2): There is no significant difference in detection efficiency between the adaptive detection algorithm and the record comparison algorithm.

• Testing dataset: A police detective with 30 years of experience helped us identify 210 deceptive criminal identity records from the TPD database. The dataset involved 75 criminal individuals, each of whom had an average of three identity records. These identity records contain no missing values. All the addresses were manually converted to a standard format consisting of a street number, a street direction, a street name, and a street type.
• Testing procedure: A ten-fold validation method was employed to validate the performance of the two algorithms. The dataset was randomly and equally divided into ten folds. Each time, we used nine folds for training and one fold for testing. In each training session, we determined the optimal threshold that distinguished between similar (i.e., deceptive) and dissimilar (i.e., irrelevant) records, namely, the threshold at which the highest F-measure was achieved; a sketch of this threshold-selection step is given below. The threshold was then applied to the subsequent testing session. Accuracy measures, as well as the number of comparisons and the completion time, were recorded for each testing session. Performance measures of the two algorithms were compared using a statistical t-test.
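As referenced in the testing procedure above, the following is a sketch of the threshold-selection step under simple assumptions: candidate thresholds are swept over labeled training pairs (disagreement value plus a same-person label) and the value yielding the highest F-measure is retained. The pair data below are made up for illustration and are not taken from the TPD dataset.

```java
// Sketch of threshold training by maximizing F-measure on labeled pairs.
public final class ThresholdTraining {

    static double fMeasure(double threshold, double[] d, boolean[] samePerson) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < d.length; i++) {
            boolean predictedMatch = d[i] < threshold;   // below threshold => deceptive match
            if (predictedMatch && samePerson[i]) tp++;
            else if (predictedMatch) fp++;
            else if (samePerson[i]) fn++;
        }
        if (tp == 0) return 0.0;
        double recall = (double) tp / (tp + fn);
        double precision = (double) tp / (tp + fp);
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double[] d           = {0.05, 0.12, 0.22, 0.35, 0.48, 0.60};  // pair disagreement values
        boolean[] samePerson = {true, true, true, false, false, false};
        double best = 0.0, bestF = -1.0;
        for (double t = 0.05; t <= 0.95; t += 0.05) {                 // sweep candidate thresholds
            double f = fMeasure(t, d, samePerson);
            if (f > bestF) { bestF = f; best = t; }
        }
        System.out.println("optimal threshold = " + best + ", F = " + bestF);
    }
}
```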
b) Evaluating the effects of missing values: We compare the detection accuracy of the algorithm when using a complete dataset and when using an incomplete dataset. Again, t-tests were used to indicate whether there was a significant difference in the algorithm's detection accuracy. To examine how different types of incomplete datasets may affect the algorithm's detection accuracy, we varied the missing attribute(s) (i.e., attributes where missing values may occur) in the dataset and the percentage of incomplete records in the dataset. We learned from the TPD database that identity records missing more than two attribute values are rare. Therefore, we tested with incomplete datasets having no more than two attributes containing missing values.

— Hypothesis 3 (H3): With the adaptive detection algorithm, there is no significant difference in detection effectiveness between identities having all attribute values and identities having at most two missing attribute values.
• Testing datasets: First, we conducted experiments using artificial incomplete datasets. In the TPD database, deceptive identities with certain missing attributes (e.g., DOB-missing or address-DOB-missing) are rare. With artificially generated incomplete datasets, we constructed various types of incomplete datasets by adjusting the composition of missing attributes as well as the percentage of incomplete records in each dataset. Incomplete datasets were derived from the complete dataset used in the previous experiment. For each dataset, we randomly chose a percentage (from 10% to 90%, with an increment of 10%) of records from which we removed values in the intended missing attribute(s). Second, we used a real incomplete dataset that was directly extracted from the TPD database by our police detective. Our intention is to avoid any systematic errors that might be caused by the artificially generated incomplete datasets. From the TPD database, we were able to draw a dataset of 210 deceptive records in which missing values occurred in SSN only. Deceptive records missing values in other fields were not identified, either because certain missing types (e.g., address-missing, DOB-missing) were rare in the TPD database or because the police expert was not able to identify deceptive identities based on the limited available values (e.g., SSN-address-missing and SSN-DOB-missing).
• Testing procedure: For each missing type, we tested the proposed algorithm for several iterations, each of which had a different percentage (ranging from 10% to 90%) of missing values in the dataset for the intended field(s). During each iteration, we used a ten-fold validation method to test the algorithm's detection accuracy. As in the previous experiments, an optimal threshold value was determined when the highest F-measure was achieved during the training session. The detection accuracy measures of the algorithm were recorded during the testing session. T-tests were used to compare F-measures achieved by the algorithm using incomplete datasets to those acquired using a complete dataset.

c) Evaluating scalability: In terms of scalability, we compare the adaptive detection algorithm to the record comparison algorithm when detecting deception in large datasets (e.g., on the order of millions).

— Hypothesis 4 (H4): There is no significant difference in scalability between the adaptive detection algorithm and the record comparison algorithm.

• Testing datasets: We randomly selected 10 000 criminal identity records from the TPD database as the starting dataset for our scalability testing. We then increased the size of the selection by 10 000 at a time until all identity records in the TPD database (about 1.3 million) were included.
• Testing procedure: For each selected dataset, we detected deceptive identities using the adaptive detection algorithm and the record comparison algorithm, respectively. The scalability of each algorithm, as defined earlier, was computed for each test. A t-test was performed to compare the scalability difference between the two algorithms over different sizes of datasets.

VI. RESULTS AND DISCUSSIONS

A. Effectiveness of the Adaptive Detection Algorithm (H1 and H2)

Table III shows the detection accuracy, in terms of F-measure, achieved by the adaptive detection algorithm and the record comparison algorithm, respectively. A t-test showed that there was no significant difference between the two algorithms (p-value = 0.659).

TABLE III. COMPARISON BETWEEN DETECTION EFFECTIVENESS OF THE ADAPTIVE DETECTION ALGORITHM AND THE RECORD COMPARISON ALGORITHM. (a) ALGORITHM EFFECTIVENESS IN TERMS OF F-MEASURE. (b) ALGORITHM EFFICIENCY IN TERMS OF NUMBER OF COMPARISONS AND COMPLETION TIME

Algorithm efficiency measures achieved by the two algorithms, in terms of number of comparisons and completion time, are also listed in Table III. H2 was also tested with a t-test and was rejected at a significant level (p-value < 0.05). The result shows that the adaptive detection algorithm is more efficient than the pairwise record comparison algorithm.
B. Adaptive Detection Algorithm in Handling Missing Values (H3)

1) Testing With Artificially Generated Missing Values: We used the p-values of t-tests to indicate whether there was a significant difference in detection accuracy between using a complete dataset and using a dataset that contained a certain percentage of missing values in certain attributes. For each type of incomplete dataset (i.e., values missing in certain attributes), we plotted p-values against the percentage of incomplete identity records contained in a dataset to indicate significant changes in the algorithm's effectiveness. The effect of the amount of missing values on detection accuracy is clearly visible.

Fig. 4. Performance comparison between the complete dataset and the datasets missing values in one attribute (σ is the significance level of the t-test).

P-values in Fig. 4 indicate the adaptive detection algorithm's performance differences between using a complete dataset and using a dataset in which identity records contain missing values for one attribute. When values were only missing for SSN, the detection accuracy (F-measure) of the adaptive detection algorithm did not significantly decrease if the percentage of incomplete records was less than 30%. Similarly, when values were only missing for DOB, the detection accuracy of the adaptive detection algorithm did not decrease significantly if the percentage of incomplete records was less than 18%. However, there were significant variations in the detection accuracy when values were missing in the address attribute, regardless of the percentage of incomplete records.

Fig. 5. Performance comparison between the complete dataset and the datasets missing values in two attributes (σ is the significance level of the t-test).

P-values in Fig. 5 show the adaptive detection algorithm's performance differences between using a complete dataset and using a dataset in which identity records contain missing values for two attributes. When values were missing exclusively in SSN and DOB, the detection accuracy of the adaptive detection algorithm did not significantly decrease if the percentage of incomplete records was less than 12%. Similar to the one-attribute-missing case, detection accuracy varied when there were missing values in the address field.

To explain why the existence of missing values in the address field brought variations to the algorithm's detection accuracy, we examined the characteristics of address values in the complete dataset and compared them with those of the SSN and the DOB. For each attribute, the distribution of disagreement values between related identities (i.e., different identities referring to the same individual) is shown in Fig. 6. We noticed that the distribution for the address attribute is very different from that for DOB or SSN. DOB and SSN both have a skewed distribution such that identities pointing to the same person mostly have very similar DOB or SSN values. Address, however, has a bipolar distribution of disagreement values. In our dataset, identities of the same individual sometimes have similar address values and sometimes have very different address values. Such a difference between address and the other two attributes might explain the difference in the algorithm's detection accuracy.

2) Testing With Real Missing Values: This dataset extracted from the TPD database had missing values in the SSN field only.
TABLE IV. DETECTION PERFORMANCE WITH REAL MISSING VALUES. (a) DETECTION PERFORMANCE WITH RECORDS CONTAINING REAL MISSING VALUES. (b) DETECTION PERFORMANCE WITH COMPLETE RECORDS

Fig. 6. Distribution of disagreement values on each attribute.

As shown in Table IV, the adaptive detection algorithm was able to achieve, on average, a high precision of 93.7% and a recall of 73.6%. Compared to the detection performance using complete records, the detection precision decreased for records with values missing in SSN. However, there was a significant decrease in the detection recall, which led to a significant drop in the overall F-measure. Two possible reasons may cause low detection recall, namely: either two identity records of the same individual are located too far apart (e.g., much farther apart than the size of the sliding window in the adaptive detection algorithm), or the threshold value is too strict in determining deceptive identities.

C. Efficiency and Scalability (H4)

Scalability measures of the two algorithms are shown in Fig. 7. The adaptive detection algorithm took 6.5 min to finish detecting deceptive identities in 1.3 million records. As the data volume increased, it maintained a gentle slope in the time it needed to finish detection. Note that the 6.5 min did not include the sorting time. Sorting was performed within the database; it would add very minor overhead to the overall running time if the database were appropriately indexed. However, the detection time of the record comparison algorithm increased dramatically; it would have spent 87 days on the same task. Both algorithms were implemented in Java. Experiments were conducted on an HP PC with a Pentium III 800-MHz CPU and 256-MB RAM.

Fig. 7. Efficiency and scalability performance. (a) Scalability of the adaptive detection algorithm. (b) Scalability of the record comparison algorithm.

D. Case Study

To further evaluate the implications of our proposed algorithm, we tested it with another real dataset provided by the Pima County Sheriff's Department (PCSD). The PCSD serves
998 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS, VOL. 36, NO. 5, SEPTEMBER 2006

330 000 people living in the seventh largest county in the nation. conclusion that two records, in which only the first name “John”
We consider it as a representative of law enforcement agencies was recorded, would have the same probability of describing
in the U.S. The PCSD dataset contained over 1.3 million the same person as two records, in which all of the fields exist.
identity records. Residential address and SSN information was Intuitively, if name is the only available field to compare, one
not available in the dataset and was considered missing. We can only judge the probability that two identities describe the
ignored those records that only had names because it is not same person solely by the names. However, the confidence in
reliable to determine deception solely by names. There were the match increases as more fields are available to compare.
700 686 identity records remaining in the testing dataset, each One of the intentions of our proposed algorithm is to
of which has values in the attributes of first name, last name, avoid pairwise comparisons, so that detection efficiency can be
and DOB. With a window size of 10, our algorithm was improved. However, detection effectiveness may be affected,
able to identify 16 912 clusters. Identities of each cluster were whereas the efficiency is improved under the assumption that
considered to refer to the same person. We randomly chose two identities of the same individual sorted by an appropri-
20 clusters and asked our police detective expert to evaluate ate key are located close to each other. That assumption is,
each of them. The expert from the TPD confirmed that 11 out of however, not guaranteed. It is possible that the two identities
20 clusters were correctly grouped. There were six clusters that are located too far apart to be grouped into the same cluster.
the expert from the TPD could not verify because of limited Although the algorithm did not cause a significant drop of
information. Three clusters were incorrectly clustered due to detection efficacy in our experiments, we will consider more
the use of common names and similar DOBs. advanced clustering algorithms such as mixture models to avoid
The expert from the TPD found this algorithm useful in the assumption in future work.
finding both deceptive identity records and records that have In addition to detecting intentional deception, both record
data errors such as misspellings. Currently, the record manage- comparison algorithm and the proposed adaptive detection
ment system used by this agency is not able to automatically algorithm are capable of dealing with identity records having
group the identity records that refer to the same person. The six unintentional data errors such as misspellings. It might be
clusters that the expert from the TPD was unable to verify could interesting to differentiate between the patterns of deception
also be useful in providing additional leads during investigation and errors. However, we do not perceive any difference in terms
processes. of the algorithm’s effectiveness.
In the future, we intend to consider other identity-related
VII. C ONCLUSION AND F UTURE W ORK information, such as biometrics, behavior characteristics, and
social context. A good example of behavior characteristics
In this paper, we discussed algorithmic approaches to auto- is MO, which is often used to identify a criminal in crime
matically detecting criminal identity deception. We proposed investigation. The social context is a set of characteristics of
an adaptive detection algorithm that improved the record com- the social system that a person usually lives. These types
parison algorithm in terms of efficiency, scalability, and abil- of information can also be helpful in determining a person’s
ity to handle incomplete identities. Experiments showed that identity. The core function of our proposed algorithm is to
the proposed algorithm greatly improved detection efficiency combine the disagreement measure of each of the four attributes
and achieved detection accuracy comparable with that of the and to determine the disagreement (or similarity) between two
pairwise record comparison algorithm. Our experiments also identity records. It is open to include more identity attributes
showed that the detection accuracy of the adaptive detection when a disagreement measure can be defined for each attribute.
algorithm was not affected when there was a small percentage A more comprehensive model that encompasses more identity
of attribute values missing (less than 30% for missing values attributes is desirable in future research.
on SSN or less than 18% for missing values on DOB). In cases The proposed automated deception detection system will be
where there was a larger percentage of attribute values missing, incorporated into our ongoing COPLINK project [19], which
the adaptive detection algorithm could still maintain detection has been under development at the University of Arizona’s
precision of around 95%. Artificial Intelligence Lab, in collaboration with the TPD, and
However, limitations exist in this paper. The testing dataset PCSD, since 1997. Such a system can also be used in merging
is relatively small. The changing data characteristics of the customer profiles for marketing purposes.
In addition to detecting intentional deception, both the record comparison algorithm and the proposed adaptive detection algorithm are capable of handling identity records that contain unintentional data errors such as misspellings. It might be interesting to differentiate between the patterns of deception and those of unintentional errors; however, we do not perceive any difference in terms of the algorithm's effectiveness.

In the future, we intend to consider other identity-related information, such as biometrics, behavior characteristics, and social context. A good example of a behavior characteristic is the modus operandi (MO), which is often used to identify a criminal during crime investigation. Social context is the set of characteristics of the social system in which a person usually lives. These types of information can also be helpful in determining a person's identity. The core function of our proposed algorithm is to combine the disagreement measures of the four attributes into an overall disagreement (or similarity) measure between two identity records, and it is open to additional identity attributes whenever a disagreement measure can be defined for each of them. A more comprehensive model that encompasses more identity attributes is desirable in future research.
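As an illustration of this extensibility only (with placeholder attribute names, toy disagreement functions, and a simple weighted average as the combination rule rather than the measures defined earlier in the paper), the sketch below combines per-attribute disagreements into one overall score with equal weights by default, skips attributes whose values are missing, and accepts new attributes as soon as a disagreement function is registered for them.

```python
def exact_mismatch(a, b):
    """Toy measure: 0 when two values agree exactly, 1 otherwise."""
    return 0.0 if a == b else 1.0

# One entry per identity attribute; adding an attribute (e.g., an MO code)
# only requires defining a disagreement function and registering it here.
DISAGREEMENT_MEASURES = {
    "name": exact_mismatch,     # a string-distance measure in practice
    "dob": exact_mismatch,
    "ssn": exact_mismatch,
    "address": exact_mismatch,
}

def overall_disagreement(rec1, rec2, measures=DISAGREEMENT_MEASURES, weights=None):
    """Sketch: combine per-attribute disagreements into one overall score.

    With weights=None every available attribute receives an equal weight;
    attributes missing from either record are skipped so that incomplete
    identities can still be compared. Returns None if nothing is comparable.
    """
    if weights is None:
        weights = {attr: 1.0 for attr in measures}
    total, weight_sum = 0.0, 0.0
    for attr, measure in measures.items():
        v1, v2 = rec1.get(attr), rec2.get(attr)
        if v1 is None or v2 is None:
            continue                      # skip missing values
        w = weights.get(attr, 1.0)
        total += w * measure(v1, v2)
        weight_sum += w
    return total / weight_sum if weight_sum else None
```

For example, comparing {"name": "J. Smith", "dob": "1970-01-01"} with {"name": "J. Smith", "dob": None} uses only the name attribute, and passing a weights dictionary with a smaller value for "address" corresponds to the alternative weighting schema mentioned above.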
The proposed automated deception detection system will be incorporated into our ongoing COPLINK project [19], which has been under development at the University of Arizona's Artificial Intelligence Lab, in collaboration with the TPD and the PCSD, since 1997. Such a system can also be used to merge customer profiles for marketing purposes.

REFERENCES

[1] Identity Fraud: A Study. (2002). London, U.K.: Home Office. [Online]. Available: http://www.homeoffice.gov.uk/cpd/id_fraud-report.pdf
[2] P. D. Allison, Missing Data. Thousand Oaks, CA: Sage, 2001.
[3] A. S. J. Aubry, Criminal Interrogation, 3rd ed. Springfield, IL: Charles C. Thomas, 1980.
[4] A. B. Badiru, J. M. Karasz, and B. T. Holloway, "AREST: Armed robbery eidetic suspect typing expert system," J. Police Sci. Admin., vol. 16, no. 3, pp. 210–216, Sep. 1988.
[5] D. E. Brown and S. Hagen, "Data association methods with applications to law enforcement," Decis. Support Syst., vol. 34, no. 4, pp. 369–378, 2003.
[6] S. F. Buck, "A method of estimating missing values in multivariate data suitable for use with an electronic computer," J. R. Statist. Soc., vol. B22, no. 2, pp. 302–306, 1960.
[7] J. K. Burgoon, D. B. Buller, L. K. Guerrero, W. Afifi, and C. Feldman, "Interpersonal deception: XII. Information management dimensions underlying deceptive and truthful messages," Commun. Monogr., vol. 63, no. 1, pp. 50–69, Mar. 1996.
[8] K. C. Clarke, Getting Started With Geographic Information Systems, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999.
[9] R. Clarke, "Human identification in information systems: Management challenges and public policy issues," Inf. Technol. People, vol. 7, no. 4, pp. 6–37, Dec. 1994.
[10] J. Cohen, "Errors of recall and credibility: Can omissions and discrepancies in successive statements reasonably be said to undermine credibility of testimony?" Med.-Leg. J., vol. 69, no. 1, pp. 25–34, 2001.
[11] B. M. DePaulo and R. L. Pfeifer, "On-the-job experience and skill at detecting deception," J. Appl. Soc. Psychol., vol. 16, no. 3, pp. 249–267, 1986.
[12] J. S. Donath, "Identity and deception in the virtual community," in Communities in Cyberspace, P. Kollock and M. Smith, Eds. London, U.K.: Routledge, 1998.
[13] P. Ekman and M. O'Sullivan, "Who can catch a liar?" Amer. Psychol., vol. 46, no. 9, pp. 913–920, Sep. 1991.
[14] P. Ekman, Telling Lies: Clues to Deceit in the Marketplace, Politics and Marriage, 3rd ed. New York: Norton, 1992.
[15] GAO, "Law enforcement: Information on timeliness of criminal fingerprint submissions to the FBI," U.S. Gen. Accounting Off. (GAO), Washington, DC, GAO-04-260, 2004.
[16] M. Glasser, "Linear regression analysis with missing observations among the independent variables," J. Amer. Statist. Assoc., vol. 59, no. 307, pp. 834–844, Sep. 1964.
[17] S. O. Gyimah, "Missing data in quantitative social research," Dept. Sociology, Univ. Western Ontario, London, ON, Canada, Rep. 01-14, 2001.
[18] Y. Haitovsky, "Missing data in regression analysis," J. R. Statist. Soc., vol. B30, no. 1, pp. 67–82, 1968.
[19] R. V. Hauck, H. Atabakhsh, P. Ongvasith, H. Gupta, and H. Chen, "Using COPLINK to analyze criminal-justice data," Computer, vol. 35, no. 3, pp. 30–37, Mar. 2002.
[20] M. A. Hernandez and S. J. Stolfo, "The merge/purge problem for large databases," in Proc. ACM SIGMOD Int. Conf. Management Data, San Jose, CA, 1995, pp. 127–138.
[21] ——, "Real-world data is dirty: Data cleansing and the merge/purge problem," Data Mining Knowl. Discov., vol. 2, no. 1, pp. 9–37, 1998.
[22] G. Jones. (2001). E-Commerce and Identity Fraud, Nottingham, U.K.: Experian Co. [Online]. Available: http://press.experian.com/documents/e-comm.pdf
[23] J. Kim and J. Curry, "The treatment of missing data in multivariate analysis," Sociol. Methods Res., vol. 6, no. 2, pp. 206–240, 1977.
[24] G. Kohnken, "Training police officers to detect deceptive eyewitness statements: Does it work?" Soc. Behav., vol. 2, no. 1, pp. 1–17, 1987.
[25] R. E. Kraut and D. Poe, "On the line: The deception judgements of customs inspectors and laymen," J. Pers. Soc. Psychol., vol. 39, no. 5, pp. 784–798, 1980.
[26] V. L. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Phys. Doklady, vol. 10, no. 8, pp. 707–710, Feb. 1966.
[27] W. L. Low, M. L. Lee, and T. W. Ling, "A knowledge-based approach for duplicate elimination in data cleaning," Inf. Syst., vol. 26, no. 8, pp. 585–606, Dec. 2001.
[28] A. E. Monge, "Adaptive detection of approximately duplicate database records and the database integration approach to information discovery," Ph.D. dissertation, Dept. Comput. Sci. Eng., Univ. California, San Diego, 1997.
[29] A. E. Monge and C. P. Elkan, "An efficient domain-independent algorithm for detecting approximately duplicate database records," in Proc. ACM-SIGMOD Workshop Research Issues Knowledge Discovery Data Mining, Tucson, AZ, 1997, pp. 23–29.
[30] L. Myrtveit, E. Stensrud, and U. H. Olsson, "Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods," IEEE Trans. Softw. Eng., vol. 27, no. 11, pp. 999–1013, Nov. 2001.
[31] J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, no. 1, pp. 81–106, 1986.
[32] ——, "Unknown attribute values in induction," in Proc. 6th Int. Machine Learning Workshop, 1989, pp. 164–168.
[33] D. B. Rubin, Multiple Imputation for Nonresponse in Surveys. New York: Wiley, 1987.
[34] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Reading, MA: Addison-Wesley, 1988.
[35] J. L. Schafer, Analysis of Incomplete Multivariate Data. London, U.K.: Chapman & Hall, 1997.
[36] H. Timm and F. Klawonn, "Different approaches for fuzzy cluster analysis with missing values," presented at the 7th Eur. Congr. Intelligent Techniques and Soft Computing, Aachen, Germany, 1999.
[37] S. Toth. (2003). Need Fuels Demand for False IDs: For Jobs, Documents are the Key, South Bend, IN: South Bend Tribune. [Online]. Available: http://www.southbendtribune.com/stories/2003/07/27/local.20030727-sbt-FULL-A1-Need_fuels_demand_fo.sto
[38] A. Vrij, Detecting Lies and Deceit: The Psychology of Lying and the Implication for Professional Practice. Hoboken, NJ: Wiley, 2000.
[39] G. Wang, H. Chen, and H. Atabakhsh, "Automatically detecting deceptive criminal identities," Commun. ACM, vol. 47, no. 3, pp. 71–76, Mar. 2004.
[40] A. P. White, W. Z. Liu, M. T. Hallissey, and J. W. L. Fielding, "A comparison of two classification techniques in screening for gastro-esophageal cancer," in Proc. Appl. Innovations Expert Syst. IV, 1996, pp. 83–97.

G. Alan Wang (S'05) received the B.S. degree in industrial management engineering from Tianjin University, Tianjin, China, in 1995, and the M.S. degree in industrial engineering from Louisiana State University, Baton Rouge, in 2001. He is currently working toward the Ph.D. degree in management information systems at the University of Arizona, Tucson.
He has published papers in the Communications of the ACM, the Journal of the American Society for Information Science and Technology, IEEE Computer, and Group Decision and Negotiation. His research interests include data heterogeneity and uncertainty, data mining, and knowledge management.

Hsinchun Chen (M'92–SM'04–F'06) received the Ph.D. degree in information systems from New York University, New York, in 1989.
He is currently a McClelland Endowed Professor at the Department of Management Information Systems, University of Arizona, Tucson. He has authored or coauthored over 70 papers concerning semantic retrieval, search algorithms, knowledge discovery, and collaborative computing. He is an expert in digital library and knowledge management research, and his research has been featured in scientific and information technology publications.

Jennifer J. Xu received the M.S. degree in computer science and the M.A. degree in economics from the University of Mississippi, University, in 1999 and 2000, respectively. She is currently working toward the Ph.D. degree in management information systems.
She is currently an Assistant Professor of computer information systems at Bentley College, Waltham, MA. Her research interests include knowledge management, social network analysis, information retrieval, human–computer interaction, and information visualization.

Homa Atabakhsh received the M.S. and Ph.D. degrees from the University of Toulouse, Toulouse, France, in 1984 and 1987, respectively, both in computer science.
She was an Assistant Professor with the University of Toulouse from January 1988 to January 1989. From January 1989 to 1996, she was a Research Scientist at the National Research Council of Canada, Ottawa, ON, Canada, where she worked in areas such as knowledge-based systems, object-oriented design and programming, graphical user interfaces, and applications in manufacturing and business. She has also been an Adjunct Lecturer at the University of Ottawa. Currently, she is the Associate Director of the COPLINK Center for Excellence and a Principal Research Specialist at the Department of Management Information Systems, University of Arizona, Tucson.