Dietmar Seipel
Michael Hanus
Salvador Abreu (Eds.)
LNAI 10997
Declarative Programming
and Knowledge Management
Conference on Declarative Programming, DECLARE 2017
Unifying INAP, WFLP, and WLP
Würzburg, Germany, September 19–22, 2017
Revised Selected Papers
Editors

Dietmar Seipel
Universität Würzburg
Würzburg, Germany

Michael Hanus
Christian-Albrechts-Universität zu Kiel
Kiel, Germany

Salvador Abreu
Universidade de Évora
Évora, Portugal
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This volume contains a selection of the papers presented at the International Conference on Declarative Programming, Declare 2017. The joint conference was held in Würzburg, Germany, during September 19–22, 2017. It consisted of the 21st International Conference on Applications of Declarative Programming and Knowledge Management (INAP), the 31st Workshop on Logic Programming (WLP), and the 25th Workshop on Functional and (Constraint) Logic Programming (WFLP), and it was
accompanied by a one-week summer school on Advanced Concepts for Databases and
Logic Programming for students and PhD students.
Declarative programming is an advanced paradigm for modeling and solving complex problems, which has attracted increased attention over the last decades, e.g., in the domains of data and knowledge engineering, databases, artificial intelligence, natural language processing, modeling and processing combinatorial problems, and for establishing knowledge-based systems for the web. The conference Declare 2017 aimed to promote the cross-fertilizing exchange of ideas and experiences among researchers and students from the different communities interested in the foundations, applications, and combinations of high-level, declarative programming and related areas.
The INAP conferences provide a forum for intensive discussions of applications of
important technologies around logic programming, constraint problem solving, and
closely related advanced software. They comprehensively cover the impact of programmable logic solvers in the Internet society, its underlying technologies, and leading-edge applications in industry, commerce, government, and societal services.
Previous INAP conferences have been held in Japan, Germany, Portugal, and Austria.
The Workshops on Logic Programming (WLP) are the annual meeting of the German
Society for Logic Programming (GLP e.V.). They bring together international
researchers interested in logic programming, constraint programming, and related areas
like databases and artificial intelligence. Previous WLP workshops have been held in
Germany, Austria, Switzerland, and Egypt. The International Workshop on Functional and Logic Programming (WFLP) brings together researchers interested in functional programming, logic programming, as well as the integration of these paradigms. Previous WFLP editions have been held in Germany, France, Spain, Italy, Estonia, Brazil, Denmark, and Japan. The topics of the papers of this year’s joint conference Declare
concentrated on three currently important fields: constraint programming and solving,
functional and logic programming, and declarative programming.
The declarative programming paradigm expresses the logic of a computation in an
abstract way. Thus, the semantics of a declarative language becomes easier to grasp for
domain experts. Declarative programming offers many advantages for data and knowledge engineering, such as security, safety, and shorter development times. During the
last couple of years, a lot of research has been conducted on the usage of declarative
systems in areas like answer set programming, reasoning, meta-programming, and
deductive databases. Reasoning about knowledge wrapped in rules, databases, or the
Semantic Web makes it possible to explore interesting hidden knowledge. Declarative techniques for
the transformation, deduction, induction, visualization, or querying of knowledge have the
advantage of high transparency and better maintainability compared to procedural
approaches.
Many problems which occur in large industrial tasks are intractable, ruling out their solution by exact algorithms or even by many approximate constructive algorithms. One
approach which has made substantial progress over the last few years is constraint
programming. Its declarative nature offers significant advantages, from a software
engineering standpoint and in the specification, implementation, and maintenance
phases. Several interesting aspects are in discussion: how can this paradigm be
improved or combined with known, classical methods; how can real-world situations
be modelled as constraint problems; what strategies may be pursued to solve a problem
once it has been specified; or what is the experience of applications in really large
industrial planning, simulation, and optimisation tasks?
Another area of active research is the use of declarative programming languages, in
particular, functional and logic languages, to implement more reliable software sys-
tems. The closeness of these languages to logical models provides new methods to test
and verify programs. Combining different programming paradigms is beneficial from a
software engineering point of view. Therefore, the extension of the logic programming
paradigm and its integration with other programming concepts are active research
branches. The successful extension of logic programming with constraints has already
been mentioned. The integration of logic programming with other programming
paradigms has been mainly investigated for the case of functional programming, so that
types, modules, higher-order operators, or lazy evaluation can also be used in
logic-oriented computations.
The three events INAP, WLP, and WFLP were jointly organized by the University
of Würzburg and the Society for Logic Programming (GLP e.V.). We would like to
thank all authors who submitted papers and all conference participants for the fruitful
discussions. We are grateful to the members of the Program Committee and the
external referees for their timely expertise in carefully reviewing the papers. We would
like to express our thanks to the German Federal Ministry of Education and Research
(BMBF) for funding the summer school on Advanced Concepts for Databases and
Logic Programming (under 01PL16019) and to the University of Würzburg for hosting
the conference in the new Central Lecture Building Z6 and for providing the Tuscany Hall in the Baroque-style Würzburg Residence Palace for a classical music concert in honor of Jack Minker, a pioneer in deductive databases and disjunctive logic programming and the longtime mentor of the first editor, who celebrated his 90th birthday
in 2017.
Program Chair
Dietmar Seipel University of Würzburg, Germany
Local Organization
Falco Nogatz University of Würzburg, Germany
Dietmar Seipel University of Würzburg, Germany
Additional Reviewers
Pedro Barahona
Zhuo Chen
Daniel Gall
Falco Nogatz
Nada Sharaf
1 Introduction
Constraint Satisfaction Problems (CSPs) allow modeling problems like the Costas Array problem [6], and some real-life problems like planning and scheduling [2], resource allocation [7], and route definition [3].
CPU parallelism is already being used successfully to speed up the solving of harder CSPs [5,16,19,21]. However, very few constraint solvers contemplate the use of GPUs. In fact, Jenkins et al. recently concluded that the execution model and the architecture of GPUs are not well suited to computations displaying irregular data access and code execution patterns, such as backtracking search [10].
We are currently developing a constraint solver named Parallel Heterogeneous Architecture Toolkit (PHACT), which already achieves state-of-the-art performance on multi-core CPUs and can also speed up the solving process by adding GPUs and processors like Intel Many Integrated Cores (MICs1) to solve the problems.
1. Intel MICs are coprocessors that combine many Intel processor cores onto a single chip with dedicated RAM.
© Springer Nature Switzerland AG 2018
D. Seipel et al. (Eds.): DECLARE 2017, LNAI 10997, pp. 3–19, 2018.
https://doi.org/10.1007/978-3-030-00801-7_1
4 P. Roque and V. Pedro
The next section introduces the main CSP concepts, and Sect. 3 presents some related work. Section 4 describes the architecture of PHACT, and in Sect. 5 the results achieved with PHACT, when solving some CSPs on multiple combinations of devices and when compared with some state-of-the-art solvers, are displayed and discussed. Section 6 presents the conclusions and directions for future work.
2 CSP Concepts
A CSP can be briefly described as a set of variables with finite domains, and a
set of constraints between the values of those variables. The solution of a CSP
is the assignment of one value from the respective domain to each one of the
variables, ensuring that all constraints are met [3].
For example, the Costas Array problem consists in placing n dots on an n × n matrix such that each row and column contains only one dot and all vectors between dots are distinct. It can be modeled as a CSP with n + n(n − 1)/2 variables, n of which correspond to the dots, each one mapped to a different matrix column. The domain of these n variables is composed of the integers that correspond to the matrix rows where each dot may be placed. The remaining n(n − 1)/2 variables constitute a difference triangle, whose rows cannot contain repeated values [6].
The methods for solving CSPs can be categorized as incomplete or complete.
Incomplete solvers do not guarantee that an existing solution will be found,
being mostly used for optimization problems and for large problems that would
take too much time to fully explore. Incomplete search is beyond the scope of
this paper and will not be discussed here. On the contrary, complete methods,
such as the one implemented in PHACT, guarantee that if a solution exists, it
will be found.
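As an illustration of a complete method, the following is a minimal sketch of backtracking search with propagation over finite domains. This is illustrative Python only: PHACT's actual engine (Sect. 4) is an OpenCL kernel, and the helper names here are invented.

```python
# Minimal complete CSP solver sketch (illustrative, not PHACT's code):
# backtracking search with support-based pruning over finite domains.
from itertools import product

def propagate(domains, constraints):
    """Prune values with no supporting combination; False if a domain empties."""
    changed = True
    while changed:
        changed = False
        for vs, pred in constraints:
            for i, v in enumerate(vs):
                others = [sorted(domains[u]) for u in vs]
                for val in list(domains[v]):
                    others[i] = [val]
                    if not any(pred(*combo) for combo in product(*others)):
                        domains[v].discard(val)
                        changed = True
                if not domains[v]:
                    return False
    return True

def solve(domains, constraints):
    """domains: dict var -> set of ints; constraints: list of (vars, predicate).
    Returns one solution as a dict, or None if none exists."""
    # Pick an unassigned variable (singleton domains count as assigned).
    var = next((v for v, d in domains.items() if len(d) > 1), None)
    if var is None:
        sol = {v: next(iter(d)) for v, d in domains.items()}
        if all(pred(*(sol[v] for v in vs)) for vs, pred in constraints):
            return sol
        return None
    for value in sorted(domains[var]):      # labeling
        trial = {v: set(d) for v, d in domains.items()}
        trial[var] = {value}
        if propagate(trial, constraints):   # propagation
            sol = solve(trial, constraints)
            if sol is not None:
                return sol
    return None                             # backtracking
```

For example, `solve({"x": {1, 2}, "y": {1, 2}}, [(("x", "y"), lambda a, b: a < b)])` yields the assignment x = 1, y = 2.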
3 Related Work
Searching for CSP solutions in a backtracking approach can be represented in the
form of a search tree. To take advantage of parallelism this search tree may be
split into multiple subtrees and each one of them explored in a different thread
that may be running on a different core, device or machine. This is the approach
generally found in parallel constraint solvers, which run on single or distributed
multi-core CPUs [5,16,19,21].
Pedro developed a CSP solver named Parallel Complete Constraint Solver
(PaCCS) capable of running from a single core CPU to multiple multi-core CPUs
in a distributed system [16]. Distributing the work among the threads through
work stealing techniques and using the Message Passing Interface (MPI) to allow
communication between machines, this solver achieved almost linear speedups
for most of the problems tested, when using machines with up to 16 CPU cores.
Régin et al. implemented Embarrassingly Parallel Search, featuring an interface responsible for decomposing an initial problem into multiple sub-problems,
filtering out those found to be inconsistent [20]. After generating the sub-problems, it creates multiple threads, each corresponding to an execution of a solver (e.g., Gecode [22]), to which one sub-problem at a time is sent for exploration.
For some optimization and search problems, where the full search space is explored, these authors achieved average gains of 13.8 and 7.7 against a sequential version, when using Gecode through their interface or just Gecode, respectively [20]. In their trials, the best results were achieved when decomposing the
initial problem into 30 sub-problems per thread and running 40 threads on a
machine with 40 CPU cores.
While solving CSPs through parallelization has been a subject of research for decades, the use of GPUs for that purpose is a recent area, and as such there are not many published reports of related work. To our knowledge, there are only two published papers related to constraint solving on GPUs [1,4]. Of these two, only Campeotto et al. presented a complete solver [4].
Campeotto et al. developed a CSP solver with Nvidia’s Compute Unified Device Architecture (CUDA), capable of using a CPU and an Nvidia GPU simultaneously to solve CSPs [4]. On the GPU, this solver implements an approach
different from the one mentioned before, namely, instead of splitting the search
tree over multiple threads, it splits each constraint propagation over multiple
threads. Constraints relating many variables are propagated on the GPU, while
the remaining constraints are filtered sequentially by the CPU. On the GPU,
the propagation and consistency check for each constraint are assigned to one
or more blocks of threads according to the number of variables involved. The
domain of each variable is filtered by a different thread.
Campeotto et al. reduced the data transfer to a minimum by transferring to
the GPU only the domains of the variables that were not labeled yet and the
events generated during the last propagation. Events identify the changes that
happened to a domain, like becoming a singleton or having a new maximum
value, which allows deciding on the appropriate propagator to apply.
Campeotto et al. obtained speedups of up to 6.61 with problems like the Langford problem and some real problems such as the modified Renault problem [4], when comparing a sequential execution on a CPU with the hybrid CPU/GPU version.
4 Solver Architecture
PHACT is a complete solver, capable of finding a solution for a CSP if one exists.
It is meant to be able to use all the (parallel) processing power of the devices
available on a system, such as CPUs, GPUs and MICs, to speed up solving
constraint problems.
The solver is composed of a master process which collects information about
the devices that are available on the machine, such as the number of cores and
the type of device (CPU, GPU or MIC), and calculates the number of sub-search spaces that will be created to distribute among those devices. For each device, a thread responsible for communicating with it is created on the host. PHACT is implemented with OpenCL, whose main concepts include:
– Compute unit. One or more processing elements and their local memory. In
Nvidia GPUs each Streaming Multiprocessor (SM) is a compute unit. AMD
GPUs have their own components called Compute Units that match this
definition. For CPUs and MICs, the number of available compute units is
normally equal to or higher than the number of threads that the device can
execute simultaneously [13];
– Kernel. The code that will be executed on the devices;
– Work-item. An instance of the kernel (thread);
– Work-group. Composed of one or more work-items that will be executed
on the same compute unit, in parallel. All work-groups for one kernel on one
device have the same number of work-items;
– Host. CPU where the application responsible for managing the execution of
the kernels is run;
– Device. A device where the kernels are executed (CPU, GPU, MIC).
In the implementation described here, the master process and the threads
responsible for communicating with the devices run on the OpenCL host and
the search engines run on the devices. The OpenCL host may also constitute a
device, in which case it will be simultaneously controlling and communicating
with the devices and running search engines. Each search engine corresponds to
a work-item, and all work-items execute the same kernel code, which implements
the search engine.
Search Space Splitting and Work Distribution
For distributing the work between the devices, PHACT splits the search space
into multiple sub-search spaces. Search-space splitting is effected by partitioning
the domains of one or more of the variables of the problem, so that the resulting
sub-search spaces partition the full search space. The number and the size of the
sub-search spaces thus created depend on the number of work-items which will
be used, and may go up to a few million.
Constraint Solving on Hybrid Systems 7
Example 1 shows the result of splitting the search space of a CSP with three
variables, V 1, V 2 and V 3, all with domain {1, 2}, into 4 sub-search spaces, SS1,
SS2, SS3 and SS4.
Example 1. Creation of 4 sub-search spaces
SS1 = {V 1 = {1}, V 2 = {1}, V 3 = {1, 2}}
SS2 = {V 1 = {1}, V 2 = {2}, V 3 = {1, 2}}
SS3 = {V 1 = {2}, V 2 = {1}, V 3 = {1, 2}}
SS4 = {V 1 = {2}, V 2 = {2}, V 3 = {1, 2}}
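The enumeration in Example 1 can be reproduced with a few lines of code. This is an illustrative sketch; `split_search_space` is an invented helper, not part of PHACT.

```python
# Illustrative sketch: split a search space by expanding the domains of the
# first `depth` variables into a Cartesian product, leaving the rest intact.
from itertools import product

def split_search_space(domains, depth):
    """domains: list of (name, sorted list of values); expand first `depth` vars."""
    expanded = [d for _, d in domains[:depth]]
    rest = domains[depth:]
    subspaces = []
    for combo in product(*expanded):
        sub = {name: [val] for (name, _), val in zip(domains[:depth], combo)}
        sub.update({name: list(d) for name, d in rest})
        subspaces.append(sub)
    return subspaces

csp = [("V1", [1, 2]), ("V2", [1, 2]), ("V3", [1, 2])]
subspaces = split_search_space(csp, 2)
# Four sub-search spaces, matching SS1-SS4 in Example 1.
```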
Since each device will have multiple search engines running in parallel, the
computed partition is organized into blocks of contiguous sub-search spaces that
will be handled by each device, one at a time. The number of sub-search spaces
that will compose each block will vary along the solving process and depends on
the performance of each device on solving the current problem.
The communicator threads running on the host launch the execution of the
search engines on the devices, hand each device one block of sub-search spaces to
explore, and coordinate the progress of the solving process as each device finishes
exploring its assigned block. The coordination of the devices consists in assessing
the state of the search, distributing more blocks to the devices, signaling to all
the devices that they should stop (when a solution has been found and only one
is wanted), or updating the current bound (in optimization problems).
Load Balancing
An essential aspect to consider when parallelizing some task is the balancing of
the work between the parallel components. Creating sub-search spaces with bal-
anced domains, when possible, is no guarantee that the amount of work involved
in exploring each of them is even similar. To compound the issue, we are dealing
with devices with differing characteristics and varying speeds, making it even
harder to statically determine an optimal, or even good, work distribution.
Achieving effective load balancing between devices with architectures as different as CPUs and GPUs is a complex task [10]. When trying to implement dynamic load balancing, two important OpenCL limitations arise: while a device is executing a kernel, it cannot communicate with other devices [8], and the execution of a kernel cannot be paused or stopped. Hence, techniques like work stealing [5,17], which require communication between threads, will not work with kernels that run independently on different devices, and load balancing must be done on the host side.
To better manage the distribution of work, the host could reduce the amount
of work it sends to the devices each time, by reducing the number of sub-search
spaces in each block. This would make the devices synchronize more frequently
on the host and allow for a finer control over the behavior of the solver. When
working with GPUs, though, the number and the size of data transfers between
the devices and the host should be as small as possible, because these are very
time consuming operations. So, a balance must be struck between the workload
of the devices and the amount of communication needed.
2. If a device takes less than one second to explore a block of search spaces, most of that time is spent communicating with the host and initializing its data structures.
Table 1. Example of the calculation of block sizes when using three devices

Device    Avg. time per sub-    Rank    Remaining sub-search    Size of the next block
          search space (ms)             spaces to explore       of sub-search spaces
Device 1  0.00125               0.625   1233482                 770926
Device 2  0.00236               0.331   462556                  153106
Device 3  0.01782               0.044   309450                  13616
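Table 1's figures are consistent with a simple heuristic: a device's rank is its normalized inverse average solving time, and its next block is its rank times the number of sub-search spaces it still has to explore. The sketch below reproduces the table's figures approximately under that assumption (an inference from the table, not PHACT's actual code).

```python
# Hedged sketch of the block-size heuristic suggested by Table 1.
# Assumption: rank_i = (1 / avg_time_i) / sum_j (1 / avg_time_j), and the
# next block for device i is rank_i * remaining_i.

def ranks(avg_ms):
    """Normalized inverse average time per sub-search space, per device."""
    inv = [1.0 / t for t in avg_ms]
    total = sum(inv)
    return [x / total for x in inv]

avg_ms = [0.00125, 0.00236, 0.01782]    # average time per sub-search space (ms)
remaining = [1233482, 462556, 309450]   # still unexplored, per device
r = ranks(avg_ms)
blocks = [int(rank * rem) for rank, rem in zip(r, remaining)]
# r rounds to roughly 0.625, 0.331, 0.044, and blocks come out close to
# Table 1's last column (the table appears to use rounded ranks).
```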
Another challenge GPUs pose is that they achieve the best performance when
running hundreds or even thousands of threads simultaneously. But to use that
level of parallelism, they must have enough work to keep that many threads
busy. Otherwise, when a GPU receives a block with fewer sub-search spaces than the number of threads that would allow it to achieve its best performance, the average time needed to explore one sub-search space increases sharply.
For example, the Nvidia GeForce GTX 980M takes about 1.1 s to find all the solutions for the n-Queens 13 problem when splitting it into 742,586 sub-search spaces, and 2.4 s when splitting it into only 338 sub-search spaces. This challenge also applies to CPUs, but is less problematic there due to their lesser degree of parallelism compared with GPUs.
To overcome that challenge, sub-search spaces may be further divided inside
a device, by applying a multiplier factor m to the size of a block and turning
a block of sub-search spaces into a block with m times the original number of
sub-search spaces, that will be created as presented in Example 1.
Communication
To reduce the amount of data that is transferred to each device, all of them will
receive the full CSP, that is, all the constraints, variables and their domains,
at the beginning of the solving process. Afterwards, when a device must be
instructed to solve a new block of sub-search spaces, instead of sending all the
sub-search spaces to the device, only the information needed to create those
sub-search spaces is sent.
If a device is to solve sub-search spaces SS2 and SS3 from Example 1, it
will receive the information that the tree must be expanded down to depth 2,
that the values of the first variable are repeated 2 times and the values of the
second variable are repeated 1 time only (not repeated). With this information
the device will know that the values of the first variable are repeated 2 times,
so the third sub-search space (SS3) will get the second value of that variable,
and so on down to the expansion depth. The values of the variables that were
not expanded are simply copied from the original CSP that was passed to the
devices at the beginning of the solving process.
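The reconstruction just described amounts to a mixed-radix decoding of the sub-search space number. The sketch below mirrors Example 1; `build_subspace` is an invented name, and PHACT works on its own internal representation.

```python
# Illustrative sketch: rebuild the i-th sub-search space from the compact
# description sent to a device (the expansion depth and, implicitly, how many
# consecutive sub-search spaces repeat each value of each expanded variable).

def build_subspace(index, domains, depth):
    """index: 0-based number of the sub-search space; domains: list of
    (name, sorted values); depth: how many leading variables were expanded."""
    sizes = [len(d) for _, d in domains[:depth]]
    # reps[k]: how many consecutive sub-search spaces share variable k's value
    # (computed right to left, as in a mixed-radix number system).
    reps = [0] * depth
    repeats = 1
    for k in range(depth - 1, -1, -1):
        reps[k] = repeats
        repeats *= sizes[k]
    sub = {}
    for k, (name, values) in enumerate(domains[:depth]):
        sub[name] = [values[(index // reps[k]) % sizes[k]]]
    for name, values in domains[depth:]:
        sub[name] = list(values)   # copied unchanged from the original CSP
    return sub

csp = [("V1", [1, 2]), ("V2", [1, 2]), ("V3", [1, 2])]
# With 0-based numbering, index 2 yields SS3 = {V1: [2], V2: [1], V3: [1, 2]}:
# V1's values repeat 2 times each, V2's repeat 1 time each.
```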
Each time a work-item needs a new sub-search space to explore, it increments the counter of the first/next sub-search space that is yet to be explored on that device and creates the sub-search space corresponding to the counter's previous value. Then it will do labeling, propagation and backtracking on that search space, repeating all these steps until either all the sub-search
spaces of that block have been explored, when all the solutions must be found,
or only one solution is wanted and one of the work-items on that device finds
a solution.
Implementation Details
Several tests were made to find the best number of work-groups to use for each
type of device. It was found that for CPUs and MICs the best results were achieved with the same number of work-groups as the number of compute units of the device. For GPUs, the predefined number of work-groups is 4096, due to the increased level of parallelism allowed by this type of device.
The user can specify how many sub-search spaces must be created or let PHACT estimate that number. To estimate the number of sub-search spaces to generate, PHACT sums the work-items that will be used on all the devices and multiplies that value by 40 if all the solutions must be found for the current CSP, or by 100 if only one solution is required or when solving an optimization problem. After several tests, these values (40 and 100) were found to achieve good load balancing between the devices, and as such they are the predefined values.
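The estimate above can be sketched in a couple of lines (the multipliers 40 and 100 come from the text; the function itself is illustrative, not PHACT's API):

```python
# Sketch of PHACT's predefined estimate for the number of sub-search spaces:
# total work-items across all devices, times 40 when all solutions are wanted,
# or times 100 for one-solution and optimization runs.

def estimate_subsearch_spaces(work_items_per_device, find_all):
    total = sum(work_items_per_device)
    return total * (40 if find_all else 100)

# e.g. a CPU contributing 8 work-items plus a GPU with 4096 work-groups
# of 32 work-items each (hypothetical device configuration):
estimate = estimate_subsearch_spaces([8, 4096 * 32], find_all=True)  # 5243200
```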
When looking for just one solution or optimizing, the amount of work sent to each device is reduced by generating more sub-search spaces and decreasing the size of the blocks sent to the devices, which makes each block faster to explore, ensuring that all the devices synchronize on the host more frequently.
As for the number of work-items per work-group, CPUs and MICs are
assigned one work-item per work-group, as their compute units can only exe-
cute one thread at a time.
On the contrary, each GPU compute unit can execute more than one thread
simultaneously. For example, the Nvidia GeForce GTX 980 has 16 SMs with 128
CUDA cores3 each, making a total of 2048 CUDA cores. Nevertheless, each SM is
only capable of executing simultaneously 32 threads (using only 32 CUDA cores
at the same time) making it capable of running 512 threads simultaneously [15].
Each SM has very limited resources that are shared between work-groups and
their work-items, thus limiting the number of work-items per work-group that
can be used according to the resources needed by each work-item. The main
limitation is the size of the local memory of each SM that is shared between
all the work-items of the same work-group and between some work-groups (8
work-groups for the Nvidia GeForce GTX 980).
For this reason, PHACT estimates the best number of work-items per work-
group to use for GPUs, by limiting the amount of local memory required to the
size of the available local memory on the GPU. When the available local memory
is not enough to efficiently use at least one work-item per work-group, PHACT
will only use the global memory of the device, which is much larger but also
much slower, and 32 work-items per work-group, as each SM is only capable of
running 32 threads simultaneously.
3. A CUDA core is a processing element capable of executing one integer or floating-point instruction per clock for a thread.
Note that PHACT represents variable domains as either 32-bit bitmaps, multiples of 64-bit bitmaps, or as (compact) intervals. When using intervals, PHACT is slower than when using bitmaps, but intervals are meant to be used instead of larger bitmaps on systems where the size of the RAM is an issue.
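The trade-off can be illustrated with two toy domain representations (illustrative Python; PHACT's bitmaps and intervals are data structures inside OpenCL kernels):

```python
# Toy contrast between a bitmap domain and an interval domain.

class BitmapDomain:
    """Values 0..63 stored as set bits in one word: exact, fast membership."""
    def __init__(self, values):
        self.bits = 0
        for v in values:
            self.bits |= 1 << v
    def remove(self, v):
        self.bits &= ~(1 << v)
    def contains(self, v):
        return (self.bits >> v) & 1 == 1
    def size(self):
        return bin(self.bits).count("1")

class IntervalDomain:
    """Only the bounds are stored: compact, but a hole inside the interval
    is not representable, so only bound tightening is supported."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def remove(self, v):       # only effective at the bounds
        if v == self.lo:
            self.lo += 1
        elif v == self.hi:
            self.hi -= 1
    def size(self):
        return self.hi - self.lo + 1

d = BitmapDomain({1, 2, 3})
d.remove(2)                    # bitmap can remove an inner value exactly
```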
The techniques described in this section allow PHACT to use all the devices compatible with OpenCL to solve a CSP. It splits the search space into multiple sub-search spaces that are distributed among the devices in blocks, to reduce the number of communications between the host and the devices. The size of each block is calculated according to the speed of the respective device in solving the previous blocks, to try to achieve a good load balancing between the devices.
The size of the data transfers between the devices and the host is reduced by
replacing the blocks of fully created search spaces with a small data set containing
the information needed for a device to generate those search spaces.
5 Results

Tables 2, 3, 4 and 5 present the elapsed times and speedups when solving all the problems on M1, M2, M3 and M4, respectively. Five of the six problem models were retrieved from the Minizinc Benchmarks suite [12]. The Langford Numbers problem was retrieved from CSPLib [9], due to the absence of reified constraints in PHACT and PaCCS, which are used in the Minizinc Benchmarks
model, which would lead to different constraints being used among the three
solvers. PaCCS does not have the “absolute value” constraint implemented, so
it was not tested with the All Interval problem.
This set of problems allowed evaluating the solvers with 8 different constraints combined with each other in different ways. All the solutions were found for the problems whose names are followed by "(Count)" in the tables, the optimal solution was sought for the problem identified with "(Optim.)", and for the problem whose name is followed by "(One)", only one solution was required.
For simplicity, the 4 tables have the resources used on the respective machine
identified as R1, R2, R3 and R4, where R1 means using only a single thread on
the CPU, R2 means using all the threads of that CPU, R3 means using all the
threads on the CPU and one device (Geforce, Tesla, Tahiti or MIC), and R4
means using all the threads on the CPU and two identical devices (MICs or
Tahitis). It must be noted that only PHACT is capable of using R3 and R4
resources.
Table 2 shows that using the GeForce to help I7 allowed speedups of up to 4.66. However, for two problems, also using the GeForce resulted in more time being needed to solve the same problems. This result is mainly due to the small number of work-items per work-group that was effectively used on the GeForce, owing to the local memory limitations detailed in Sect. 4. On this machine, adding the GeForce to help I7 allowed a geometric mean speedup of 1.53.
The slowdown noted when optimizing the Golomb Ruler with 12 marks is also due to the impossibility of different devices communicating with each other while their kernels are running, as stated in Sect. 4. This is problematic when optimizing, as a device which finds a better solution cannot tell the other devices to find only solutions better than the one it just found. Instead, it will finish exploring its block of sub-search spaces and only then inform the host about the new solution; another device will learn about the new bound to optimize only when it finishes its own block.
Due to this limitation, the devices spend some time looking for solutions that
may already be worse than the ones found by other devices. This problem was
also noted on the other three machines.
As for the Langford Numbers problem with 14 numbers, the worse result when adding the GeForce was due to the very unbalanced sub-search spaces that are generated: most of the sub-search spaces are easily detected as inconsistent, and only a few contain most of the work. This is problematic because, as each thread explores each sub-search space sequentially, in the end only a few threads will be working on the harder sub-search spaces while the others are idle. This problem was also noted on the other three machines.
PHACT was faster than PaCCS in all problems, achieving speedups of up
to 5.37.
Compared with Gecode, PHACT achieved good speedups on all the problems except Market Split, a simple problem with only one constraint type, which may have a faster propagator in Gecode. In contrast, with the Latin problem, Gecode was 127.85 times slower than PHACT when
Table 2. Elapsed times and speedups on M1, with 4 cores and 1 GPU
using only the CPU. Gecode was slower in solving this problem with all the
CPU threads than when using only one thread, which suggests that the method
used for load balancing between threads is very inefficient for this problem. This
behavior of Gecode was noted in all the machines.
Table 3 presents the results of solving the same problems on M2. Using the
Tesla GPU to help Xeon 1 resulted in a slowdown in most cases: on average,
adding the Tesla yielded a speedup of only 0.84, i.e., a slowdown. This is
because the Tesla was the slowest GPU used in the tests and is no match for
Xeon 1: the work done by the Tesla did not compensate for the time spent
by Xeon 1 (the host) to control the Tesla (the device).
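This trade-off can be stated as a simple cost model (an illustrative sketch with assumed parameters, not measurements from the paper): offloading pays off only when the device's solving time plus the host-side control and communication overhead is below the time the host would need for the same work.

```python
def offload_pays_off(n_spaces, host_rate, device_rate, control_overhead):
    """Assumed linear cost model: rates are sub-search spaces solved per
    second, and `control_overhead` is the host time (in seconds) spent
    managing the device.  Offloading helps only if the device path,
    including that overhead, is faster than the host working alone."""
    host_time = n_spaces / host_rate
    device_time = n_spaces / device_rate + control_overhead
    return device_time < host_time

# A slow device behind a fast host (the Tesla vs. Xeon 1 situation):
slow = offload_pays_off(1000, host_rate=50.0, device_rate=10.0,
                        control_overhead=5.0)
# A fast device absorbs the same overhead easily:
fast = offload_pays_off(1000, host_rate=10.0, device_rate=50.0,
                        control_overhead=5.0)
```

The rates and overhead here are made up for illustration; the qualitative conclusion matches the M2 observations: a device slower than its host cannot amortize the control cost.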
On this machine, PHACT was faster than PaCCS in all but one problem,
with an average speedup of 1.44 in favor of PHACT. Compared with Gecode,
PHACT was faster on all the problems with all the resource combinations.
The results for the M3 machine are presented in Table 4. This machine has
the CPU with the greatest number of cores (64) among those used in the tests,
paired with two Tahiti GPUs, which are faster than the Tesla but slower than
the Geforce. It is therefore very hard for the Tahitis to show performance
gains next to a 64-core CPU. Nevertheless, on the All Interval 15 problem
they sped up the solving process by 1.48 times.
Table 3. Elapsed times and speedups on M2, with 40 cores and 1 GPU

On average, adding the two Tahiti GPUs to help Opteron did not allow any
speedup, because the time spent by Opteron to control and communicate with
the Tahitis was similar to the time that the Opteron would take to perform the
work done by the Tahitis.
The issues with Golomb Ruler and Langford Numbers discussed earlier in
this section were also observed on this machine.
Compared with PaCCS, PHACT achieved speedups ranging from 0.21, on a
very small problem, to 4.67. PHACT was faster than Gecode in all the tests
except when optimizing Golomb Ruler 12 with the Opteron and one Tahiti.
Table 5 presents the results on the M4 machine. This machine has two MICs,
whose architecture is closer to that of CPUs than to that of GPUs, so they
are better suited than GPUs to sequential work. That difference was visible
on the Langford Numbers problem, where they achieved a speedup of 1.51
despite the unbalanced sub-search spaces. On this machine, adding the two
MICs to help Xeon 2 yielded an average speedup of 1.45. When counting all
the solutions of Costas Array 15, the two MICs allowed a top speedup of 1.90.
Compared with PaCCS and Gecode, the results are very similar to those
achieved on the other machines: PHACT was faster than Gecode in all but
one problem and faster than PaCCS in 19 of the 24 tests.
Table 4. Elapsed times and speedups on M3, with 64 cores and 2 GPUs
Table 5. Elapsed times and speedups on M4, with 32 cores and 2 MICs
Fig. 1. Speedups when using PHACT against PaCCS and Gecode on the four machines
References
1. Arbelaez, A., Codognet, P.: A GPU implementation of parallel constraint-based
local search. In: 2014 22nd Euromicro International Conference on PDP, pp. 648–
655. IEEE (2014)
2. Barták, R., Salido, M.A.: Constraint satisfaction for planning and scheduling prob-
lems. Constraints 16(3), 223–227 (2011)
3. Brailsford, S., Potts, C., Smith, B.: Constraint satisfaction problems: algorithms
and applications. Eur. J. Oper. Res. 119, 557–581 (1999)
4. Campeotto, F., Dal Palù, A., Dovier, A., Fioretto, F., Pontelli, E.: Exploring the
use of GPUs in constraint solving. In: Flatt, M., Guo, H.-F. (eds.) PADL 2014.
LNCS, vol. 8324, pp. 152–167. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-04132-2_11
5. Chu, G., Schulte, C., Stuckey, P.J.: Confidence-based work stealing in parallel
constraint programming. In: Gent, I.P. (ed.) CP 2009. LNCS, vol. 5732, pp. 226–
241. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04244-7_20
6. Diaz, D., Richoux, F., Codognet, P., Caniou, Y., Abreu, S.: Constraint-based local
search for the Costas Array problem. In: Hamadi, Y., Schoenauer, M. (eds.) LION
2012. LNCS, pp. 378–383. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34413-8_31
7. Filho, C., Rocha, D., Costa, M., Albuquerque, P.: Using constraint satisfaction
problem approach to solve human resource allocation problems in cooperative
health services. Expert Syst. Appl. 39(1), 385–394 (2012)
8. Gaster, B., Howes, L., Kaeli, D., Mistry, P., Schaa, D.: Heterogeneous Computing
with OpenCL. Morgan Kaufmann Publishers Inc., San Francisco (2011)
9. Jefferson, C., Miguel, I., Hnich, B., Walsh, T., Gent, I.P.: CSPLib: a problem
library for constraints (1999). http://www.csplib.org
10. Jenkins, J., Arkatkar, I., Owens, J.D., Choudhary, A., Samatova, N.F.: Lessons
learned from exploring the backtracking paradigm on the GPU. In: Jeannot, E.,
Namyst, R., Roman, J. (eds.) Euro-Par 2011. LNCS, vol. 6853, pp. 425–437.
Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23397-5_42
11. Mairy, J.-B., Deville, Y., Lecoutre, C.: Domain k-wise consistency made as sim-
ple as generalized arc consistency. In: Simonis, H. (ed.) CPAIOR 2014. LNCS,
vol. 8451, pp. 235–250. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07046-9_17
12. MIT: A suite of MiniZinc benchmarks (2017). https://github.com/MiniZinc/minizinc-benchmarks
13. Munshi, A., Gaster, B., Mattson, T.G., Fung, J., Ginsburg, D.: OpenCL Program-
ming Guide, 1st edn. Addison-Wesley Professional, Boston (2011)
14. Nethercote, N., Stuckey, P.J., Becket, R., Brand, S., Duck, G.J., Tack, G.: MiniZinc:
towards a standard CP modelling language. In: Bessière, C. (ed.) CP 2007. LNCS,
vol. 4741, pp. 529–543. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74970-7_38
15. NVIDIA Corporation: NVIDIA GeForce GTX 980: featuring Maxwell, the most
advanced GPU ever made. White paper. NVIDIA Corporation (2014)
16. Pedro, V.: Constraint programming on hierarchical multiprocessor systems. Ph.D.
thesis, Universidade de Évora (2012)
17. Pedro, V., Abreu, S.: Distributed work stealing for constraint solving. In: Vidal,
G., Zhou, N.F. (eds.) CICLOPS-WLPE 2010, Edinburgh, Scotland, U.K. (2010)
18. Rolf, C.C., Kuchcinski, K.: Load-balancing methods for parallel and distributed
constraint solving. In: 2008 IEEE International Conference on Cluster Computing,
pp. 304–309, September 2008
19. Rolf, C.C., Kuchcinski, K.: Parallel solving in constraint programming. In: MCC
2010, November 2010
20. Régin, J.-C., Rezgui, M., Malapert, A.: Embarrassingly parallel search. In: Schulte,
C. (ed.) CP 2013. LNCS, vol. 8124, pp. 596–610. Springer, Heidelberg (2013).
https://doi.org/10.1007/978-3-642-40627-0_45
21. Schulte, C.: Parallel search made simple. In: Beldiceanu, N., et al. (eds.) Proceed-
ings of TRICS: CP 2000, Singapore, September 2000
22. Schulte, C., Duchier, D., Konvicka, F., Szokoli, G., Tack, G.: Generic constraint
development environment. http://www.gecode.org/