Efficient provenance storage
Proceedings of the 2008 ACM SIGMOD international conference on Management of …, 2008
As the world is increasingly networked and digitized, the data we store has more and more frequently been chopped, baked, diced and stewed. In consequence, there is an increasing need to store and manage provenance for each data item stored in a database, describing exactly where it came from, and what manipulations have been applied to it. Storage of the complete provenance of each data item can become prohibitively expensive. In this paper, we identify important properties of provenance that can be used to considerably reduce the amount of storage required.
We identify three techniques that decrease the amount of storage required for provenance: a family of factorization processes and two methods based on inheritance. We have used these techniques to significantly reduce the provenance storage costs associated with constructing MiMI [22], a warehouse of data regarding protein interactions, as well as those of two provenance stores, Karma [31] and PReServ [20], produced through workflow execution. In these real provenance sets, we were able to reduce the size of the provenance by up to a factor of 20. Additionally, we show that this reduced store can be queried efficiently and that incremental changes can be made inexpensively.
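To make the two ideas concrete, the following is a minimal sketch and not the paper's algorithms, which the abstract does not spell out. It assumes a toy model in which provenance is a sequence of step descriptions, "factorization" means storing each distinct provenance record once and referencing it by position, and "inheritance" means an item without its own record takes the record of its containing item. The class `ProvenanceStore`, its methods, and the item names are all hypothetical.

```python
class ProvenanceStore:
    """Toy provenance store (illustrative assumption, not the paper's design)."""

    def __init__(self):
        self.records = []   # distinct provenance records, each stored once
        self.index = {}     # record -> position in self.records
        self.explicit = {}  # item -> position of its own record
        self.parent = {}    # item -> containing item (e.g. row -> table)

    def add(self, item, provenance=None, parent=None):
        if parent is not None:
            self.parent[item] = parent
        if provenance is not None:
            key = tuple(provenance)
            # Factorization-style sharing: identical records are stored once.
            if key not in self.index:
                self.index[key] = len(self.records)
                self.records.append(key)
            self.explicit[item] = self.index[key]

    def lookup(self, item):
        # Inheritance-style lookup: walk up to the nearest ancestor
        # that has an explicitly recorded provenance.
        while item is not None:
            if item in self.explicit:
                return self.records[self.explicit[item]]
            item = self.parent.get(item)
        return None


store = ProvenanceStore()
store.add("table", ["imported from source A", "normalized"])
store.add("row1", parent="table")   # no record of its own
store.add("row2", parent="table")   # no record of its own
store.add("row3", ["imported from source A", "normalized"], parent="table")
store.add("row4", ["hand-curated"], parent="table")
```

In this sketch, five items are described by only two stored records: `row1` and `row2` inherit from `table`, and `row3` shares the record that `table` already stored. Real provenance is typically structured (for example, trees or graphs of operations), so an actual implementation would need a richer notion of shared structure than exact equality of whole records.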