Persistent Storage System
~~~~~~~~~~~~~~~~~~~~~~~~~

-------
Purpose
-------

The Storage System is responsible for persistently storing and loading
Patches, Atoms and similar objects for the package manager. Other object
kinds might be added in the future.

----------
Operations
----------

Required operations are:

- read all objects of a given kind
- read objects identified by a list of tuples (attribute name, match
  criteria, value)
- change status
- save a new object
- delete an object (only for cleanup after update, does not need to be fast)
- a query interface similar to what rpm offers

-----------
Constraints
-----------

In order of relevance:

- It will be part of the ZYPP library, and as such needs to be reentrant
  and thread safe.
- There is only one attribute in an object that might need to be modified,
  the status. Thus, no universal "modify object" operation is needed.
- The most time-critical operation is "read all objects", since it delays
  the startup of the YaST package manager. Startup already takes an
  uncomfortably long time.
- It should try to cut down on memory usage, since it will be in use at
  install time, at least when updating.
- We'll probably get something on the order of 1000-5000 of these objects.
- It must be possible to use different backends for the actual low-level
  storage, e.g. Berkeley DB, PostgreSQL, MySQL, flat file.

------------
Architecture
------------

Here's a layer diagram of the main architecture. The parenthesized numbers
indicate APIs::

    +--------------------+--------------+
    | caller, e.g. package manager      |
    +--------(1)---------+              +
    | query interface    |              |
    +-------------------(2)-------------+
    | Persistent Storage Core           |
    +--------(3)---------+------(4)-----+
    | Backend plugin     | Kind plugin  |
    +--------------------+--------------+
    | Backend            | Parser       |
    | (e.g. Berkeley DB) |              |
    +--------------------+              +
    | Filesystem         |              |
    +--------------------+--------------+

APIs:

1) Query API
2) Core API
3) Backend Plugin API
4) Kind Plugin API

These components are described below. The APIs will be specified in a
later, more detailed follow-up.

Query Interface
===============

This implements the rpm-like query operation on top of the simpler, more
general search operations of the Core API.

FIXME: Insert list of operations.

Persistent Storage Core
=======================

The Core is the "main contact" for the layers above. It breaks the complete
functionality down into simpler operations for the modules in the layer
below.

FIXME: Insert list of operations.

Backend Plugin
==============

The backend plugin handles data objects in a form that is suitable for the
backend, i.e. a data record is represented by:

- an XML string
- a set of keys, which are pairs (attribute name, value)

The following operations are available:

- create the storage database
- find records by attributes and return a list of handles
- insert a new record and return a handle to it

These operations affect the record that is referenced by a handle:

- read the record
- update the status
- remove (delete) it

Kind plugins
============

These are responsible for dealing with the specifics of the kind of objects
that are stored. This includes:

- knowing the path names of the database
- knowing the attributes and keys
- converting the internal representation of a data object to XML and back
- extracting the key values from an XML string

Parser
======

The backend stores only XML strings as content, plus indexes for fast
access.
For this, we need a parser that creates simple struct-like objects from
the XML string. It is derived from the already implemented XMLNodeIterator
class.

--------
Backends
--------

This is an overview of backends that look promising for handling the
low-level storage.

Plain Files
===========

Basic idea: Store the data for each object as an XML element in a flat
file, store the status separately and add indices for fast access.

For each kind there is

- a master file, which contains:

  - the pathnames of the other files
  - storage options

- one or more data files, which contain all the data as XML, except the
  status
- the status for each object as a binary file, each byte representing the
  status for one object. The association between status byte and object is
  contained in an index file.
- one or more index files (for all data fields that need to be a key). The
  index could either be an ordered flat file to mmap, or something like a
  tree or hash table. (I'm still looking for a suitable library.)

Advantage: More direct control over everything, can be tailored to our
needs, probably the most efficient way to do it.

Disadvantage: Needs a lot of effort to do it right and get good
performance.

Berkeley DB
===========

Berkeley DB is a small non-relational embedded database library that
provides compact transparent data storage with a wealth of related
services. Data is managed as (key, value) tuples, and indexing is performed
on the keys.

Basic idea: Store the data for each object as an XML string in a Berkeley
DB. Fast access and everything else is managed by the database. Berkeley DB
is already used within rpm, so it adds no new dependency. The library is
1-2 MB (depending on whether you use the C++ binding or only C).

Another option: Instead of saving the XML string, serialize the ParserData
structure and save it directly. This is faster, but harder to inspect when
debugging.

Advantage: Everything's there for queries with a single index.
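
The single-index case can be sketched roughly as follows. This is only an
illustration: Python's standard ``dbm`` module stands in for a (key, value)
store and is not the Berkeley DB API, and the helper names, the key
``patch-1234`` and the XML element names are all made up for the example.

```python
import dbm
import os
import tempfile


def save_object(db_path, key, xml):
    """Store one object's XML string under its single key (hypothetical helper)."""
    with dbm.open(db_path, "c") as db:  # "c": create the database if it is missing
        db[key.encode("utf-8")] = xml.encode("utf-8")


def load_object(db_path, key):
    """Return the XML string stored under key, or None if there is no such key."""
    with dbm.open(db_path, "r") as db:
        raw = db.get(key.encode("utf-8"))
    return None if raw is None else raw.decode("utf-8")


# Assumed layout: one key/value database per object kind.
workdir = tempfile.mkdtemp()
patch_db = os.path.join(workdir, "patches")

save_object(patch_db, "patch-1234",
            '<patch id="patch-1234"><summary>demo</summary></patch>')
found = load_object(patch_db, "patch-1234")
missing = load_object(patch_db, "patch-9999")
```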

Disadvantage: Not so good with multiple indexes. Supporting them needs
some effort.

Side note: xmldb cannot be used since it needs a lot of special libraries
with specific versions. For this reason, it did not make it into SUSE
Linux 10.0.

SQLite
======

SQLite is a very small (200-300 KByte) embedded SQL engine that contains
both server and client and has no multi-user access support.

Basic idea: Store the data as an XML string in a record. For each key data
field, add another attribute (column) to the record.

Disadvantage: We don't really need the relational aspects of SQLite, and
not many people in the team know this topic or have experience with it. It
is an additional library in the install system. Probably the worst
performance (but SQLite claims to be fast for SQL).

Advantage: Maximum flexibility, especially in queries. Easy changes in the
structure.
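
To make the SQLite idea concrete, here is a minimal sketch using Python's
standard ``sqlite3`` module. It assumes a single key field called ``name``
and an integer ``status``. The table layout and all function names are
hypothetical and only mirror the operations listed for the Backend Plugin
(create, insert, find, read, update status); they are not the planned API.

```python
import sqlite3


def create_store(conn):
    """Create the table: one row per object, one column per key field."""
    conn.execute(
        "CREATE TABLE patches ("
        " id INTEGER PRIMARY KEY,"       # doubles as the record handle
        " name TEXT NOT NULL,"           # key field copied out of the XML
        " status INTEGER NOT NULL,"      # the only value modified later
        " xml TEXT NOT NULL)"            # full object as an XML string
    )
    conn.execute("CREATE INDEX patches_by_name ON patches (name)")


def save_object(conn, name, status, xml):
    """Insert a new record and return a handle to it."""
    cur = conn.execute(
        "INSERT INTO patches (name, status, xml) VALUES (?, ?, ?)",
        (name, status, xml),
    )
    return cur.lastrowid


def find_by_name(conn, name):
    """Find records by the key attribute and return a list of handles."""
    rows = conn.execute(
        "SELECT id FROM patches WHERE name = ? ORDER BY id", (name,)
    )
    return [row[0] for row in rows]


def set_status(conn, handle, status):
    """Update the status of the record referenced by handle."""
    conn.execute("UPDATE patches SET status = ? WHERE id = ?", (status, handle))


def read_object(conn, handle):
    """Return (status, xml) for the record, or None if it does not exist."""
    return conn.execute(
        "SELECT status, xml FROM patches WHERE id = ?", (handle,)
    ).fetchone()


conn = sqlite3.connect(":memory:")  # in-memory database, for the sketch only
create_store(conn)
handle = save_object(conn, "fix-foo", 0, "<patch><name>fix-foo</name></patch>")
set_status(conn, handle, 1)
handles = find_by_name(conn, "fix-foo")
record = read_object(conn, handle)
```

Adding a further key field would mean one more column and one more
``CREATE INDEX`` statement, which is the "easy changes in the structure"
point made above.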