-Cache Design Issues
-===================
-dmacvicar@suse.de
-Solver Requisites
-=================
-
-The queries for the solver include:
-(Kindly provided by schubi)
-
-1. get all installed items
-2. get all uninstalled items
-3. get all installed/uninstalled items of a specific kind
-4. just find installed item with same kind/name as item
- Helper::findInstalledByNameAndKind (const ResPool & pool, const string &
- name, const Resolvable::Kind & kind)
-5. just find uninstalled item with same kind/name as item
- Helper::findUninstalledByNameAndKind (const ResPool & pool, const string
- & name, const Resolvable::Kind & kind)
-6. just find best (according to edition) uninstalled item with same
- kind/name as item
- // *DOES* check edition
- PoolItem_Ref Helper::findUpdateItem (const ResPool & pool, PoolItem_Ref
- item)
-7. check if the given item is the best one of the pool
- // Takes arch, edition in order to check it
- bool Helper::isBestUninstalledItem (const ResPool & pool, PoolItem_Ref item)
-8. get all requires,obsoletes,provides,recommends,... of an item
-9. find all INSTALLED items which provides foo
-10. find all UNINSTALLED items which provides foo
-11. find all items which provides foo
-12. find all locked items
-13. find items which have a freshens entry with a given name or
- corresponds with one of a given name list
-14. find items which have a supplements entry with a given name or
- corresponds with one of a given name list
-15. find items which have a conflicts entry with a given name or
- corresponds with one of a given name list
-16. find items which have a recommend entry with a given name or
- corresponds with one of a given
- name list
-17. find all INSTALLED items which requires foo
-18. find all UNINSTALLED items which requires foo
-
-Most of those can be represented by single table queries which don't
-depend on the schema at all except:
-
-* 6. requires edition comparision
-* 9. 10. 11. 17. 18. Style queries are analyzed in our example for
- different schemas
-* 13. 14. 15. 16. Same as above
-
-Other system schema comparision
-===============================
-
-xpkg - http://www.xpkg.org
---------------------------
-
-Uses a simple dependency model (name,major,minor).
-Database schema only has a packages table.
-
-YUM - http://linux.duke.edu/projects/yum
-----------------------------------------
-
-/var/cache/yum/stable/
-total 348M
-0 cachecookie
-16M filelists.xml.gz
-72M filelists.xml.gz.sqlite
-48 headers
-45M other.xml.gz
-172M other.xml.gz.sqlite
-48 packages
-6.8M primary.xml.gz
-37M primary.xml.gz.sqlite
-951 repomd.xml
-
-The schema is basically a 1:1 copy of the XML schema with
-different databases per xml file. They also keep the raw cache.
-
-The names are not normalized and it is easy to see it looking
-at the file sizes.
-
-Summaries and descriptions are together with resolvable data.
-
-The time to create factory cache from scratch via NFS is:
-
-real 2m20.947s
-user 1m47.155s
-sys 0m9.697s
-
-Once is created, looking for information about a file:
-
-real 0m12.470s
-user 0m7.028s
-sys 0m0.464s
-
-Solver queries with yum:
-
-Find package that provides foo:
-
- sqlite> select pk.name,pk.version,pk.release,pk.epoch from packages pk,provides pr where pk.pkgKey=pr.pkgKey and pr.name="kdelibs3";
- kdelibs3|3.5.6|4|0
- kdelibs3|3.5.6|4|0
-
- real 0m0.008s
- user 0m0.008s
- sys 0m0.000s
-
- sqlite> explain query plan select pk.name,pk.version,pk.release,pk.epoch from packages pk,provides pr where pk.pkgKey=pr.pkgKey and pr.name="kdelibs3";
- 0|1|TABLE provides AS pr WITH INDEX providesname
- 1|0|TABLE packages AS pk USING PRIMARY KEY
- sqlite>
-
-Smart
------
-
-Smart uses its own cache format. Creating it for stable-x86
-results in a 6.4M cache file. Creating it takes:
-
- real 2m18.128s
- user 0m31.826s
- sys 0m2.740s
-
-And later, searching for a package:
-
- real 0m3.198s
- user 0m2.628s
- sys 0m0.444s
-
-ZYPP Proposed schema Tests
---------------------------
-
-The current schema available in zypp-trunk.
-
-Find package that provides foo:
-
- select r.name,r.version,r.release,r.epoch from resolvables r,dependencies d,versioned_dependencies vd,names n where d.resolvable_id=r.id and d.dependency_type=1 and vd.name_id=n.id and n.name="kdelibs3" and vd.dependency_id=d.id;
- kdelibs3|3.5.6|4|
- kdelibs3|3.5.6|4|
-
- real 0m0.695s
- user 0m0.636s
- sys 0m0.036s
-
-Slow, this query is resolved as:
-
- 0|2|TABLE versioned_dependencies AS vd
- 1|1|TABLE dependencies AS d USING PRIMARY KEY
- 2|0|TABLE resolvables AS r USING PRIMARY KEY
- 3|3|TABLE names AS n USING PRIMARY KEY
-
-Trying again with a new index:
-
- CREATE INDEX dependency_type ON dependencies (dependency_type);
- 12M
-
- real 0m0.485s
- user 0m0.456s
- sys 0m0.024s
-
-Creating an index for the versioned_dependencies:
-
- CREATE INDEX versioned_dependencies_dependency ON versioned_dependencies (dependency_id);
-
- 0|3|TABLE names AS n WITH INDEX sqlite_autoindex_names_1
- 1|1|TABLE dependencies AS d WITH INDEX dependency_type
- 2|0|TABLE resolvables AS r USING PRIMARY KEY
- 3|2|TABLE versioned_dependencies AS vd WITH INDEX versioned_dependencies_dependency
-
- real 0m0.579s
- user 0m0.552s
- sys 0m0.016s
-
-Still slow. Last try:
-
-create index vd_name_id on versioned_dependencies(name_id);
-
- real 0m0.007s
- user 0m0.004s
- sys 0m0.004s
-
-This means the bottleneck was in the normalized names table.
-Now the query is done in this way:
-
- 0|3|TABLE names AS n WITH INDEX sqlite_autoindex_names_1
- 1|2|TABLE versioned_dependencies AS vd WITH INDEX vd_name_id
- 2|1|TABLE dependencies AS d USING PRIMARY KEY
- 3|0|TABLE resolvables AS r USING PRIMARY KEY
-
-Adding all those indexes keep the database (without file lists and
-descriptions) in a 15 M size. Some extra indexes can be dropped
-if the query plan shows they are not used.
-
-With 40.000 packages. A database without normalized names uses 50mb, while a normalized one uses
-around 43 mb.
-
-The query in the normalized one takes 2.092s. Adding the names index bumps the size to
-50mb and gets the query in 0.010s.
-
-The query in the non-normalized one takes 0.443s. Adding the names index bumps the size to
-76 mb and gets the query in 0.010s.
-
-
-stable-x86 distribution facts [^onerepo]
-========================================
-
-* resolvable names: 8.204
-* unique resolvable names: 8.200
-* dependency names: 119.425
-* unique dependency names: 30.062
-
- [^onerepo]: Note that those numbers are only 1 repository.
-
-If we take the proportions as a fact for every repository, that means 14.5
-dependencies per package in average.
-
-If you take in count that adding a repository which is not factory has not much
-chance of adding much new names the unique 30k unique dependency names will not
-grow too much, where the number of dependencies still grows up.
-
-Glob queries (ie: zypper)
-
- sqlite> explain query plan select * from dependencies where name like "kde%";
- 0|0|TABLE dependencies
-
-As we see, this query does not use any index, because, ok, there is none.
-
- sqlite> CREATE INDEX dependency_name ON dependencies (name);
- sqlite> explain query plan select * from dependencies where name like "kde%";
- 0|0|TABLE dependencies
-
-This makes sense, as from [1], "The GLOB and LIKE operators are expensive in
-SQLite because they can't make use of an index. One reason is that these are
-implemented by user functions which can be overridden". Index can be bypassed,
-if they slowdown the query:
-
-Example: "WHERE x=5" can be changed to "WHERE x+0=5" or "WHERE +x=5" to bypass the index.
-( [1] 5.5.3 )
-
-In contrast, this query does use an index:
-
- sqlite> explain query plan select * from resolvables r,dependencies d where d.name="kdelibs";
- 0|0|TABLE resolvables AS r
- 1|1|TABLE dependencies AS d WITH INDEX dependency_name
-
-Database Size
-=============
-
-A zmd style schema made with repotools, of stable-x86 takes 5.2 M
-It is unrealistic as it does not contains descriptions but useful for comparision.
-
- CREATE INDEX dependency_name ON dependencies (name);
-
-Ths increases the size of the database to 8.4M!!.
-
-We normalize the database, taking the names to a new table.
-
- insert into dependencies (id, name_id, deptype_id, resolvable_id, refers_id, relation_id, epoch, version, release, pre ) select db.id, n.id, db.deptype_id, db.resolvable_id, db.refers_id, db.relation_id, db.epoch, db.version, db.release, db.pre from dependencies_backup db, names n where n.name=db.name;
-
-The new result is a 30kb difference, after vacuum. (????????)
-
- 5256192 stable.db
- 5283840 stable.old.db
-
- sqlite> select sum(length(name)) from names;
- 563224
- sqlite> select sum(length(name)) from dependencies;
- 1922000
-
-We should have saved at least 1.922.000 - 563.224 ?????
-
-However the index size for the names column is much smaller.
-The database without normalized names jumps from 5.1M to 6.0M when
-creating a name index, instead of the 8.4M mentioned above.
-
-Growing Test
-============
-
-zypp1.db uses name normalization for dependencies. zypp2.db does not.
-
-8.000 packages
-9.9M zypp1.db
-9.8M zypp2.db
-
-16.000 packages
-18M zypp1.db
-20M zypp2.db
-
-24.000 packages
-26M zypp1.db
-30M zypp2.db
-
-32.000 packages
-34M zypp1.db
-40M zypp2.db
-
-40.000 packages
-43M zypp1.db
-50M zypp2.db
-
-If x = thousand of packages, and y = database size. A normalized
-schema has a slomp of 1.1 and a unnormalized schema a slomp of 1.25
-
-File list data
-==============
-
-For factory. Complete file list is
-
-2.840.491 files (unique names: 890.373)
-119.070 directories
-
-We ommited using a schema with one table and one column.
-
-Test 1, Normalized Directories
-
-|-----------------| |----------------------------------|
-| dirs | | files |
-|-----------------| |----------------------------------|
-| id | name | | dir_id | package_id | name |
-|-----------------| |----------------------------------|
-
-For yum xml format, this is 192M in filelists.xml which can be
-compressed to 17M filelists.xml.gz
-
-Using a simple schema normalizing the directories:
-
-186M filelist.db
-
-Obtaining the file list of a package:
-
-select d.name,f.name from files f,dirs d where f.dir_id=d.id and f.package_id=7557
-takes 0.135s and was not able to improve it more using indexes. This assumes you have the
-key of the package.
-
-Reverse lookup, find the id of a package that owns file and dir.
-
-'select f.package_id from files f,dirs d where f.dir_id=d.id and f.name="libzypp.so" and d.name="/usr/lib";'
-takes 1.582s
-
-Adding a index on file name, grows the database to 259M, but resolves the query in 0.007s.
-
-Test 1, Normalized Directories and file names
-
-|-----------------| |-----------------|
-| dir_names | | file_names |
-|-----------------| |-----------------|
-| id | name | | id | name |
-|-----------------| |-----------------|
-
-|----------------------------------|
-| files |
-|----------------------------------|
-| dir_id | file_id | package_id |
-|----------------------------------|
-
-This approach generates much slower insertions, which can be reduced with some
-application level cache, for example for the directories, which probably get
-inserted lot of files for the same directory. You can avoid N selects, keeping
-the dir_id and name in memory, where N is the average number of files per directory.
-
-This generates a 193M filelist2.db. This includes an index on filenames, otherwise
-every insertion requires a select on a non indexed text column. So this size is
-comparable to the 259M non-normalized table.
-
-I could not get the reverse lookup query under a second.
-
-Application level cache for normalized insertions
-=================================================
-
-|-----------------| |-----------------|
-| table1 | | table2 |
-|-----------------| |-----------------|
-| id | name | | id | name |
-|-----------------| |-----------------|
-
-|----------------------------------|
-| table1_table1 |
-|----------------------------------|
-| t1_id | t2_id | other |
-|----------------------------------|
-
-If you are inserting a list of:
-
-table1 table2 other
------------------------
-dog lalal 423432
-dog lelele 423432
-cat foo 423423
-cat bleh 423423
-cat ruby 423423
-duck nah 432432
-
-If you insert in table1_table1 you will need to
-check if an id already exists for table1 and table2,
-using text comparision, which is slow.
-
-As we insert more than 1 table2 object for the same
-table1 object, the id of table1 can be cached, and
-the followings won't need another select statement.
-
-Representing in memory data
-===========================
-
-There are entities that are in memory, like Architectures and we dont
-want to maintain them in the database store but we would like to be able
-to relate database information with memory information.
-
-Virtual tables allow you to do this, so you can create a architectures table
-that is not stored on disk that is only a set of callbacks to answer queries
-but you can still use it in SQL.
-
-See [3] for more information about virtual tables.
-
-C style API vs C++ Style API
-============================
-
-1000 insertions and iterating using a select
-
- (sqlite3x insert) 2:05 (u 0.07 s 1.09 c 0.00)
- (sqlite3x select) 0 (u 0.01 s 0.00 c 0.00)
-
- (sqlite-c insert) 2:05 (u 0.09 s 0.89 c 0.00)
- (sqlite-c select) 0 (u 0.00 s 0.00 c 0.00)
-
-Oh sqlite is really slow to insert!!! no, I had autocommit enabled so there
-was one transaction per insert, if we do all inserts as a transaction
-
- (sqlite3x insert) 0 (u 0.01 s 0.00 c 0.00)
- (sqlite3x select) 0 (u 0.00 s 0.00 c 0.00)
-
- (sqlite-c insert) 1 (u 0.01 s 0.00 c 0.00)
- (sqlite-c select) 0 (u 0.00 s 0.00 c 0.00)
-
-Ok, that is fast, lets do 10.000 inserts in a row.
-
- (sqlite3x insert) 0 (u 0.20 s 0.00 c 0.00)
- (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00)
-
- (sqlite-c insert) 0 (u 0.20 s 0.02 c 0.00)
- (sqlite-c select) 0 (u 0.03 s 0.00 c 0.00)
-
-repeat same test with flushed caches:
-
- (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00)
- (sqlite3x select) 0 (u 0.04 s 0.00 c 0.00)
-
- (sqlite-c insert) 1 (u 0.18 s 0.00 c 0.00)
- (sqlite-c select) 0 (u 0.04 s 0.00 c 0.00)
-
-Conclusion, the c++ layer does not add a significant performance penalty but
-does reduce the code by 30%-40% as errors are better handled using exceptions
-instead of checking error values. same with transactions using block scopes.
-
-Some Conclusions
-================
-
-- Normalization lead to slower queries (due to JOINs)
-- Normalized tables are faster than non-normalized tables
- using an index.
-- When both use an index, they reach the 0.01s lower limit
-- The index size is much smaller in normalized tables.
-
-Recommendations
-===============
-
-[1] Dependencies names normalization can save space. Especially with
- indexes.
-[2] Dependency type normalization is cheap and can provide a flexible
- capability representation.
-[3] separating dependency types in different tables can also save
- a JOIN, but if combined with [2], leads to (dep types)X(capability impls)
- tables.
-
-
-References
-
- [1]: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
- [2]: http://developer.mozilla.org/en/docs/Storage:Performance
- [3]: http://www.sqlite.org/cvstrac/wiki?p=VirtualTables
- [4]: http://katastrophos.net/andre/blog/2007/01/04/sqlite-performance-tuning-and-optimization-on-embedded-systems/
-
\ No newline at end of file
+See:
+http://en.opensuse.org/Libzypp/Refactoring/CacheSchema
\ No newline at end of file