inicio mail me! sindicaci;ón

Information Systems @ HPI

all about data management

Bachelor-Project ProCSIA: Column Store Benchmarks

Bachelor-Project ProCSIA: Column Store Benchmarks

Authors: Philipp Langer, Florian Westphal, Marian Gawron, Andreas Henning, Fabian Tschirschnitz, Patrick Schulze, Gary Yao, Michael Wolowyk

Introduction

Roughly one week ago the bachelor project team “ProCSIA” (Profiling Column Stores with IBM’s Information Analyzer) evaluated six mainly column-oriented DBMSs in order to find out which one of them works fastest on a given OLAP workload and on simple profiling operations.
These were mostly open source projects (in the case of Sybase IQ 15.2 a free evaluation copy was used):

This test was conducted using a TPC-H benchmark (size: 1 GiB) as well as our own simple benchmark for data profiling-like queries. The second benchmark contains column-based operations, such as calculating a minimum (select min attr from table) or finding out about a column’s frequency distribution (select attr, count(*) from table group by attr). The TPC-H SQL queries as well as the queries used in our own benchmark are attached at the end of this blog post.
Please be aware of the fact that the trial versions of the DBMSs listed above are not comprehensive enough to make solid comparisons of actual performance. However, these trials provided us a hint on which column stores are suited best for our profiling tasks.
We ran the benchmark on a Dell Optiplex 745 with an Intel Core 2 CPU 6600 @ 2,40 GHz and 4 GB of RAM of which 3,25 GB could actually be used on a 32-bit operating system.

DISCLAIMER: The bachelor project team performed these tests to the best of their knowledge. However, they are no many-years database performance tweaking engineers. Also, they aimed to find out how fast the systems are in a none-tweaked mode.

Results

Benchmark 1

We did not execute the TPC-H benchmark on MetaKit as it does not provide an SQL interface. In the figures below you can see how each DBMS performed (for detailed information about the queries, see the end of this blog post). It is interesting to see how different DBMSs perform on this small amount of data. We plan to test the DBMSs with a 100 GiB TPC-H database to find out about scaling issues.

Figure 1: TPC-H 1G load times

Figure 2: Fast running TPC-H queries

Figure 3: Fast running TPC-H queries

TPC-H overall results:

MonetDB: 14,571 sec | Power@Size: 8853,5
Vectorwise: 15,38 sec | Power@Size: 8252.1
InfiniDB: 108 sec | Power@Size: 890,5
Sybase: 262 sec | Power@Size: 448,4
Infobright: 10,36 h| Power@Size: 230,20

Infobright’s poor performance is due to the TPC-H query 21, which took 9 hours to execute. Query 21 contains complex subqueries demanding too much of the MySQL optimizer. This is an issue related to the MySQL optimizer rather than to Infobright’s implementation.

Benchmark 2

The positive impression MonetDB and Vectorwise made in the TPC-H benchmark was confirmed in our own benchmark. Infobright is very good at aggregations as it builds up a knowledge grid saving e.g. the maximum and minimum of most columns.

Figure 4: Calculating the average of a column

We then let the DBMSs calculate the minimum of a column (type varchar) directly followed by the calculation of a maximum. For example, this column was not in Infobright’s knowledge grid so the calculation of the minimum took rather long but the maximum was calculated very fast because of caching. The same phenomenon can be observed with the other databases, too (except MetaKit which works on flat files and does not provide any further optimizing).

Figure 5: Calculating the min/max of a column

The last test we conducted was a frequency distribution. For unknown reasons MonetDB took so long to execute this quite simple query that we had to abort the query.

Figure 6: Calculating the frequency-distribution of a column

Conclusion

In conclusion, we found that MonetDB, Vectorwise, and Infobright provided the most promising results for our application (while being open source products). As mentioned above we are still in the process of evaluating how the DBMSs work on a 100 GiB TPC-H database.

Alphabetical Sorting of Authors

In computer sciences a typical ordering of co-authors of a publication is alphabetical, unless there is a good reason to deviate from this order (for instance if some authors contributed considerably more than others). However, the actual ordering does become interesting for instance for graduating PhD students or professors up for tenure. Committees do consider the number of publications in which the candidate is in fact the first author.

The hypothesis that I want to verify (or better reject) is the following:

Researchers with last names that appear earlier in the alphabet have a career advantage.

Or in other words:   Does alphabetical sorting of authors give unfair advantage to researchers like Til Aach? Here, I measured career advantage simply by counting publications. The reasoning was that researchers with alphabetically early names are promoted more often, obtain more funds, etc., and thus are more successful in their research. I deliberately ignored other measures such as counting citations.

I checked this hypothesis using a 2009 DBLP data set provided by Hannah Bast. The most tricky part was to identify the first letter of the last name, because names were stored in full as in “Philip A. Bernstein”. My (still crude) approach is the following SQL query, which simply checks for the existance of a middle initial. If the author name has such an initial, the first letter after the initial is chosen, if not the first letter after the first space character is chosen:

SELECT PUBID, AUTHOR,
CASE WHEN
LOCATE(’.', AUTHOR) = 0 OR LOCATE(’Jr.’,AUTHOR) > 0 OR LOCATE(’Sr.’,AUTHOR) > 0 OR LOCATE(’.', AUTHOR) = LENGTH(AUTHOR)
THEN
SUBSTR(UPPER(AUTHOR),POSSTR(AUTHOR,’ ‘)+1,1)
ELSE
SUBSTR(UPPER(AUTHOR),LOCATE(’.', AUTHOR)+2,1)
END AS INITIAL,
EDITOR, AUTHOR_NUM
FROM DBLPAUTHORS

Obviously, this could be greatly improved. Approximately 1100 extracted letters were not among the letters A-Z. In addition, for researchers with a full middle name in DBLP, that middle name is “mistaken” for the last name. Other sources of error are persons with multiple middle initials and authors with only an initial as the first part of a name.

Next, I calculated the number of distinct publications per initial and the number of distinct persons for normalization:

SELECT DISTINCT AUTHOR, INITIAL FROM
( … view … )
ORDER BY AUTHOR

SELECT INITIAL , COUNT(*) AS ANZ
FROM
(SELECT DISTINCT AUTHOR, INITIAL FROM
(… view …)
)
GROUP BY INITIAL
ORDER BY INITIAL

The results can be seen here:

fairauthorship11

Finally, I calculated the average number of publications per person and letter:

fairauthorship2

In conclusion, we can clearly reject the hypothesis. In fact, having a last name starting with Z is apparently of great advantage.

Future work must certainly include a better parsing of the name attributes. If someone has a version of DBLP with names separated into individual fields I would gladly include those. In addition, measuring success as the number of publications is certainly not satisfactory. Counting citations might be a good (but more difficult to correctly implement) alternative.

Geschäftswolken

Ubuntu Enterprise-Wolke

More on Ubuntu Enterprise Cloud

Twitter and Blogs are so 1935!

Being a computer scientist in the 90s/00s but not a pale coke-sucking all-night gamer evokes doubts about your passion for the subject, being a computer scientist these days and neither blogging nor twittering evokes serious doubts about whether you have ever used a computer except an abacus at all. Now that I stumbled across the following article I’m somewhat relieved that despite these lines are my first public blog post I’m not the only laggard.

Twitter and Blogs are so 1935!

Twitter 1935

Google AppEngine

It somehow seems natural to use Google App Engine (GAE) for applications built with Google Web Toolkit (GWT). I was looking forward deploying the app there but first many questions arose and then, after a short roundtrip to some related blogs, it became obvious that GAE is not the right place at this time.
If you plan to deploy there sometime you should have a look at GAE’s datamodel from the very beginning.
The sandbox of GAE has important limitations and as no network connection except for HTTP(S) is possible also external databases cannot be contacted. GAE’s data store might be powerfull and be used by many Google products storing Petabytes of information but the underlying design is very much different from standard RDBMS.
Especially this blog post gives a good overview and very many related links. Some quotes:

It might look almost look like a sql db when you squint, but it’s
optimized for a totally different goal. If you think that each
different entity you retrieve could be retrieving a different disk
block from a different machine in the cluster, then suddenly things
start to make sense. avg() over a column in a sql server makes sense,
because the disk accesses are pulling blocks in a row from the same
disk (hopefully), or even better, all from the same ram on the one
computer. With DataStore, which is built on top of BigTable, which is
built on top of GFS, there ain’t no such promise. Each entity in
DataStore is quite possibly a different file in gfs.
(…)
Yes, this means that everything that we think we know about building
web applications is suddenly wrong. (…)

«What makes good Information Design?»

nice graphic and comments on “Interesting, Easy, Beautiful, True?”

good_infodesign

Integrated News

I just stumbled upon a small Berlin based company that is going live with their product: a personalized newspaper. News are integrated from several sources, including newspapers (e.g., Tagesspiegel, Bild, New York Times), and web sources (possibly pretty much everything that has an RSS feed). So far, it’s unclear to me if they integrate the news articles or whole pages. However, it’s a very interesting project. I signed up to be among the first who will receive their personalized newspaper starting November, 16th, in the Berlin area only. Their website: www.niiu.de

A data quality fairy tale

Nice little video on data quality: http://www.youtube.com/watch?v=TbzQvswrOTw

Insights are not terribly deep, but cute.

Calendar Fusion

A couple of days ago, I stumbled across a new application of data fusion: calendar fusion.

The web application at http://www.fusecal.com/ does just this: given an url, it automatically extracts calendar data from websites and creates an ical calendar, and additionally lets you combine multiple calenders into one single calendar. The integrated, fused, calendar can then be published or included in your personal organizer or favorite socal network. The application is ideal to keep track of events publish across multiple websites without checking them all, just by adding the integrated calendar into your personal organizer. I tried it by creating a calendar of four major open air cinema websites from Berlin. It works… well, yeah, it works. It’s not perfect, extracting calendar data seems not to be the easiest thing on earth. Or maybe they just read the wrong papers. The fusion part however works, but is not really difficult, as duplicate detection is totally excluded and contradicting events are plainly ignored. Some obvious improvements exist, but the system is still beta. So lets wait for the real deal.

However, it’s a cool thing, easy to use and saves you some time. See for yourself, the Berlin open air cinema calendar

Update 8/30/2009: unfortunately fusecal went offline. Seems that they couldn’t find a way to make money with it.

Cool movies about the information age

Did you know: http://www.youtube.com/watch?v=UIDLIwlzkgY

Information (r)evolution:http://www.youtube.com/watch?v=-4CV05HyAbM

The machine is Us/ing us: http://www.youtube.com/watch?v=NLlGopyXT_g

Next entries »