Information Systems @ HPI
all about data management
Sensitive research under public scrutiny
On Monday, June 4, 2012, Schufa, Germany’s largest and best-known credit-rating agency, and the Hasso Plattner Institute (HPI), a privately funded CS institute at the University of Potsdam in Germany, jointly announce the inauguration of a research lab, dubbed SchufaLab@HPI. Its goal is to explore both the economic value and the societal risk of web data, including but certainly not limited to data from social networks. The lab is planned for a three-year period with a possible extension for another two, with funding for two full-time researchers, hardware, travel, etc. One day later, on Tuesday, internal documents about the lab, including the formal cooperation agreement, slides from the planning phase, and a list of project ideas, are leaked to a journalist of NDR, a national public radio and TV network. The contract outlines the broad research goals of the lab and an informal list of project ideas: the result of a brainstorming session covering a wide range of technologies and their potential application areas. They include rather harmless avenues, such as sentiment analysis for products, but also ethically more debatable ideas, such as matching customer records to profiles in social networks. Many of the ideas are half-baked, and the document thus explicitly includes a disclaimer stating that their execution depends on the availability of the data and on their legality under the very strict privacy laws in Germany. In the afternoon, a member of Schufa’s executive board and I have a long conversation with the journalist to explain what is in fact planned and to emphasize that all research will clearly be performed within legal and ethical boundaries. We also reiterate the explicit goal of exploring the societal risk, the research nature of the lab, and the importance of publicly demonstrating the capabilities of current IT methods to extract knowledge from web data.
Nevertheless, on Thursday NDR publishes a report that, seeking a scandal, indirectly suggests a dark and secretive ploy by Schufa and HPI to undermine citizens’ privacy through illegal means. On that day and the following, both Schufa and HPI face a storm that is full-blown by any measure, including wide coverage on national TV and radio, reports and commentaries in every German newspaper, etc. Even two federal ministers, the secretary of justice and the secretary of food, agriculture, and consumer protection, issue statements to the press demanding that the project be halted. In fact, one of them, in subsequent correspondence, maintains that using data from social networks for commercial purposes is “unthinkable”, completely ignoring the fact that it is both legal and common practice in Germany and all over the world. Indeed, such commercial use is arguably the main driver of the internet. Finally, the German Ethics Council, which normally advises parliament on matters of bioethics and other medical issues, weighs in against the project. These statements are made without anyone ever contacting us to find out what was in fact planned. Tweets and blogs go wild, and a slew of hate mail against me and the PhD students in my group ensues. Some messages discuss ethical issues of collecting private data, others insult, and still others tell my students that they will never be able to get a job after having been associated with HPI and this project. More creative Twitterers send out tongue-in-cheek tweets mentioning their riches – in the hope of raising their credit rating: “My bank called: The account is full and I should open another one. #twitternfuerdieschufa”
With the help of HPI’s public relations expert and in coordination with Schufa’s PR department I experience three crazy and sleepless days of nonstop interviews for TV and radio, conversations with journalists, phone calls, emails, and strategy meetings.
Is such public outrage justified? What ethical and legal responsibilities do we data(base) researchers have when handling data about individuals?
First, searching for and analyzing publicly available data is legal in Germany and likely in most countries, even for business purposes and even if the data is about individuals. Second, analyzing social network data is common practice in research and industry: of course the social networks themselves make use of their data – without clustering, classification, and further analytics their business models would be moot. There are companies that expressly specialize in social network analysis, for instance to support recruitment specialists. I have since received numerous requests for cooperation from companies pursuing similar goals, though of course without the intent of publishing their techniques. Many large and small software vendors support access to social network data for BI or CRM tasks. Finally, consider a life insurance representative deciding on the rate for a customer: looking up the leisure activities of that customer will certainly influence his decision. In conclusion, commercial use of web data and social network data is happening, regardless of its legality; it will most likely expand; and it will not go away. A few days into the storm, more prudent newspaper articles appear stating as much and, for instance, give advice to consumers on how to configure their Facebook privacy settings to protect themselves against such analysis.
Much research on social network data analysis has been performed and published – some by research groups of the social network companies themselves, more by independent researchers. They have obtained datasets from the networks directly, have crawled the data, or have used publicly available data, such as the details of 100 million Facebook users published as a torrent, the Enron email data set published by court order, the infamous AOL search logs, or the Netflix data published in the context of a competition.
As researchers, we are often given more leeway than commercial enterprises, under the assumption that the public benefits from our results and that the data and methods are treated responsibly. When researching methods to collect data that is considered private, even if it was intentionally or unintentionally made publicly available, it is important to protect any data that was gathered. But it is just as important to make the public aware of the ready availability of such data, and of the capabilities of advanced IT to analyze that data and automatically draw further, implicit conclusions. These conclusions, concerning gender, profession, age, sexual orientation, health, education level, etc., need not be based only on data about the individual, but can also draw on data from his or her social context and on training data solicited from, or gathered about, a broader population. That is, a blogger need not explicitly state that she is vegan; it might be deduced from implicit signals in her texts, from her friends’ statuses, or by matching her blog with her Facebook profile.
In my opinion, it is important to publicly and transparently perform and publish such research, and to not leave this field to private organizations with commercial interests. Research results can serve to educate the public, can provide tools for individuals to monitor their online presence beyond what a search engine result delivers, and can thus level the playing field. It is not enough to fret about the potential abilities of IT without studying them and after careful analysis drawing the right conclusions, such as changes in policy and the establishment of active consumer protection measures.
With the storm still raging strong on Friday and no signs of it abating, HPI decides to terminate the project. We conclude that it has become impossible to perform serious research at that noise level and with the necessary serenity. What is more, my responsibility for my PhD students demands that I protect them from the undue and at times insulting attacks. While some journalists, politicians, and individuals express satisfaction that the project was stopped (still ignoring the fact that data collection and analysis is performed elsewhere anyway, just not in a transparent setting), others are disappointed that we caved to public pressure. A new wave of interviews is necessary to justify the decision, but the storm dies down as fast as it had gathered.
What are the lessons learned? First, it is very difficult to explain (and justify) complex research issues to laymen and journalists. The broad set of research questions around web data was reduced to the catchphrase “Schufa crawls Facebook for credit-rating”.
Second, there is an immense lack of privacy awareness: a surprising number of people are deeply convinced that whatever they write or post on Facebook or other social networks is private, and they have no inkling of privacy settings. A presumably well-educated journalist indignantly drew the analogy of someone entering his home and taking pictures of all his private documents. I replied that a more fitting analogy would be his displaying the documents on the sidewalk in front of his home and indexing them for easy reference.
Third, there is no arguing against a storm. Spending three entire days in interviews, appearing on TV and radio, and drafting press releases and responses to questions made no apparent dent in the negative coverage. Of course, this lesson does not excuse one from defending one’s research.
Fourth, while I do not regret having initiated the project and stand by its goals, the next time I plan a project that might involve private data, I will include more stakeholders, such as experts in sociology, politicians, ethicists, data protection officers, etc., to ensure a legal and ethical procedure and to convince the public that such research is useful and important. Also, one should draft agreements, proposals, and project goals as if they were to be published. This measure protects against PR catastrophes such as the one I experienced, but it also ensures early deliberation on possible ethical problems. While these conclusions seem obvious, I am equally convinced that such measures are rarely taken in research reality.
As of now, research in Germany with or about data from social networks is taboo: the mere mention of the use of Twitter or Facebook data in the press is met with emotions ranging from skepticism to outright shock and rejection. Thus, research and development of such techniques will be left to other countries and to private corporations with no transparency or willingness to publish their results. Or it will be performed by researchers hoping that journalists do not read WWW, SIGIR, or CIKM proceedings…
 Translated from http://twitter.com/JustElex/status/210712374271946752
Bachelor-Project ProCSIA: Column Store Benchmarks
Authors: Philipp Langer, Florian Westphal, Marian Gawron, Andreas Henning, Fabian Tschirschnitz, Patrick Schulze, Gary Yao, Michael Wolowyk
About a week ago, the bachelor project team “ProCSIA” (Profiling Column Stores with IBM’s Information Analyzer) evaluated six mainly column-oriented DBMSs to find out which of them is fastest on a given OLAP workload and on simple profiling operations.
These were mostly open source projects (in the case of Sybase IQ 15.2, a free evaluation copy was used): MonetDB, Vectorwise, InfiniDB, Infobright, MetaKit, and Sybase IQ.
This test was conducted using the TPC-H benchmark (size: 1 GiB) as well as our own simple benchmark of data-profiling-like queries. The second benchmark contains column-based operations, such as calculating a minimum (SELECT MIN(attr) FROM table) or determining a column’s frequency distribution (SELECT attr, COUNT(*) FROM table GROUP BY attr). The TPC-H SQL queries as well as the queries used in our own benchmark are attached at the end of this blog post.
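To make the profiling queries concrete, here is a minimal, self-contained sketch using SQLite (the table and its contents are invented for illustration; the actual benchmark ran these queries against the column stores listed above):

```python
import sqlite3

# Toy in-memory table standing in for one benchmark column ("attr" is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (attr TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), ("a",), ("c",), ("a",)])

# Profiling query 1: the column minimum.
minimum = conn.execute("SELECT MIN(attr) FROM t").fetchone()[0]

# Profiling query 2: the column's frequency distribution.
freq = dict(conn.execute(
    "SELECT attr, COUNT(*) FROM t GROUP BY attr ORDER BY attr").fetchall())

print(minimum)  # a
print(freq)     # {'a': 3, 'b': 1, 'c': 1}
```

Both queries touch only a single column, which is exactly the access pattern column stores are built for.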
Please be aware that the trial versions of the DBMSs listed above are not comprehensive enough to allow solid comparisons of actual performance. However, these trials gave us a hint as to which column stores are best suited for our profiling tasks.
We ran the benchmark on a Dell Optiplex 745 with an Intel Core 2 CPU 6600 @ 2.40 GHz and 4 GB of RAM, of which 3.25 GB could actually be used on a 32-bit operating system.
DISCLAIMER: The bachelor project team performed these tests to the best of their knowledge. However, they are not veteran database performance-tuning engineers. Moreover, they deliberately aimed to find out how fast the systems are in an untuned, out-of-the-box configuration.
We did not execute the TPC-H benchmark on MetaKit as it does not provide an SQL interface. In the figures below you can see how each DBMS performed (for detailed information about the queries, see the end of this blog post). It is interesting to see how different DBMSs perform on this small amount of data. We plan to test the DBMSs with a 100 GiB TPC-H database to find out about scaling issues.
TPC-H overall results:
MonetDB: 14.571 sec | Power@Size: 8853.5
Vectorwise: 15.38 sec | Power@Size: 8252.1
InfiniDB: 108 sec | Power@Size: 890.5
Sybase: 262 sec | Power@Size: 448.4
Infobright: 10.36 h | Power@Size: 230.2
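For context, the Power@Size numbers follow the TPC-H power metric: 3600 · SF divided by the geometric mean of the timings of the 22 queries and the 2 refresh functions. A sketch of that calculation (the timings below are made up for illustration, not our measured values):

```python
import math

def power_at_size(query_secs, refresh_secs, scale_factor=1.0):
    """TPC-H power metric: 3600 * SF divided by the geometric mean
    of the 22 query timings and the 2 refresh-function timings."""
    times = list(query_secs) + list(refresh_secs)
    geo_mean = math.exp(sum(math.log(t) for t in times) / len(times))
    return 3600.0 * scale_factor / geo_mean

# Illustrative timings only: 22 queries and 2 refresh functions at 0.4 s each.
queries = [0.4] * 22
refresh = [0.4] * 2
print(round(power_at_size(queries, refresh), 1))  # 9000.0
```

Because the metric uses a geometric mean, a single very slow query (such as Infobright’s query 21) drags the score down heavily.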
Infobright’s poor performance is due to TPC-H query 21, which took 9 hours to execute. Query 21 contains complex subqueries that demand too much of the MySQL optimizer (Infobright is built on MySQL). This is thus an issue of the MySQL optimizer rather than of Infobright’s own implementation.
The positive impression MonetDB and Vectorwise made in the TPC-H benchmark was confirmed in our own benchmark. Infobright is very good at aggregations, as it builds up a knowledge grid storing, for example, the maximum and minimum of most columns.
We then let the DBMSs calculate the minimum of a varchar column, directly followed by the maximum. This column was not covered by Infobright’s knowledge grid, so calculating the minimum took rather long, but the maximum was then computed very fast thanks to caching. The same phenomenon can be observed with the other databases too (except MetaKit, which works on flat files and does not provide any further optimization).
The last test we conducted was a frequency distribution. For unknown reasons MonetDB took so long to execute this quite simple query that we had to abort the query.
In conclusion, we found that MonetDB, Vectorwise, and Infobright provided the most promising results for our application (all while being open source products). As mentioned above, we are still in the process of evaluating how the DBMSs perform on a 100 GiB TPC-H database.
In computer science, the typical ordering of co-authors of a publication is alphabetical, unless there is a good reason to deviate from this order (for instance, if some authors contributed considerably more than others). However, the actual ordering does become interesting, for instance, for graduating PhD students or professors up for tenure: committees do consider the number of publications in which the candidate is in fact the first author.
The hypothesis that I want to verify (or better reject) is the following:
Researchers with last names that appear earlier in the alphabet have a career advantage.
Or in other words: does alphabetical sorting of authors give an unfair advantage to researchers like Til Aach? Here, I measured career advantage simply by counting publications. The reasoning was that researchers with alphabetically early names are promoted more often, obtain more funds, etc., and are thus more successful in their research. I deliberately ignored other measures, such as counting citations.
I checked this hypothesis using a 2009 DBLP data set provided by Hannah Bast. The trickiest part was to identify the first letter of the last name, because names are stored in full, as in “Philip A. Bernstein”. My (still crude) approach is the following SQL query, which simply checks for the existence of a middle initial: if the author name has such an initial, the first letter after the initial is chosen; if not, the first letter after the first space character is chosen:
SELECT PUBID, AUTHOR,
  CASE WHEN LOCATE(‘.’, AUTHOR) = 0
         OR LOCATE(‘Jr.’, AUTHOR) > 0
         OR LOCATE(‘Sr.’, AUTHOR) > 0
         OR LOCATE(‘.’, AUTHOR) = LENGTH(AUTHOR)
       THEN SUBSTRING(AUTHOR, LOCATE(‘ ’, AUTHOR) + 1, 1)
       ELSE SUBSTRING(AUTHOR, LOCATE(‘.’, AUTHOR) + 2, 1)
  END AS INITIAL
FROM ( … view … )
Obviously, this could be greatly improved. Approximately 1,100 extracted letters were not among the letters A–Z. In addition, for researchers with a full middle name in DBLP, that middle name is “mistaken” for the last name. Other sources of error are persons with multiple middle initials and authors with only an initial as the first part of their name.
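One way the extraction could be improved is to handle suffixes and middle names explicitly; a rough Python sketch of such a heuristic (the function and its rules are my own illustration, not part of the original experiment):

```python
def last_name_initial(author):
    """Heuristically extract the first letter of the last name from a
    DBLP-style full name such as 'Philip A. Bernstein'."""
    parts = author.split()
    # Strip generational suffixes so 'John Smith Jr.' yields 'S', not 'J'.
    while parts and parts[-1].rstrip(".") in {"Jr", "Sr", "II", "III"}:
        parts.pop()
    if not parts:
        return None
    # Taking the last token also handles full middle names correctly,
    # which the pure SQL approach mistakes for last names.
    first_char = parts[-1][0]
    return first_char.upper() if first_char.isalpha() else None

print(last_name_initial("Philip A. Bernstein"))  # B
print(last_name_initial("John Smith Jr."))       # S
```

This still fails on multi-word surnames (“van den Berg”), but it removes the middle-name and suffix errors mentioned above.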
Next, I calculated the number of distinct publications per initial and the number of distinct persons for normalization:
SELECT DISTINCT AUTHOR, INITIAL
FROM ( … view … )
ORDER BY AUTHOR

SELECT INITIAL, COUNT(*) AS ANZ
FROM (SELECT DISTINCT AUTHOR, INITIAL
      FROM ( … view … )) AS PERSONS
GROUP BY INITIAL
ORDER BY INITIAL
The results can be seen here:
Finally, I calculated the average number of publications per person and letter:
In conclusion, we can clearly reject the hypothesis. In fact, having a last name starting with Z is apparently of great advantage.
Future work must certainly include better parsing of the name attribute. If someone has a version of DBLP with names separated into individual fields, I would gladly use it. In addition, measuring success as the number of publications is certainly not satisfactory. Counting citations might be a good (but harder to implement correctly) alternative.
Being a computer scientist in the 90s/00s but not a pale, coke-drinking all-night gamer evoked doubts about your passion for the subject; being a computer scientist these days and neither blogging nor twittering evokes serious doubts about whether you have ever used anything but an abacus. Now that I have stumbled across the following article, I am somewhat relieved that, even though these lines are my first public blog post, I am not the only laggard.
Twitter and Blogs are so 1935!
It somehow seems natural to use Google App Engine (GAE) for applications built with the Google Web Toolkit (GWT). I was looking forward to deploying the app there, but first many questions arose, and then, after a short tour of some related blogs, it became obvious that GAE is not the right place at this time.
If you plan to deploy there some time, you should have a look at GAE’s data model from the very beginning.
The GAE sandbox has important limitations, and since no network connections other than HTTP(S) are possible, external databases cannot be contacted either. GAE’s data store may be powerful and may be used by many Google products storing petabytes of information, but its underlying design is very different from that of a standard RDBMS.
This blog post in particular gives a good overview and many related links. Some quotes:
It might almost look like a sql db when you squint, but it’s optimized for a totally different goal. If you think that each different entity you retrieve could be retrieving a different disk block from a different machine in the cluster, then suddenly things start to make sense. avg() over a column in a sql server makes sense, because the disk accesses are pulling blocks in a row from the same disk (hopefully), or even better, all from the same ram on the one computer. With DataStore, which is built on top of BigTable, which is built on top of GFS, there ain’t no such promise. Each entity in DataStore is quite possibly a different file in gfs.

Yes, this means that everything that we think we know about building web applications is suddenly wrong. (…)
Nice graphic and comments on “Interesting, Easy, Beautiful, True?”
I just stumbled upon a small Berlin-based company that is going live with its product: a personalized newspaper. News items are integrated from several sources, including newspapers (e.g., Tagesspiegel, Bild, New York Times) and web sources (possibly pretty much everything that has an RSS feed). So far, it is unclear to me whether they integrate individual news articles or whole pages. However, it is a very interesting project. I signed up to be among the first to receive their personalized newspaper starting November 16th, in the Berlin area only. Their website: www.niiu.de
Nice little video on data quality: http://www.youtube.com/watch?v=TbzQvswrOTw
Insights are not terribly deep, but cute.
A couple of days ago, I stumbled across a new application of data fusion: calendar fusion.
The web application at http://www.fusecal.com/ does just this: given a URL, it automatically extracts calendar data from websites and creates an iCal calendar, and it additionally lets you combine multiple calendars into one. The integrated, fused calendar can then be published or included in your personal organizer or favorite social network. The application is ideal for keeping track of events published across multiple websites without checking them all, simply by adding the integrated calendar to your personal organizer. I tried it by creating a calendar of four major open-air cinema websites from Berlin. It works… well, yeah, it works. It is not perfect; extracting calendar data seems not to be the easiest thing on earth. Or maybe they just read the wrong papers. The fusion part does work, but it is not really difficult either, as duplicate detection is left out entirely and contradicting events are plainly ignored. Some obvious improvements exist, but the system is still in beta. So let’s wait for the real deal.
Still, it is a cool thing, easy to use, and it saves you some time. See for yourself: the Berlin open-air cinema calendar.
Update 8/30/2009: Unfortunately, fusecal went offline. It seems they could not find a way to make money with it.