Project: Blog Gathering with an In-Memory Database

Team: Prof. Dr. Christoph Meinel, Patrick Hennig, Philipp Berger

Research institution: Hasso-Plattner-Institut Potsdam

Abstract: The massive adoption of social media has provided new ways for individuals to express their opinions online. The blogosphere, an inherent part of this trend, contains a vast array of information about a variety of topics. Thus, it is a huge think tank that creates an enormous and ever-changing archive of open source intelligence.

However, it is increasingly difficult - if not impossible - for the average internet user and weblog enthusiast to grasp the blogosphere’s complexity as a whole, because thousands of new weblogs and an almost uncountable number of new posts are added every day. Therefore, mining, analyzing, modeling and presenting this immense data collection is of central interest. Doing so can enable users to detect technical trends, gauge political moods, or receive news articles personalized to their interests.

Within this proposed project we want to develop an intelligent parallel crawler that continuously monitors the blogosphere. The project should cover the following tasks:

  • Testing approaches and frameworks for massively parallel crawling on multi-core machines
  • Implementing tailor-made identifiers, parsers and updaters for weblogs that take the unique structure of blogs into account
  • Testing the crawler with an SAP HANA database for parallel inserts of semi-structured data
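
The first two tasks above could be prototyped roughly as follows. This is only a minimal sketch in Python, not the project's crawler: the feed URLs, the sample RSS content, and the function names parse_feed and crawl_all are hypothetical, and in-memory strings stand in for fetched feeds so that no network access is needed. The database step is left out.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample feeds standing in for downloaded blog feeds.
SAMPLE_FEEDS = {
    "http://blog-a.example/feed": """<rss version="2.0"><channel>
        <title>Blog A</title>
        <item><title>First post</title><link>http://blog-a.example/1</link></item>
        <item><title>Second post</title><link>http://blog-a.example/2</link></item>
    </channel></rss>""",
    "http://blog-b.example/feed": """<rss version="2.0"><channel>
        <title>Blog B</title>
        <item><title>Hello</title><link>http://blog-b.example/hello</link></item>
    </channel></rss>""",
}


def parse_feed(feed_url, xml_text):
    """Extract (blog title, post title, post link) tuples from one RSS 2.0 feed."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        # Not an RSS 2.0 document; a real parser would try other formats here.
        return []
    blog_title = channel.findtext("title", default="")
    return [
        (blog_title,
         item.findtext("title", default=""),
         item.findtext("link", default=""))
        for item in channel.findall("item")
    ]


def crawl_all(feeds, max_workers=4):
    """Parse all feeds concurrently and return the posts as one sorted list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Collect the results before the pool shuts down.
        results = list(pool.map(lambda kv: parse_feed(*kv), feeds.items()))
    posts = [post for feed_posts in results for post in feed_posts]
    return sorted(posts)


if __name__ == "__main__":
    for post in crawl_all(SAMPLE_FEEDS):
        print(post)
```

A thread pool is used here because fetching feeds is mostly I/O-bound. For CPU-heavy parsing on multi-core machines, a process pool such as concurrent.futures.ProcessPoolExecutor may be the better fit in Python.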

Last modified on Apr 16, 2013 3:24:25 PM