Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks

Project Description

Email communication is an integral part of everyday life. For business emails in particular, extracting and analysing communication networks can reveal patterns in processes and decision making within a company. Fraud detection is another application area where precise detection of communication networks is essential. In our ECIR 2018 paper we present an approach based on recurrent neural networks to untangle email threads that arise from forwarding and replying. We further classify parts of emails into two or five zones to capture not only header and body information but also greetings and signatures.
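One way to picture the two granularities is as per-line labels, where the five-zone scheme refines the two-zone scheme. The sketch below is an illustration only: the zone names (in particular `signoff` as the fifth zone), the rule that everything outside the header collapses to `body`, and the helper names `to_two_zones` and `segments` are assumptions, and the labels are written by hand rather than predicted by a model.

```python
from itertools import groupby

# Assumed zone inventory. The text above names header, body, greeting and
# signature; "signoff" is a hypothetical fifth zone used for illustration.
FIVE_TO_TWO = {
    "header": "header",
    "greeting": "body",
    "body": "body",
    "signoff": "body",
    "signature": "body",
}


def to_two_zones(labels):
    """Collapse five-zone labels into the coarser header/body scheme."""
    return [FIVE_TO_TWO[label] for label in labels]


def segments(lines, labels):
    """Group consecutive lines that share a label into (zone, lines) pairs."""
    result = []
    for zone, group in groupby(zip(labels, lines), key=lambda pair: pair[0]):
        result.append((zone, [line for _, line in group]))
    return result


# Hand-labelled toy email; a trained model would predict these labels.
lines = [
    "From: Alice Example",
    "Subject: Meeting",
    "Hi Bob,",
    "can we meet at 10?",
    "Best,",
    "Alice",
]
five = ["header", "header", "greeting", "body", "signoff", "signature"]
two = to_two_zones(five)

five_segments = segments(lines, five)
two_segments = segments(lines, two)
```

With five zones the greeting and signature can be separated from the written content; with two zones they stay part of the body segment.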

QuaggaLib uses the model presented in our ECIR paper. The library parses the raw email body into separate blocks and extracts metadata from inline headers. We recommend this kind of pre-processing for any application that works with email data: the library provides the actual written text content as well as the metadata that would otherwise be hidden in the unstructured email body.
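To make the idea of inline headers concrete, here is a minimal sketch that splits a raw body into blocks at forward or reply markers and pulls fields such as `From`, `Sent` and `Subject` out of each block. It uses hand-written regular expressions instead of the recurrent model, and it is not QuaggaLib's API: the function names `split_blocks` and `parse_inline_header` and both patterns are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

# Assumed separator pattern; real emails use many more variants.
SEPARATOR = re.compile(
    r"^\s*-+\s*(Original Message|Forwarded by .*?)\s*-+\s*$",
    re.IGNORECASE,
)
# Assumed set of inline header fields.
HEADER_FIELD = re.compile(
    r"^(From|To|Cc|Sent|Date|Subject):\s*(.*)$", re.IGNORECASE
)


def split_blocks(body: str) -> List[List[str]]:
    """Split a raw email body into blocks at separator lines."""
    blocks: List[List[str]] = [[]]
    for line in body.splitlines():
        if SEPARATOR.match(line):
            blocks.append([])  # a quoted message starts here
        else:
            blocks[-1].append(line)
    return blocks


def parse_inline_header(block: List[str]) -> Tuple[Dict[str, str], str]:
    """Return (metadata, text) for one block.

    Leading 'Field: value' lines are read as an inline header;
    everything from the first non-header line onwards is the text.
    """
    meta: Dict[str, str] = {}
    i = 0
    while i < len(block) and not block[i].strip():
        i += 1  # skip leading blank lines
    while i < len(block):
        match = HEADER_FIELD.match(block[i].strip())
        if not match:
            break
        meta[match.group(1).lower()] = match.group(2).strip()
        i += 1
    return meta, "\n".join(block[i:]).strip()


raw = """Sounds good, see you then.

-----Original Message-----
From: Alice Example
Sent: Monday, May 14, 2001 9:03 AM
Subject: Meeting

Can we meet at 10?
"""
parsed = [parse_inline_header(block) for block in split_blocks(raw)]
```

Here `parsed` holds one `(metadata, text)` pair per block: the newest reply has no inline header, while the quoted message carries its sender, date and subject. Fixed patterns like these break on unusual client formats, which is the kind of variation a learned model is meant to handle.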

Reference

If you use our data or find this work related to yours, please cite us as...

  1. Repke, T., Krestel, R.: Extraction and Representation of Financial Entities from Text. In: Consoli, S., Reforgiato Recupero, D., and Saisana, M. (eds.) Data Science for Economics and Finance. pp. 241–263. Springer, Cham (2021).
  2. Schwanhold, R., Repke, T., Krestel, R.: Modeling the Evolution of Word Senses with Force-Directed Layouts of Co-occurrence Networks. Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change (LChange@ACL 2021). pp. 1–6 (2021).
  3. Risch, J., Repke, T., Kohlmeyer, L., Krestel, R.: ComEx: Comment Exploration on Online News Platforms. Joint Proceedings of the ACM IUI 2021 Workshops co-located with the 26th ACM Conference on Intelligent User Interfaces (IUI). pp. 1–7. CEUR-WS.org (2021).
  4. Repke, T., Krestel, R.: Visualising Large Document Collections by Jointly Modeling Text and Network Structure. Proceedings of the Joint Conference on Digital Libraries (JCDL). (2020).
  5. Repke, T., Krestel, R.: Bringing Back Structure to Free Text Email Conversations with Recurrent Neural Networks. 40th European Conference on Information Retrieval (ECIR 2018). Springer, Grenoble, France (2018).

Implementations

Datasets

On this page we provide datasets used in our ECIR 2018 paper and a fully parsed Enron corpus. Data was manually annotated using our Enno tool.

  • newly collected ASF email corpus, annotated with email zones only
  • selection of the Enron corpus, annotated with email zones only
  • selection of the Enron corpus, detailed annotation (including names, aliases, metadata)
  • automatically split, normalised, and cleaned Enron corpus as graph

Apache Software Foundation Emails (ASF)

Annotated Enron Emails

Fully Parsed Enron Graph

Related Work

  • Original code for Jangada (Carvalho, 2004)
  • More information and data for Jangada (600+ annotated emails in the 20 Newsgroups dataset)
  • MinorThird Library used by Jangada
  • 400 annotated emails by Lampert et al. (Enron data)
  • Zebra System for email zoning
  • Another implementation of Zebra
  • Talon, a general-purpose library for extracting email structure such as quotations and signatures