Clustering large amount of email with Minhash: an open-source Locality sensitive hash
2023-07-04, 10:10–10:30 (Europe/Paris), Amphitheater

Over the last decades, global connectivity has increased exponentially, and email is one of the key indicators of this connectivity. In 2022, more than 340 billion emails were sent on average each day, an increase of about 5% compared to the previous year. Because the reach of email is so broad, it has increasingly been used in recent years to carry out a wide variety of cybersecurity attacks. On the one hand, targeted attacks such as spear-phishing or Business Email Compromise (BEC) can be disastrous for companies and are responsible for millions of dollars in losses each year. These attacks are usually fine-tuned to deceive the victim, and are thus very hard to detect automatically. Furthermore, they are very rare compared to other types of email attacks (about 1 in 100,000 emails). On the other hand, spam and phishing campaigns are broad attacks that usually target large groups of email addresses. Campaign attacks typically consist of bulk emails that share a similar template and are sent en masse in the hope of hitting just a small fraction of their targets, prioritizing quantity over quality (about 80% of emails sent every day are spam).
For cybersecurity providers such as Vade, one challenge is to detect and block these campaigns as fast as possible. While the emails in a campaign used to be exact copies of one another and thus relatively easy to catch, attackers increasingly add noise and tricks to fool detection algorithms while preserving the visual appearance of the email. As a consequence, interest in the nearest neighbor problem has grown. The nearest neighbor problem (NNP) is an optimization problem that arises in many kinds of data-driven tools. In particular, detecting duplicate or near-duplicate documents is a critical application of the NNP.
A similarity search problem usually involves a large collection of objects, each characterized by a set of features and representable as a point in a high-dimensional attribute space. Given a query document, the task is to find the most similar documents in the database. Answering such queries exactly scales poorly: comparing the query against every stored object becomes infeasible in reasonable time for large, high-dimensional collections.
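
As an illustration of the similarity search described above, here is a minimal sketch of an exhaustive nearest neighbor query. It assumes documents are compared through the Jaccard similarity of their word 3-grams ("shingles"); the shingle size, the function names, and the sample texts are assumptions made for this example and do not come from the talk. Jaccard similarity is used because it is the quantity that MinHash estimates.

```python
def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of word k-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest(query: str, corpus: list[str]) -> tuple[int, float]:
    """Exhaustive search: the query is compared against every document."""
    q = shingles(query)
    scores = [jaccard(q, shingles(doc)) for doc in corpus]
    best = max(range(len(corpus)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical sample data, for illustration only.
corpus = [
    "your invoice is ready please click the link to pay now",
    "team lunch is moved to friday at noon",
    "your invoice is ready please click the link to pay today",
]
best_index, best_score = nearest(
    "your invoice is ready please click this link to pay now", corpus
)
```

Each query costs one comparison per stored document, which is what becomes impractical on a continuous flow of emails and motivates approximate techniques such as locality-sensitive hashing.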

In this presentation, we will present a full pipeline for clustering emails sent in a continuous flow, from the raw email to the clusters, using MinHash, an open-source locality-sensitive hashing algorithm. The presentation will proceed as follows:
- Explain how to extract key data from the email and remove the content added to fool the clustering algorithm.
- Explain normalization through open-source tools such as "". This helps reduce the noise-to-information ratio in the email.
- Present locality-sensitive hashing through the open-source algorithm MinHash, which creates fingerprints that collide for similar emails.
- Present the "Bucketization" technique to cluster the fingerprints.
- Present results on real email data.
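
The steps listed above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the pipeline presented in the talk: the normalization rules, the shingle size, the signature length `NUM_PERM`, the number of bands `BANDS`, and the SHA-1-based hash family are all hypothetical choices, and only the Python standard library is used.

```python
import hashlib
import re
from collections import defaultdict

NUM_PERM = 64  # signature length (assumed value)
BANDS = 16     # number of bands; rows per band = NUM_PERM // BANDS


def normalize(text: str) -> str:
    """Hypothetical normalization: lowercase, mask URLs and digits, drop punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " url ", text)
    text = re.sub(r"\d+", " num ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


def shingles(text: str, k: int = 3) -> set[str]:
    """Word k-grams; always returns a non-empty set."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, token: str) -> int:
    """One member of a seeded hash family (assumption: SHA-1 truncated to 64 bits)."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash(tokens: set[str]) -> tuple[int, ...]:
    """MinHash signature: the minimum hash value of the set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def band_keys(signature: tuple[int, ...]) -> list[tuple]:
    """Split the signature into bands; each band is one bucket key."""
    rows = NUM_PERM // BANDS
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(BANDS)]


def bucketize(emails: list[str]) -> dict[tuple, list[int]]:
    """Map each band key to the indices of the emails that produced it."""
    buckets: dict[tuple, list[int]] = defaultdict(list)
    for idx, email in enumerate(emails):
        signature = minhash(shingles(normalize(email)))
        for key in band_keys(signature):
            buckets[key].append(idx)
    return buckets


def candidate_pairs(buckets: dict[tuple, list[int]]) -> set[tuple[int, int]]:
    """Emails sharing at least one bucket are candidate near-duplicates."""
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


# Hypothetical sample data: emails 0 and 1 differ only in injected noise and
# normalize to the same text, so they share every bucket.
emails = [
    "Your invoice 1234 is ready: http://a.example/x",
    "YOUR invoice 98765 is ready!! https://b.example/y?id=42",
    "Team lunch moved to Friday at noon",
]
pairs = candidate_pairs(bucketize(emails))
```

With `r` rows per band and `b` bands, two emails whose shingle sets have Jaccard similarity `s` share at least one bucket with probability `1 - (1 - s**r)**b`, so the two parameters trade recall against false candidates. For a continuous flow, one option would be to keep the bucket index persistent and look up only the keys of each incoming email.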

See also: Slides

PhD from "Université de Lille", INRIA (French) and MODO (Japanese) Lab, specialized in large-scale optimization assisted by machine learning tools.

Now working at Vade as a research engineer.