2024-07-05, 09:00–09:20 (Europe/Paris), Amphitheater
Over the past decades, phishing websites have proliferated into a significant cybersecurity threat that calls for greater attention and research. These deceptive sites, designed to mimic legitimate ones, aim to trick unsuspecting users into divulging sensitive information such as usernames, passwords, and financial details. Understanding the mechanics and prevalence of these malicious sites is crucial for developing effective countermeasures and safeguarding users' online security. Supervised machine learning models have become the standard for phishing detection, giving security systems predictive capabilities. These models rely heavily on annotated data for training, evaluation, and ongoing maintenance. There is therefore a need to gather such annotated data efficiently in order to improve phishing detection methods.
In this talk, we introduce WikiPhish, a novel, renewable, and open-access dataset for phishing website classification. WikiPhish consists of 110,606 webpages sourced from URLs drawn from Wikipedia's references and from the well-known phishing databases OpenPhish and PhishTank. The dataset is designed to address the challenges of phishing detection by leveraging Wikipedia's contribution verification and wide-ranging content. This allows phishing detection models to be developed on a strong baseline that can evolve over time.
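As a rough illustration of how URLs from two kinds of sources can be combined into a labeled set, here is a minimal Python sketch. It is not the WikiPhish pipeline: the function names `normalize` and `build_labeled_urls`, the normalization rule, the label encoding (0 for legitimate, 1 for phishing), and the decision to drop URLs that appear in both sources are all assumptions made for this example.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase the host and drop the fragment (hypothetical normalization rule)."""
    parts = urlsplit(url.strip())
    return parts._replace(netloc=parts.netloc.lower(), fragment="").geturl()


def build_labeled_urls(legitimate, phishing):
    """Return sorted (url, label) pairs: 0 = legitimate, 1 = phishing.

    URLs found in both inputs are treated as ambiguous and discarded;
    this is an assumption of the sketch, not a documented WikiPhish rule.
    """
    legit = {normalize(u) for u in legitimate}
    phish = {normalize(u) for u in phishing}
    conflicts = legit & phish
    rows = [(u, 0) for u in sorted(legit - conflicts)]
    rows += [(u, 1) for u in sorted(phish - conflicts)]
    return rows
```

In this sketch, `legitimate` would hold URLs such as those extracted from Wikipedia references and `phishing` would hold URLs from feeds such as OpenPhish or PhishTank. Deduplicating by normalized URL keeps near-identical entries from inflating either class.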
This presentation introduces WikiPhish, an innovative, renewable, and openly accessible dataset tailored for phishing website classification. It is organized as follows:
- Explain what phishing is and the shortcomings of current phishing datasets.
- Explain how we leveraged open-source data, using Wikipedia as the main source of legitimate URLs for the dataset.
- Present WikiPhish and its characteristics compared to existing phishing datasets.
- Present results obtained by training models on this dataset.
PhD Student at Hornet Security