Sep 22, 2023
Gandalf: Harnessing DistilBERT Transformer and BiLSTM for Precise Website Content Classification and Blocking
About the Project
The Internet has become an essential part of daily life, putting vast amounts of information at our fingertips. It also poses a significant threat to young web users, who may encounter inappropriate or harmful content. To address this, we propose a deep learning model, the "Gandalf Model", that identifies the type of content on a website. The Gandalf Model combines a DistilBERT Transformer with a BiLSTM to classify websites into ten categories, one of which is NSFW. It was trained on a dataset of over 400,000 entries, curated and labeled for this purpose. In our experiments the model reaches an accuracy of 80.1%, which is better than many traditional models used in this field. By flagging inappropriate or harmful content, the model helps protect young web users and contributes to the development of effective content filtering and blocking systems. Coupled with a proxy server, it also lets system administrators block access to websites in whichever categories they choose.
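To make the proxy-side blocking step concrete, here is a minimal sketch of the decision a proxy could make once a page has been classified. It is an illustration under stated assumptions, not the project's actual code: `make_blocker` and `keyword_stub` are hypothetical names, the stub only stands in for the trained DistilBERT and BiLSTM classifier, and "Education" is an invented label, since NSFW is the only category named above.

```python
from typing import Callable, Iterable

# A classifier maps page text to one category label.
# In the real system this would be the trained model (assumption).
Classifier = Callable[[str], str]


def make_blocker(classify: Classifier, blocked: Iterable[str]) -> Callable[[str], bool]:
    """Build a function that reports whether a page should be blocked.

    `blocked` is the set of categories chosen by the administrator.
    """
    blocked_set = {category.lower() for category in blocked}

    def should_block(page_text: str) -> bool:
        # Block when the predicted category is on the administrator's list.
        return classify(page_text).lower() in blocked_set

    return should_block


def keyword_stub(page_text: str) -> str:
    """Hypothetical stand-in for the model; NOT the Gandalf classifier."""
    # "Education" is an illustrative label, not one confirmed by the source.
    return "NSFW" if "explicit" in page_text.lower() else "Education"


should_block = make_blocker(keyword_stub, blocked=["NSFW"])
print(should_block("Explicit adult material"))   # True
print(should_block("Intro to linear algebra"))   # False
```

Keeping the classifier behind a plain callable means the blocking policy can be tested without loading any model weights, and the administrator's category list can change without touching the classifier.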