Abstract:
The problem of discovering Internet topology and cataloging its contents has existed since the Internet became a global network. The same trend continued with the deep web and now the dark web. Learning the Dark Web's topology and monitoring its data flows is a focal point of attention for government representatives and law enforcement agencies. On the other side are individuals and organizations seeking anonymity to execute their agendas; those actors do whatever they can to make the Dark Web as tangled as possible. Unlike the surface Internet, the Dark Web relies on encrypted networks such as Tor to keep its users anonymous. This paper surveys the major challenges faced when crawling the Dark Web, focusing on technical, security, and ethical perspectives. Compared with crawling the surface Internet, crawling the Dark Web introduces new layers of technical complexity: crawler authors must account for anonymity measures, unpredictable or malicious content, and even sophisticated crawler traps. Operating in such an environment requires mimicking regular user traffic and establishing trust in networks designed to hide information and deceive trespassers. Depending on their jurisdiction, researchers may also need to respect privacy concerns, follow legal regulations, and consider the ethical implications of the data they crawl. The goal of this paper is to review the existing literature and analyze current solutions, providing an overview of the main obstacles and state-of-the-art ways of overcoming them. The paper also explores future research directions to further enhance the ability to combat Dark Web threats.