The web is the world’s largest distributed system. It holds opportunities for significant storage and cost savings if we can reduce content duplication, yet current understanding of the web’s deduplication potential is limited. The web also poses risks for users, particularly the potential to direct them to unused domains for further exploitation. In this project, we conduct web-scale analysis to better understand these opportunities and challenges, using the Common Crawl dataset, which contains petabytes of crawl data covering the majority of the web since 2008. The animation above shows how the structure of the Internet has evolved over more than 20 years.
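As a small illustration of how the Common Crawl corpus can be queried, the sketch below uses the public Common Crawl CDX index API to list captures of a given URL. The collection name, query parameters, and result handling are illustrative assumptions; analyses at the scale described in this project instead process the raw WARC/WAT archives on high-performance computing infrastructure.

```python
import json
import urllib.parse
import urllib.request

# Illustrative index collection name (one exists per crawl); a full analysis
# would iterate over many collections rather than query a single one.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def list_captures(url: str, limit: int = 5) -> list[dict]:
    """Return up to `limit` index records describing captures of `url`."""
    query = urllib.parse.urlencode({"url": url, "output": "json", "limit": limit})
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{query}") as resp:
        # The index API returns one JSON object per line.
        return [json.loads(line) for line in resp.read().decode().splitlines()]

if __name__ == "__main__":
    for record in list_captures("example.com/"):
        print(record.get("timestamp"), record.get("url"), record.get("status"))
```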
Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains
Web users often follow hyperlinks hastily, expecting them to be correctly programmed. However, those links may contain typos or other mistakes. By discovering active but erroneous hyperlinks, a malicious actor can spoof a website or service, impersonating the expected content and phishing for private information. In typosquatting, misspellings of common domains are registered to exploit errors when users mistype a web address. Yet no prior research has examined situations where the linking errors of web publishers (i.e., developers and content contributors) propagate to users. We hypothesize that these hijackable hyperlinks exist in large quantities and have the potential to generate substantial traffic. Analyzing large-scale crawls of the web using high-performance computing, we show that the web currently contains active links to more than 572,000 dot-com domains that have never been registered, which we term phantom domains. After registering 51 of these, we observe 88% of phantom domains exceeding the traffic of a control domain, with up to 10 times more visits. Our analysis [1] shows that these links exist due to 17 common publisher error modes, with the phantom domains they point to free for anyone to purchase and exploit for under $20, representing a low barrier to entry for potential attackers.
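The kind of check underlying this study can be approximated with a short script: extract outbound hyperlinks from a page and flag linked dot-com hostnames that do not resolve in DNS. This is a minimal sketch under simplifying assumptions; a non-resolving hostname is only a candidate phantom domain, and confirming that a domain has never been registered requires registry data (e.g. WHOIS/RDAP) and the historical analysis described in the paper.

```python
import re
import socket
import urllib.parse
import urllib.request

# Simple extraction of absolute href targets; a production pipeline would use
# a proper HTML parser and handle relative links.
HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)

def outbound_hosts(page_url: str) -> set[str]:
    """Collect the hostnames of absolute links found on a page."""
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode(errors="replace")
    hosts = (urllib.parse.urlparse(link).hostname for link in HREF_RE.findall(html))
    return {h for h in hosts if h}

def candidate_phantoms(page_url: str) -> list[str]:
    """Flag linked .com hostnames that currently fail DNS resolution."""
    candidates = []
    for host in outbound_hosts(page_url):
        if not host.endswith(".com"):
            continue
        try:
            socket.gethostbyname(host)
        except socket.gaierror:
            # A resolution failure is only a hint: the domain may be
            # registered but unconfigured, so registry data is still needed.
            candidates.append(host)
    return candidates

if __name__ == "__main__":
    print(candidate_phantoms("https://example.com/"))
```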
A Universal Deduplication Architecture for Secure and Efficient Cloud Storage
Users now produce data at a rate that exceeds their ability to securely store and manage it all, prompting them to entrust their private files to Cloud Storage Providers (CSPs). These companies discreetly inspect users’ files to undertake deduplication, storing only a single instance of files that are redundant across their user base. By undertaking deduplication in this way, the CSP acquires low-cost storage at the expense of user privacy. This paper proposes universal deduplication, an alternative approach that shifts the advantage of deduplication from the CSP to the users while ensuring semantic security of the users’ transmitted data. Universal deduplication leverages indications of the trustworthiness of data availability on the Internet, paired with a format that automatically combines client-side deduplication and end-to-end encryption. By referencing data that is publicly available on the Internet, user files can be privately deduplicated without the need to transmit sensitive user data, while simultaneously reducing storage and encryption costs. We have developed an architecture for the implementation of universal deduplication, along with a preliminary investigation into the feasibility of the proposed concepts.
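The client-side decision at the heart of this idea can be sketched as follows: hash each file, consult an index of content known to be publicly and reliably available on the Internet, and either store a compact reference to that public copy or fall back to end-to-end encryption before upload. The index structure, reference format, and use of Fernet encryption below are illustrative assumptions, not the concrete design of the proposed architecture.

```python
import hashlib
from dataclasses import dataclass

# Third-party: pip install cryptography (used here only as a stand-in for the
# end-to-end encryption layer; the paper's own format is not reproduced).
from cryptography.fernet import Fernet

# Hypothetical index mapping SHA-256 digests of public web content to stable
# URLs judged trustworthy enough to deduplicate against.
PUBLIC_INDEX: dict[str, str] = {}

@dataclass
class StoredObject:
    kind: str       # "public-ref" or "encrypted"
    payload: bytes  # URL reference or ciphertext

def store(data: bytes, key: bytes) -> StoredObject:
    """Deduplicate against public content where possible, else encrypt."""
    digest = hashlib.sha256(data).hexdigest()
    public_url = PUBLIC_INDEX.get(digest)
    if public_url is not None:
        # The content already exists publicly: keep only a reference, so no
        # user data (plaintext or ciphertext) needs to be uploaded or stored.
        return StoredObject("public-ref", public_url.encode())
    # Private or unknown data is end-to-end encrypted before upload.
    return StoredObject("encrypted", Fernet(key).encrypt(data))

if __name__ == "__main__":
    key = Fernet.generate_key()
    obj = store(b"some private document", key)
    print(obj.kind, len(obj.payload))
```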
Project Team
Kevin Saric
Dr. Gowri Ramachandran
Dr. Surya Nepal (CSIRO’s Data61)
Prof. Raja Jurdak