WINTERN 2020: ZOMBIE LINK CRAWLER

by Justin Mai
May 28, 2021

This winter, I had the opportunity to intern at Margin Research. As a computer science student with little to no experience in security, I was nervous even applying. Despite that, I really wanted to dip my toes into security to see what it was like. I wanted a stretch opportunity that would force me to grow, and even if I failed miserably, I would still benefit from the experience. The internship seemed like the perfect opportunity, albeit somewhat outside my comfort zone. Having been accepted, I was excited to start. The whole experience was structured so that every intern had their choice from a few projects. Erring on the side of caution, I chose to work on the zombie link crawler, something I had some understanding of. This project entailed creating a crawler that finds dead links, which we then verify to determine whether they point to expired domains.

The project is split into three main sections:

  1. Make a crawler with decent traversal
  2. Categorize links that do not resolve from the crawled pages
  3. Determine if the domains are no longer registered and may be purchased

As with anything new, there are many questions to be answered and even more left to be discovered. Starting with the very first point: how can I make a crawler with decent traversal? Even this was too broad a question for me at the time, so I split it further into “how can I create a crawler?” and “how can I speed it up?”

To answer the first question, I set out to learn about Beautiful Soup, a Python library for pulling data out of HTML. The one issue I ran into was processing the different types of links. Links are hard to pull out of web pages: a bare regex is prone to false positives, links can be embedded in <a> tags, appear as href parameters, and so on, and some contain the absolute path to the resource while others are relative to the current page. Needless to say, parsing HTML is pretty difficult. I eventually resolved this and boiled it down to an if-else block with many nested blocks inside. Now that I had a working crawler, how could I speed it up? At this point, the crawler just seemed to never stop running. I left it on for over ninety minutes on a crawling sandbox and it still kept chugging away. After some research, I decided to switch to Scrapy, a Python crawling framework. The switch drastically improved the runtime by processing multiple links concurrently and by deduplicating links, which removed the need to re-crawl the same pages.
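
The normalization itself mostly comes down to urljoin. Here is a minimal sketch of the idea with Beautiful Soup (not the project’s exact if-else block, and the URL is just a placeholder):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def extract_links(page_url):
    """Pull every href off a page and normalize it to an absolute URL."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for tag in soup.find_all("a", href=True):
        href = tag["href"].strip()
        # Skip fragments and javascript: pseudo-links
        if href.startswith("#") or href.lower().startswith("javascript:"):
            continue
        # urljoin resolves relative paths against the current page
        # and leaves absolute URLs untouched
        absolute = urljoin(page_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            links.add(absolute)
    return links


print(extract_links("https://example.com/"))
```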

Running the crawler on the same sandbox that previously took over ninety minutes now finished in about thirty seconds. This is a massive improvement and allows for testing significantly more pages. At this point, I decided to be more ambitious and test the crawler on live sites.

Right off the bat, everything went wrong. The crawler seemed to run forever, and after checking the output, I realized I had been banned from GitHub and blocked by DoS protection (I was receiving HTTP 429 errors).

My first approach was to implement a feature to block certain domains or patterns (GitHub, Google, login pages, etc.). From there, I also set up a list of allowed domains: anything that falls outside the allowed set is automatically rejected. Together, these are useful when we want to avoid certain sites (such as Google, GitHub, or login pages) or restrict the crawl to specific ones, and they give the user explicit control over their crawling. Another way I found to avoid having requests blocked is to lower the number of concurrent requests and introduce a delay between them.
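
In Scrapy, those knobs might look something like this sketch (the domains, patterns, and numbers are illustrative, not the project’s actual configuration):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor


class FilteredSpider(scrapy.Spider):
    name = "filtered"
    start_urls = ["https://example.com/"]
    # Anything outside this set is rejected by Scrapy's offsite filtering
    allowed_domains = ["example.com"]
    # Slow down and spread out requests so DoS protection is less likely to trip
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
    }
    # Patterns we never want to follow, even on allowed domains
    link_extractor = LinkExtractor(deny=(r"github\.com", r"google\.com", r"/login"))

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)
```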

While I wasn’t being blocked by GitHub anymore, I now found the crawler to be too slow. It was slow because it was crawling every site connected to the starting site. This begs the question: do I even need to crawl GitHub? The use case of the crawler is to find dead links on a single site, not necessarily to crawl every connected site. Taking advantage of this, I tracked the current depth, increasing it by one for each link followed, and settled on a maximum depth of one. Local links (links on the same domain as the starting site) reset the depth to zero each time they were crawled. Now all local links would be crawled, while foreign links would each be fetched once (non-recursively) and logged. The runtime dropped from over half an hour to about ninety seconds.
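
A rough sketch of that depth logic in a Scrapy callback (the starting URL and attribute names are illustrative):

```python
from urllib.parse import urlparse

import scrapy


class DepthLimitedSpider(scrapy.Spider):
    name = "depth_limited"
    start_urls = ["https://example.com/"]
    start_domain = urlparse(start_urls[0]).netloc
    max_depth = 1  # foreign links are fetched once but not followed further

    def parse(self, response):
        depth = response.meta.get("link_depth", 0)
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            # Local links reset the depth; foreign links inherit depth + 1
            local = urlparse(url).netloc == self.start_domain
            next_depth = 0 if local else depth + 1
            if next_depth <= self.max_depth:
                yield scrapy.Request(
                    url, callback=self.parse, meta={"link_depth": next_depth}
                )
```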

To further improve the crawler, I wanted to work on its ability to evade detection. I added user-agent switching, a minor evasive technique to avoid getting blocked. User-agent switching works by changing the “identity” our browser presents to the site. Of course, this is not perfect, as there are other ways to determine whether a client is lying about its user-agent. Coupled with hard-to-detect proxies, it would be fantastic for most situations; however, without a list of unbanned IP addresses to try, I moved on. The crawler now seemed solid except for its inability to get past DoS protection without drastically slowing down.
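
In Scrapy, user-agent switching can be done with a small downloader middleware along these lines (the user-agent strings and the middleware name are just examples):

```python
import random

# A small pool of common desktop user-agent strings (illustrative values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a fresh User-Agent for every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let the request continue through the middleware chain
```

Enabling it is then a matter of adding the middleware’s dotted path to the DOWNLOADER_MIDDLEWARES setting.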

Moving on to the next requirement, I have to find “dead links.” Dead links are links that do not resolve to what we expect: any link that returns anything other than a perfect result (status code 200) is a dead link. There are two cases. In the first, opening the link succeeds; we then check the status code and keep anything that isn’t a 200. In the second, opening the link fails; we check the error thrown and screen for common errors like timeout errors, connection refused errors, and DNS lookup errors. If the error doesn’t fall into those categories, we save the link along with its error message.

When an error occurs, errback_httpbin() is called. It first filters out common errors like Timeout errors, DNS lookup errors, and Connection refused errors.

A snippet of how dead links are logged:
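
(The snippet below is a minimal sketch rather than the project’s exact code: it assumes a Scrapy spider where parse() records non-200 responses and errback_httpbin() handles failed requests, and the file and helper names are illustrative.)

```python
import csv
from urllib.parse import urlparse

import scrapy
from twisted.internet.error import (
    ConnectionRefusedError,
    DNSLookupError,
    TCPTimedOutError,
    TimeoutError,
)


class DeadLinkLoggerSpider(scrapy.Spider):
    name = "deadlink_logger"
    start_urls = ["https://example.com/"]
    # Let non-200 responses reach parse() instead of being filtered out
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.errback_httpbin)

    def parse(self, response):
        # Case 1: the request succeeded but the status code isn't a perfect 200
        if response.status != 200:
            self.log_dead_link(response.request, str(response.status))
        # ...link extraction and follow-up requests would go here...

    def errback_httpbin(self, failure):
        # Case 2: the request itself failed; bucket the common failure modes
        if failure.check(DNSLookupError):
            status = "DNS lookup error"
        elif failure.check(TimeoutError, TCPTimedOutError):
            status = "timeout"
        elif failure.check(ConnectionRefusedError):
            status = "connection refused"
        else:
            status = repr(failure.value)  # anything else keeps its error message
        self.log_dead_link(failure.request, status)

    def log_dead_link(self, request, status):
        referrer = request.headers.get("Referer", b"").decode()
        with open("dead_links.csv", "a", newline="") as f:
            csv.writer(f).writerow(
                [urlparse(request.url).netloc, referrer, request.url, status]
            )
```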

The output is stored in a CSV file with four fields: domain, referrer, response, and status. The domain is the domain of the dead link. The referrer is the page on which the link being crawled was found. The response is the link itself. And lastly, the status is either the status code returned when opening the link or the error generated if the request failed.

The final requirement is to check whether the domains of the dead links have expired. This is not as simple as I first thought it would be. How can I verify a domain? Well, I can do a DNS lookup to pull information about it. But is that sufficient to declare a domain expired? Will I get banned for doing too many lookups? I approached the problem by starting with a simple program that checks the domains one by one, with a one-second pause in between to limit the rate and avoid being banned. Each lookup either succeeds or fails, and after manually reviewing the domains that passed, I split the results into several groups.

If a lookup fails, the domain is quite likely expired (though it may simply be that not every domain can be found through a DNS lookup). Even if it passes, the domain may still be expired, so I first check the name field (e.g., yahoo.com): if it is null, the domain is more likely to be expired. If the lookup has a valid name field, I then open the link to check it out. A status code of 200 means the link is most likely not dead; it may still be dead if a domain registrar is reserving the domain and keeping it in use, which usually comes with a message from the registrar about purchasing the domain. If there is no such message, I log the link for further manual analysis. 4xx status codes are generally also fine; they just mean the site found our request a little off or flagged it as coming from a bot, or a registrar is holding the domain without serving the specific resource requested. If opening the link fails, I note it down: it most probably means the domain is not expired but also not in use, with a registrar reserving it. Another useful thing to log is the status of the domain, which can contain handy information such as “pending for deletion.” All in all, there are a few layers to this process.
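
Here is a minimal sketch of that layered check, assuming the python-whois package for the lookup (the name and status fields described above line up with WHOIS-style data, so that is the assumption here) and requests for opening the link; the categories and domains below are illustrative:

```python
import time

import requests
import whois  # the python-whois package; an assumption, not necessarily what the project used


def classify_domain(domain):
    """Rough, layered guess at whether a dead link's domain has expired."""
    try:
        record = whois.whois(domain)
    except Exception as exc:
        # The lookup failed outright: quite likely expired, but not guaranteed
        return "lookup failed", str(exc)

    if not record.domain_name:
        # No name field at all is a strong hint the domain is expired
        return "likely expired", record.status

    try:
        response = requests.get(f"http://{domain}", timeout=10)
    except requests.RequestException as exc:
        # Registered but unreachable: probably not expired, just not in use
        return "registered but unreachable", str(exc)

    if response.status_code == 200:
        # Could still be a registrar parking page; flag for manual review
        return "resolves (check for a registrar page)", record.status
    # 4xx responses usually just mean the site pushed back on our request
    return f"resolves with status {response.status_code}", record.status


if __name__ == "__main__":
    for d in ["example.com", "definitely-not-a-registered-domain-1234.net"]:
        print(d, classify_domain(d))
        time.sleep(1)  # one-second pause between lookups to limit the rate
```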

Last and probably least comes tying up loose ends. I added a README, a dependencies list, a configuration file for easy management, and a script to put everything together. That about wraps it up. There is definitely still a ton that could be added and improved upon, and I’m excited to hear others’ feedback on the project. One area for improvement could be adding a proxy to make evasion better. Another would be extending the user-agent switching to rotate other headers as well: the user-agent is what the browser claims to be, while the other headers carry additional information, such as the page that referred you to the site. Furthermore, the crawler could return additional information. Maybe you want the registrar of the domain? You can easily grab that from the DNS lookup.

There are many avenues to build upon the program as it is. Overall, I enjoyed the process immensely. It was a fun pastime to fill the holidays. I gained a more in-depth understanding of URLs, domains, and many aspects of crawling. It was a great stretch opportunity, and I definitely recommend it for anyone looking to experience what cybersecurity is like.

