A web crawler is a hard-working bot that gathers information from, or indexes, pages on the Internet. It starts from a set of seed URLs, finds every hyperlink on each page it fetches, and then visits those hyperlinks recursively.
1. Choose an Ideal Programming Language
Python or Ruby is probably a wise choice: the main speed limit of a web crawler is network latency, not CPU, so choosing Python or Ruby as the development language will make your life easier. Python provides some very useful standard libraries, such as urllib, httplib, and re, which can handle a lot of the work.
Python also has plenty of valuable third-party libraries worth a try:
scrapy, a web scraping framework.
urllib3, a Python HTTP library with thread-safe connection pooling and file post support.
greenlet, a lightweight concurrent programming framework.
twisted, an event-driven networking engine.
2. Reading Some Simple Open-source Projects
You need to figure out exactly how a crawler works.
Here is a very simple crawler written in Python, in about 10 lines of code.

import re
import urllib.request

crawled_urls = set()

def crawl(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    for new_url in re.findall(r'''href=["']([^"']+)["']''', html):
        if new_url not in crawled_urls:
            print(new_url)
            crawled_urls.add(new_url)

if __name__ == "__main__":
    crawl('http://www.yahoo.com/')
A crawler usually needs to keep track of which URLs still need to be crawled and which URLs have already been crawled (to avoid infinite loops).
3. Choosing the Right Data Structure
Choosing a proper data structure will make your crawler efficient. A queue or stack is a good choice for storing the URLs that still need to be crawled, while a hash table or red-black tree is well suited to tracking the crawled URLs, since both provide fast lookup.
Search time complexity: hash table O(1), red-black tree O(log n)
But what if your crawler needs to deal with tons of URLs and your memory is not enough? Try storing a checksum of each URL string instead of the full URL; if that is still not enough, you may need a cache-eviction algorithm (such as LRU) to dump some URLs to disk.
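The bookkeeping above can be sketched as a frontier queue plus a set of URL checksums (a minimal sketch; the example.com URLs are placeholders, and MD5 is just one possible checksum):

```python
import hashlib
from collections import deque

frontier = deque()   # URLs waiting to be crawled
seen = set()         # checksums of URLs already scheduled

def checksum(url):
    # A fixed-size 16-byte MD5 digest instead of the full URL string
    # bounds the memory cost per entry.
    return hashlib.md5(url.encode("utf-8")).digest()

def schedule(url):
    h = checksum(url)
    if h not in seen:      # O(1) average lookup in the hash set
        seen.add(h)
        frontier.append(url)

schedule("http://example.com/")
schedule("http://example.com/a")
schedule("http://example.com/a")   # duplicate, ignored
print(len(frontier), len(seen))    # 2 2
```

Checksum collisions are theoretically possible but rare enough to be an acceptable trade-off for most crawls.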
4. Multithreading and Asynchronous
If you are crawling sites on different servers, using multithreading or an asynchronous mechanism will save you a lot of time.
Remember to keep your crawler thread-safe: you need a thread-safe queue to share the results and a thread controller to manage the threads.
Asynchronous I/O is an event-based mechanism: your crawler enters an event loop, and when an event fires (some resource becomes available), the crawler wakes up to deal with it, usually by executing a callback function. Asynchronous I/O can improve both the throughput and the latency of your crawler.
How to write a multi-threaded webcrawler, Andreas Hess
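The thread-safe queue approach can be sketched as follows (a minimal sketch: the PAGES link graph and fetch() below are dummy stand-ins for real HTTP requests):

```python
import queue
import threading

PAGES = {  # hypothetical link graph: page -> links found on it
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": [],
}

def fetch(url):
    # Stand-in for an HTTP request + link extraction.
    return PAGES.get(url, [])

work = queue.Queue()          # thread-safe frontier shared by all workers
seen_lock = threading.Lock()  # protects the shared 'seen' set
seen = set()

def worker():
    while True:
        url = work.get()
        if url is None:       # sentinel: shut this worker down
            work.task_done()
            break
        for link in fetch(url):
            with seen_lock:
                if link in seen:
                    continue
                seen.add(link)
            work.put(link)
        work.task_done()

seen.add("/")
work.put("/")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
work.join()                   # blocks until every queued URL is processed
for _ in threads:
    work.put(None)            # release the workers
for t in threads:
    t.join()

print(sorted(seen))           # ['/', '/a', '/b', '/c']
```

queue.Queue handles the locking for the frontier itself; the extra lock is only needed for the shared visited set, and the None sentinels act as a simple thread controller for shutdown.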
5. HTTP Persistent Connections
Every time you send an HTTP request, you open a TCP socket connection, and when the request finishes, the socket is closed. When you crawl many pages on the same server, you open and close sockets over and over, and this overhead becomes a real problem.
Send the Connection: keep-alive header in your HTTP requests to tell the server that your client supports persistent connections. Your code should also be modified accordingly to reuse the connection.
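A minimal sketch of connection reuse with the standard library, assuming a server that honors HTTP/1.1 keep-alive (here a throwaway local test server, so the example is self-contained):

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # HTTP/1.1 keeps connections open by default
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):   # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One HTTPConnection object keeps the underlying TCP socket open
# between requests instead of reconnecting for each one.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
bodies = []
for path in ("/a", "/b", "/c"):
    conn.request("GET", path, headers={"Connection": "keep-alive"})
    resp = conn.getresponse()
    bodies.append(resp.read())      # read fully before reusing the socket
conn.close()
server.shutdown()
print(bodies)   # three responses over a single TCP connection
```

The same idea applies to third-party clients: urllib3's connection pooling, for example, reuses sockets per host automatically.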
6. Efficient Regular Expressions
You should really understand how regular expressions work; a good regex makes a real difference in performance.
When your web crawler parses the HTTP response, the same regex executes over and over. Compiling a regex takes a little extra time up front, but it runs faster afterwards. Note that if you are using Python (or .NET), the runtime automatically compiles and caches regexes, but it may still be worthwhile to compile manually: you can give the compiled regex a descriptive name, which makes your code more readable.
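For example, the link-extraction pattern can be compiled once and bound to a descriptive name (a sketch; the pattern is the same simplified href matcher used earlier):

```python
import re

# Compiled once at module load; reused on every page in the crawl loop.
HREF_RE = re.compile(r'''href=["']([^"']+)["']''')

html = '<a href="/a">a</a> <a href="/b">b</a>'
links = HREF_RE.findall(html)
print(links)   # ['/a', '/b']
```

The name HREF_RE also documents intent at each call site, which a bare inline pattern string does not.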
If you want parsing to be even faster, you probably need to write a parser yourself.
Mastering Regular Expressions, Third Edition by Jeffrey Friedl.
Performance of Greedy vs. Lazy Regex Quantifiers, Steven Levithan
Optimizing regular expressions in Java, Cristian Mocanu