So here is a diagram of the plan I had in mind to take over the world and catalog all of the extensions on the web:

Thank you, Dia, for letting me express my thoughts in boxes and stick figures. Here is a quick breakdown of some of the components:
- A URL list is simply a list of URLs that are known to contain extensions, for example source repositories such as AMO and mozdev.
- Google API is for more scattered addons, such as those on blogs and personal sites.
- Manual entries for addons not hosted on webpages. These are usually commercial addons such as McAfee.
- Site-specific and generic refer to the rules that the crawler must obey. For example, a generic crawler would crawl a personal site such as example.com, while a site-specific policy would handle sites such as AMO, where experimental addons require a login.
- Crawler is a web crawler. I have been having difficulty finding the best tool for the job.
- Parser parses .xpi files. We should also save the HTML files to extract contextual information wherever possible.
- Site-specific persistent storage is just a database for each site we visit. This may have to be rethought, but I want some sort of redundancy plan to keep files saved even if something horrendous happens to a central database, especially when dealing with beta software and unfamiliar technology such as web crawlers.
- Comparer compares what is stored with a central database. Addons are updated all the time, so we want the most up-to-date versions available.
- View is used by the website to provide information for the user.
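As a rough sketch of what the Parser step could look like: an .xpi is a zip archive, and the interesting metadata lives in its install.rdf. The snippet below assumes Python's standard library only. The function names `read_install_manifest` and `_field` are made up for illustration, and the choice of fields (id, version, name) is an assumption about what the catalog would want to keep.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by install.rdf manifests.
EM_NS = "http://www.mozilla.org/2004/em-rdf#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def _field(desc, name):
    """Read an em: property written either as a child element or as an attribute."""
    tag = "{%s}%s" % (EM_NS, name)
    child = desc.find(tag)  # direct children only, so nested targetApplication ids are skipped
    if child is not None and child.text:
        return child.text.strip()
    return desc.get(tag)  # attribute form, or None when the property is absent


def read_install_manifest(xpi_bytes):
    """Return id, version and name from the install.rdf inside an .xpi archive."""
    with zipfile.ZipFile(io.BytesIO(xpi_bytes)) as archive:
        root = ET.fromstring(archive.read("install.rdf"))
    for desc in root.iter("{%s}Description" % RDF_NS):
        # The about attribute shows up both with and without the RDF prefix.
        about = desc.get("{%s}about" % RDF_NS) or desc.get("about")
        if about == "urn:mozilla:install-manifest":
            return {field: _field(desc, field) for field in ("id", "version", "name")}
    raise ValueError("no install manifest found in install.rdf")
```

Taking bytes rather than a file path means the crawler could hand over a download straight from memory, and the saved HTML page could be stored next to the returned dictionary for context.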
There are still some quirks which have to be figured out:
- Version bumping on AMO doesn’t change the actual install.rdf in the xpi file. Instead, Firefox does some update magic to fix that. I either need to work with said magic, or leave it alone (I don’t think it is a big deal, but it should be noted).
- JSpider is a Java spider that I have had my eye on. Yeah, it’s Java, but many other crawlers are too. Many other crawlers both crawl and index, and I need different functionality (I need a flexible crawler; forget the indexer). Unfortunately, JSpider doesn’t support POST data or web form authentication, which means I’m going to have to fix that if I want to use it.
- Google’s Search API TOS doesn’t seem to be spider friendly. I may have to try out other web search engines.
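One possible way to live with the version-bumping quirk, purely as a sketch: since the version inside install.rdf may lag behind what the site lists, the comparison step could look at a hash of the downloaded file plus the version shown on the listing page, instead of trusting install.rdf alone. The `needs_update` function and the shape of the stored record (a dict with `sha256` and `listed_version` keys) are assumptions for illustration, not how the project actually stores things.

```python
import hashlib


def needs_update(stored, crawled_bytes, listed_version):
    """Decide whether a freshly crawled .xpi should replace the stored record.

    stored is a hypothetical dict with "sha256" and "listed_version" keys,
    or None when the addon has not been seen before.
    Returns (changed, digest) so the caller can save the new digest.
    """
    digest = hashlib.sha256(crawled_bytes).hexdigest()
    if stored is None:
        return True, digest
    # Either a different file or a different listed version counts as an update.
    changed = digest != stored["sha256"] or listed_version != stored["listed_version"]
    return changed, digest
```

This sidesteps comparing version strings at all, which avoids having to reproduce Firefox's own version ordering rules.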
On a brighter note, I put up the sources of my project on the web, and even a nice place to play in. It’s a bit slow, but I’m probably in “this isn’t what you should use SQLite for” territory.

