Softcore software development
It's all about the cycles

An overly-complex diabolical plan


So here is a diagram of the plan I had in mind to take over the world and catalog all of the extensions on the web:

[Diagram of the plan]

Thank you, Dia, for letting me express my thoughts in boxes and stick figures. Here is a quick breakdown of some of the components:

  1. A URL list is simply a list of URLs that are known to contain extensions, for example source repositories such as AMO and mozdev.
  2. The Google API finds more scattered addons, such as those on blogs and personal sites.
  3. Manual entries cover addons not hosted on web pages. These are usually commercial addons, such as McAfee's.
  4. Site-specific and generic refer to the rules that the crawler must obey. For example, a generic policy would cover a personal site such as example.com, while a site-specific policy would handle sites such as AMO, where experimental addons require a login.
  5. Crawler is a web crawler. I have been having difficulty finding the best tool for the job.
  6. Parser parses .xpi files. We should also save the HTML files to extract contextual information wherever possible.
  7. Site-specific persistent storage is just a database for each site we visit. This may have to be rethought, but I want some sort of redundancy plan to keep files saved even if something horrendous happens to the central database, especially when dealing with beta software and unfamiliar technology such as web crawlers.
  8. Compare compares what is stored with the central database. Addons are updated all the time, so we want the most up-to-date versions available.
  9. View is used by the website to provide information to the user.
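To make the parsing and comparing steps (items 6 and 8) a bit more concrete, here is a minimal Python sketch. It relies on an .xpi being a ZIP archive with an install.rdf manifest at its root. The function names, the `catalog` dictionary standing in for the central database, and the plain string comparison of versions are all assumptions for illustration, not the project's actual code.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by install.rdf manifests.
EM_NS = "http://www.mozilla.org/2004/em-rdf#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def read_manifest(xpi):
    """Return (addon_id, version) from the install.rdf inside an .xpi.

    `xpi` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(xpi) as archive:
        root = ET.fromstring(archive.read("install.rdf"))
    for desc in root.iter("{%s}Description" % RDF_NS):
        # Properties may be written as child elements or as attributes.
        addon_id = desc.findtext("{%s}id" % EM_NS) or desc.get("{%s}id" % EM_NS)
        version = desc.findtext("{%s}version" % EM_NS) or desc.get("{%s}version" % EM_NS)
        # targetApplication entries have an id but no version, so they are skipped.
        if addon_id and version:
            return addon_id.strip(), version.strip()
    raise ValueError("install.rdf has no id/version pair")


def needs_update(catalog, addon_id, version):
    """Hypothetical compare step: True if the catalog lacks this exact version.

    `catalog` is a plain dict of addon id to version string (an assumption).
    """
    return catalog.get(addon_id) != version
```

A crawler would call `read_manifest` on each downloaded file and store the addon only when `needs_update` says the catalog is out of date. Comparing version strings for equality is a simplification; it detects a change but does not say which version is newer.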

There are still some quirks which have to be figured out:

  • Version bumping on AMO doesn’t change the actual install.rdf in the .xpi file. Instead, Firefox does some update magic to fix that. I either need to work with said magic or leave it alone. (I don’t think it is a big deal, but it should be noted.)
  • JSpider is a Java spider that I have had my eye on. Yeah, it’s Java, but so are many other crawlers. Many other crawlers both crawl and index, and I need different functionality (a flexible crawler; forget the indexer). Unfortunately, JSpider doesn’t support POST data or web form authentication, which means I’m going to have to fix that if I want to use it.
  • Google’s Search API TOS doesn’t seem to be spider-friendly. I may have to try out other web search engines.
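For a sense of what POST data and web form authentication involve, here is a rough Python sketch using only the standard library. It is not JSpider code. The host name, login URL, form field names and credentials are all made-up placeholders, and the policy table is only one possible way to separate site-specific rules from generic crawling.

```python
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urlencode, urlsplit

# Hypothetical site-specific rules; every value here is a placeholder.
SITE_POLICIES = {
    "addons.example.org": {
        "login_url": "https://addons.example.org/login",
        "fields": {"username": "crawler", "password": "secret"},
    },
}


def policy_for(url):
    """Return the site-specific policy for a URL, or None for generic crawling."""
    return SITE_POLICIES.get(urlsplit(url).hostname)


def build_login_request(policy):
    """Build the request that submits the login form.

    Passing `data` makes urllib send the request as a POST.
    """
    body = urlencode(policy["fields"]).encode("ascii")
    return urllib.request.Request(policy["login_url"], data=body)


def open_session():
    """Return an opener that keeps cookies, so a login survives later requests."""
    jar = CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch(url, opener):
    """Fetch a page, logging in first when the site has a policy.

    For brevity this logs in on every call; a real crawler would log in
    once per site and reuse the session.
    """
    policy = policy_for(url)
    if policy is not None:
        opener.open(build_login_request(policy)).close()
    with opener.open(url) as response:
        return response.read()
```

Sites without an entry in the table fall through to plain fetching, which matches the generic case. Real login forms often add hidden fields or tokens, so this sketch would need adjusting for each site.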

On a brighter note, I put the sources of my project up on the web, along with a nice place to play in. It’s a bit slow, but I’m probably in “this isn’t what you should use sqlite for” territory.


June 5th, 2008 |

Tags: intern, seneca, wildon


One Response to “An overly-complex diabolical plan”

  1. Dave
    June 5th, 2008 at 19:45

    Have you talked to Bob Clary? You know that he has a fancy spider that he runs from 3 boxes on hera?

    Dave

