Coding an Armory Crawler in PHP – basic HOWTO

Upon special request, here are a few notes on how I built my crawler. What I’m using:
A stock XAMPP package (for Windows in my case), containing:

  • PHP 5: I wanted 5 simply because it contains SimpleXML, which makes parsing easy and straightforward
  • MySQL 5: that one was in the package, I don’t think v4 would have made any difference
  • Initially nothing else.

In my latest recode I finally managed to find out how to activate cURL in PHP.
The biggest difficulty I had was getting the Armory to send me back XML data instead of a formatted web page. From what little I understand, modern browsers are assumed to have all the extensions needed to run the AJAX code locally in order to display the Armory; in that case Blizzard only sends you the page data, and the rendering is done on your own machine (I hope I got that right). On older or unknown browsers, however, the page is rendered on the Armory server and you are sent a formatted HTML page, which isn’t what you want.
To determine the browser, the Armory looks at your User Agent. This can be set either in code or in the php.ini file.
An important note if you’re just starting out: XAMPP (and, I expect, other PHP installations as well) has a php.ini in the \php subdirectory of your web server tools, which you can edit to your heart’s content once you’ve started your server for the first time… without any results. I expect this one is the template used to build the real php.ini, which resides under your \apache directory. This is why I couldn’t get cURL to work for several days.

Told you I’m a noob.
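
For what it's worth, PHP can tell you which php.ini it actually loaded, which would have saved me those days. The snippet below is a minimal sketch, assuming PHP 5.2.4 or later (that is when php_ini_loaded_file() appeared). Run it through the web server rather than from the command line, since the two may read different files. The variable names are mine.

```php
<?php
// Ask PHP which php.ini this process loaded.
// php_ini_loaded_file() returns false when no php.ini was loaded at all.
$iniPath = php_ini_loaded_file();

// True only if the cURL extension is active in this PHP process.
$curlEnabled = extension_loaded('curl');

echo 'Loaded php.ini: ' . ($iniPath === false ? '(none)' : $iniPath) . "\n";
echo 'cURL extension: ' . ($curlEnabled ? 'enabled' : 'not enabled') . "\n";
```

If the path shown is not the file you have been editing, that explains why your changes had no effect.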

Anyway, there are three ways you can “fake” your user agent so that the Armory believes you’re a modern browser:
- In php.ini (the default is set to PHP and the version number)
- In your code, you can use the below:

ini_set('user_agent', '[a modern user agent string]');

- If you’re using cURL, you can pass it in a cURL session with curl_setopt:

$myvar = curl_init();
curl_setopt($myvar, CURLOPT_USERAGENT, '[a modern user agent string]');

As for the user agent string, you could use a recent Firefox one, like this one:
“Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6”. If you want others, there’s a good list here.
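
To check that the in-code setting actually took hold, you can read it back with ini_get(). This is only a sketch: the variable names are mine, and the string is the Firefox example quoted above. I read the value back rather than testing the return value of ini_set(), because that return value is the previous setting and can be false or empty even when the call worked.

```php
<?php
// Example user agent string quoted in the post above.
$userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) '
           . 'Gecko/20070725 Firefox/2.0.0.6';

// Applies to every URL later opened with fopen() or file_get_contents().
ini_set('user_agent', $userAgent);

// Read the setting back to confirm what PHP will send from now on.
$active = ini_get('user_agent');
echo ($active === $userAgent) ? "User agent set\n" : "User agent NOT set\n";
```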

Once you have that, the rest is a matter of browsing through the various Armory pages, which is simply done by pointing to a valid URL.

In my case, the way I’m doing this is as follows:

  • I grab 5 arena ladder pages (which gives me the top 100)
  • Using SimpleXML, I parse out the URL of each individual team
  • I store these in a temporary table
  • In a second step (to avoid script timeouts), I go “browse” these team URLs one by one
  • I fetch the class composition and the URL of each player’s individual character sheet
  • I browse through the character sheet and get the Talent Trees
  • I store team rank, class and build type (not all Trees, just a type classification)
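
The first two steps above can be sketched roughly as follows. This is not my actual crawler code: the sample XML is made up, the 'url' attribute name is an assumption, the function names are hypothetical, and a plain array stands in for the temporary MySQL table so the example runs without a database or a network connection.

```php
<?php
// Step 1: pull the team URL out of every <arenaTeam> on one ladder page.
// The 'url' attribute name is an assumption made for this sketch.
function extractTeamUrls($ladderXml)
{
    $xml = new SimpleXMLElement($ladderXml);
    $urls = array();
    foreach ($xml->arenaTeams->arenaTeam as $team) {
        $urls[] = (string) $team['url'];
    }
    return $urls;
}

// Step 2: take a few URLs off the front of the queue, so that one run
// of the script never lasts long enough to hit the timeout.
function takeBatch(array &$queue, $batchSize)
{
    return array_splice($queue, 0, $batchSize);
}

// Made-up sample data standing in for one downloaded ladder page.
$samplePage = '<page><arenaTeams>'
    . '<arenaTeam name="Team A" url="r=Realm&amp;ts=2&amp;t=Team+A"/>'
    . '<arenaTeam name="Team B" url="r=Realm&amp;ts=2&amp;t=Team+B"/>'
    . '<arenaTeam name="Team C" url="r=Realm&amp;ts=2&amp;t=Team+C"/>'
    . '</arenaTeams></page>';

$queue = extractTeamUrls($samplePage);  // stands in for the temporary table
$firstBatch = takeBatch($queue, 2);     // the URLs to browse in this run
// $queue now holds whatever is left for the next run.
```

Working in batches like this is the whole point of the temporary table: each run only has to get through a handful of pages.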

The rest is done by my still to be improved statistics code.

To “browse” to the various Armory URLs, there are two approaches:
Without cURL:

$myvar = fopen($myURL, 'r');            // $myURL holds the Armory URL
$xmlstr = stream_get_contents($myvar);  // read the whole response as a string
fclose($myvar);

That one takes just three lines but apparently a lot of processing time.
Using cURL:

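$myvar = curl_init();
curl_setopt($myvar, CURLOPT_USERAGENT, '[a modern user agent string]');
curl_setopt($myvar, CURLOPT_URL, $myURL);
// Return the response as a string instead of printing it directly
curl_setopt($myvar, CURLOPT_RETURNTRANSFER, true);
$xmlstr = curl_exec($myvar);
curl_close($myvar);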

Which is more code but quite a bit faster. In both cases, you can then parse $xmlstr with whatever method your PHP release allows for. If you have PHP 5, SimpleXML is the simplest way to do it, since all relevant data is actually contained in the XML attributes:

$xml = new SimpleXMLElement($xmlstr);

This will give you $xml as an object (at least I think so; as I said before, I’m a noob coder), whose attributes can be read simply by writing, for instance,

$myattribute = $xml->arenaTeams->arenaTeam['name'];

Of course, you’ll want to study the XML of the particular Armory data you’re looking for in order to extract whatever you need, but you get the gist of it.
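
One thing to watch: the one-liner above only reads the first arenaTeam element. Here is a small sketch of looping over all of them, with a guard for the case where the Armory sent back something that isn't well-formed XML. The sample data is made up and teamNames() is a hypothetical helper of mine, not part of any library.

```php
<?php
// Read the 'name' attribute of every <arenaTeam> element.
// Returns an empty array if the string is not well-formed XML.
function teamNames($xmlstr)
{
    libxml_use_internal_errors(true);  // collect parse errors quietly
    $xml = simplexml_load_string($xmlstr);
    if ($xml === false) {
        return array();
    }
    $names = array();
    foreach ($xml->arenaTeams->arenaTeam as $team) {
        // Attributes come back as SimpleXMLElement objects; cast to string.
        $names[] = (string) $team['name'];
    }
    return $names;
}

// Made-up sample standing in for a downloaded ladder page.
$sample = '<page><arenaTeams>'
    . '<arenaTeam name="Team A"/><arenaTeam name="Team B"/>'
    . '</arenaTeams></page>';

print_r(teamNames($sample));           // both team names
print_r(teamNames('<html>not xml'));   // empty array: input failed to parse
```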

EDIT: Changed link to the user agent strings listing, as the site appears dead

Tags: Armory, Crawler, HowTo

9 Comments on “Coding an Armory Crawler in PHP – basic HOWTO”

  • dj (3 comments) September 18th, 2007 11:37 am

    It’s an easy way, yes.
    But it requires that your server is compiled with cURL.

    Trawling some forums and php.net, I found a pure PHP way:

    I pasted the code here so it’s easy to see:
    http://www.htmlsidan.se/code/?id=1237

    Then you just parse it with an XML function/parser


  • dj (3 comments) September 18th, 2007 11:40 am

    Oh, you’ll find this in the code:
    $xml = xml2ary($data);

    It’s just me parsing it within the function; I wanted to output an array instead of XML.


  • Gwaendar (204 comments) October 1st, 2007 4:05 pm

    Belated reply due to RL absence, but as I mentioned in my post, using ini_set followed by fopen($armoryurl) and file_get_contents does the trick too. It’s a more brutish approach though.


  • Tiggr (1 comments) November 2nd, 2007 11:07 am

    Hey. Thank you sooooo much for putting this up. I had been struggling with this for the last 2 nights. I was really confused by the fact I saw xml in the source but my page got different results.

    Thank You, Thank you, Thank you :)


  • Geekster (1 comments) November 4th, 2007 7:25 pm

    This was very helpful. To repay you I’ll have to recommend you use PEAR with XML Parser, because it’s pretty ez stuff.

    http://pear.php.net/

    Search on XML and you’ll find the parser. There are examples in there… You will also need to download the PEAR base module (pear.php)… it’s the #1 download on the main page…

    My idea was to collect data and output to a file locally that can be parsed. This works better than trying to parse directly because you control the data in a static form.


  • Gwaendar (204 comments) November 5th, 2007 1:33 pm

    Just to demonstrate my complete lack of culture and understanding, I’ve visited PEAR, read a lot of the manual, and still couldn’t get a clue what the package is good for, and more importantly, how I could leverage it for my own needs.

    It appears to my totally novice eye that it provides different means of doing what I used cURL and SimpleXML for? What did I miss here?


  • mark (1 comments) May 14th, 2008 11:33 am

    Gwaendar, stop saying you lack understanding dude. I have a PHP professor who explains stuff in a way that makes everything seem like a piece of cake, so when I read any manual for 30 minutes and still don’t have a clue what it is all about, I know it isn’t my fault! lol.


  • Master Blogging and Altitis Birthsday | Altitis June 15th, 2008 11:27 am

    [...] Some people are apparently still interested in my clumsy attempts to write my own armory crawler in php. [...]


  • Jacob (1 comments) June 18th, 2009 4:40 pm

    Well cURL is indeed also required for this method, just that you say there is an easier way if you have cURL installed. you have if this is working for you.


World of Warcraft™ and Blizzard Entertainment® are all trademarks or registered trademarks of Blizzard Entertainment in the United States and/or other countries. These terms and all related materials, logos, and images are copyright © Blizzard Entertainment. This site is in no way associated with Blizzard Entertainment®