How to Pull Things Off Static Web Pages

An introduction to DOM scraping

Why Learn about HTML Website Scraping?

One of these days, you'll be stuck with only finished HTML as your data source.

When is web scraping the answer?

  • No API for passing data
  • Site is completely static, no database
  • You aren't really supposed to be doing this

Use your powers for good

What do you need to scrape a webpage?

The HTML source code

A DOM parser in your chosen language

Server environment to run your code

Patience and persistence to get it working

This can be a fiddly process, so today we'll step through working code and cover the basics.

I want you to fear no DOM!

DOM: Document Object Model.

So let's do it

Open your browser & text editor of choice

Let's scrape a webpage and turn the data into JSON

The DOM of an HTML Page

  • Is a behind-the-scenes thing
  • Like the "master record" of the page
  • Allows us to access & change the page after it is rendered
  • A good reason to write valid HTML!

A bit like a skeleton.
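
To make "access & change the page" concrete, here is a minimal sketch using PHP's built-in DOMDocument class (not the Simple HTML DOM library used later in this deck). The markup string is made up purely for illustration.

```php
<?php
// Hypothetical markup, used only for illustration
$markup = '<html><body><h1>Old title</h1><p>Hello</p></body></html>';

// Parse the string into an in-memory DOM tree
$doc = new DOMDocument();
$doc->loadHTML($markup);

// Access: fetch the first <h1> node from the tree
$heading = $doc->getElementsByTagName('h1')->item(0);
$before  = $heading->textContent;

// Change: overwrite the node's text inside the tree
$heading->nodeValue = 'New title';
$after = $doc->getElementsByTagName('h1')->item(0)->textContent;

echo $before.' -> '.$after.PHP_EOL;
```

The original string is untouched; only the parsed tree changes, which is why the DOM works as the "master record" of the page.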

We'll start with a source HTML file that is a list of companies

Grab it: github.com/sprise/parse-static

We've got:

  • Site skin with header, main nav
  • A list of states
  • A list of many companies per state
  • Contact information for each company

This page SHOULD look intimidatingly long
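
The structure listed above might look roughly like the sketch below. The class names (one-state, col-md-6, rep, tel, addr) are inferred from the selectors used in the code slides that follow; the company names and contact details are invented, and the real file is far longer. The check at the end uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, only so the sketch runs without the library.

```php
<?php
// Hypothetical sample of the page structure (all data is made up)
$sample = <<<HTML
<div class="one-state">
  <h2>Ohio</h2>
  <div class="col-md-6">
    <h3>Acme Supply</h3>
    <strong class="rep">Jane Doe</strong>
    <span class="tel">555-0100</span>
    <span class="addr">1 Main St</span>
  </div>
  <div class="col-md-6">
    <h3>Buckeye Parts</h3>
    <strong class="rep">John Roe</strong>
    <span class="tel">555-0101</span>
    <span class="addr">2 Oak Ave</span>
  </div>
</div>
<div class="one-state">
  <h2>Texas</h2>
  <div class="col-md-6">
    <h3>Lone Star Tools</h3>
    <strong class="rep">Sam Poe</strong>
    <span class="tel">555-0102</span>
    <span class="addr">3 Elm Rd</span>
  </div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// One div per state, and one nested div per company
$states          = $xpath->query('//div[@class="one-state"]');
$companies_found = $xpath->query('//div[@class="one-state"]/div[@class="col-md-6"]');

echo $states->length.' states, '.$companies_found->length.' companies'.PHP_EOL;
```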

Meet Your DOM Parser

Come with me to the Simple HTML DOM documentation:

http://simplehtmldom.sourceforge.net/

We are using this parser because of its docs.

Create a new Simple HTML DOM object


<?php
require_once('./simple_html_dom.php');

// Load our static HTML file into a Simple HTML DOM object
$html = file_get_html('./distributors.html');

// Start an array to put all our companies in
$companies = array();

Let's work with Simple HTML DOM Parser


<?php
require_once('./simple_html_dom.php');

// Load our static HTML file into a Simple HTML DOM object
$html = file_get_html('./distributors.html');

// Start an array to put all our companies in
$companies = array();

// Try this first and see how it works
print_r($html->find('div[class=one-state]', 0)->innertext);

Grab those .one-state divs and loop through them


// Start an array to put all our companies in
$companies = array();

// Go through each state
foreach($html->find('div[class=one-state]') as $row) {
  $state_name = $row->find('h2', 0)->innertext;
	
  // Do this just for fun and testing
  echo $state_name.'<br>';
}

Now loop over the .col-md-6 divs


// Go through each state
foreach($html->find('div[class=one-state]') as $row) {
  $state_name = $row->find('h2', 0)->innertext;
		
  // Loop over the .col-md-6 divs 
  foreach($row->find('div[class=col-md-6]') as $comp) {
		
    // Place an array of the company data into an array of states 
    // (will create $state_name array if needed)
    $companies[$state_name][] = array('a company');
  }
}

echo '<pre>'; 
print_r($companies);
echo '</pre>'; 
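
The "will create $state_name array if needed" behavior can be seen in isolation. This is a minimal sketch with made-up state and company pairs standing in for the scraped rows; no parser is involved.

```php
<?php
$companies = array();

// Hypothetical (state, company) pairs standing in for scraped data
$rows = array(
    array('Ohio', 'Acme Supply'),
    array('Texas', 'Lone Star Tools'),
    array('Ohio', 'Buckeye Parts'),
);

foreach ($rows as $pair) {
    list($state_name, $title) = $pair;

    // [] appends to the state's list; PHP creates the
    // $state_name key the first time it is used
    $companies[$state_name][] = array('title' => $title);
}

print_r($companies);
```

Both Ohio companies end up under the same key, without any check for whether the key already exists.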

Populate the company data



// Go through each state
foreach($html->find('div[class=one-state]') as $row) {
  $state_name = $row->find('h2', 0)->innertext;
		
  // Loop over the .col-md-6 divs 
  foreach($row->find('div[class=col-md-6]') as $comp) {
		
    // Place an array of the company data into an array of states 
    // (will create $state_name array if needed)
    $companies[$state_name][] = array(
      // Passing index 0 as the second argument makes find() return
      // a single element instead of an array of elements
      'title'    => $comp->find('h3', 0)->innertext,
      'rep_name' => $comp->find('strong[class="rep"]', 0)->innertext,
      'tel'      => $comp->find('span[class="tel"]', 0)->innertext,
      'address'  => $comp->find('span[class="addr"]', 0)->innertext,
    );
  }
}
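
Real pages are rarely uniform. When find() is given an index and nothing matches, it returns null, and reading ->innertext on null makes PHP emit a notice or warning. One possible guard is a small helper like the one below. The name inner_or_default is hypothetical, not part of Simple HTML DOM, and the stdClass object is only a stand-in so the sketch runs without the library.

```php
<?php
// Hypothetical helper: return the trimmed inner text of an element,
// or a default value when the element was not found (null)
function inner_or_default($node, string $default = ''): string
{
    if ($node === null) {
        return $default;
    }
    return trim((string) $node->innertext);
}

// Stand-in for a found element, so this runs without the parser
$found = new stdClass();
$found->innertext = '  555-0100 ';

$present = inner_or_default($found, 'n/a'); // element exists
$missing = inner_or_default(null, 'n/a');   // element missing

echo $present.PHP_EOL;
echo $missing.PHP_EOL;
```

In the loop, this would be used as 'tel' => inner_or_default($comp->find('span[class="tel"]', 0)).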

How many companies were there?


// Count up the total
$total = 0;
foreach($companies as $row) $total += count($row);

// Show the stats
echo count($companies).' states were recorded with '.$total.' records.<br>';

Aren't you glad PHP handled that for us?
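
The counting loop can also be written with PHP's built-in array functions. This is an alternative sketch, not a replacement for the loop above, and the $companies data here is made up.

```php
<?php
// Hypothetical data in the same shape the scraper builds
$companies = array(
    'Ohio'  => array(array('title' => 'Acme Supply'), array('title' => 'Buckeye Parts')),
    'Texas' => array(array('title' => 'Lone Star Tools')),
);

// count() each state's list, then add the per-state counts together
$total = array_sum(array_map('count', $companies));

echo count($companies).' states were recorded with '.$total.' records.'.PHP_EOL;
```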

Turn this into JSON


// Show the output as JSON with stats on top
echo count($companies).' states were recorded with '.$total.' records.<br>';
echo '<pre>';
echo json_encode($companies, JSON_PRETTY_PRINT); // optional flag for human display
echo '</pre>';

JSON is an easy format to use, store, and share.
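
As a rough illustration of "use, store, and share", the sketch below writes the JSON to a file and reads it back into PHP arrays. The sample data and the temporary file path are assumptions made for the example.

```php
<?php
// Hypothetical data in the same shape the scraper builds
$companies = array(
    'Ohio' => array(
        array('title' => 'Acme Supply', 'tel' => '555-0100'),
    ),
);

$json = json_encode($companies, JSON_PRETTY_PRINT);

// Store: write the JSON to a file (a temp file here)
$path = tempnam(sys_get_temp_dir(), 'companies_');
file_put_contents($path, $json);

// Use: read it back; true asks for associative arrays instead of objects
$restored = json_decode(file_get_contents($path), true);
unlink($path);

var_dump($restored === $companies);
```

Passing true as the second argument to json_decode matters here: without it the states come back as object properties rather than array keys.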

Test the whole thing in your browser.

Did you get the JSON object full of 500+ companies?

Now get scraping!

You will really be able to save someone's bacon one day with this skill.

Bonus Round: distributors.php

You didn't think I wrote that all out by hand, did you?

What Next?

  • Never do data entry again
  • Learn about other DOM parsers, each with its own strengths
  • Use these slides for your own presentation