Why Learn about HTML Website Scraping?
Use your powers for good
The HTML source code
A DOM parser in your chosen language
Server environment to run your code
Patience and persistence to get it working
DOM: Document Object Model.
A bit like a skeleton.
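To make the skeleton idea concrete, here is a minimal sketch that prints the element tree of a tiny HTML fragment. It uses PHP's built-in DOMDocument rather than the Simple HTML DOM parser used in the rest of this walkthrough. The sample markup and the helper name outline_dom are made up for illustration.

```php
<?php
// Hypothetical sample markup, loosely modeled on a state/company listing
$sample = '<div class="one-state"><h2>Ohio</h2>'
    . '<div class="col-md-6"><h3>Acme Co</h3></div></div>';

$doc = new DOMDocument();
$doc->loadHTML($sample);

// Hypothetical helper: returns one indented line per element, depth first
function outline_dom(DOMNode $node, int $depth = 0): array {
    $lines = array();
    if ($node instanceof DOMElement) {
        $lines[] = str_repeat('  ', $depth) . $node->nodeName;
        foreach ($node->childNodes as $child) {
            $lines = array_merge($lines, outline_dom($child, $depth + 1));
        }
    }
    return $lines;
}

// Start at the first div so the html/body wrappers added by loadHTML are skipped
$root = $doc->getElementsByTagName('div')->item(0);
$outline = outline_dom($root);
echo implode("\n", $outline), "\n";
```

Each level of indentation in the output is one level of nesting in the document, which is the "skeleton" a parser lets you walk.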
Grab it: github.com/sprise/parse-static
We'll start with a source HTML file that is a list of companies
This page SHOULD look intimidatingly long
Come with me to the Simple HTML Dom documentation:
http://simplehtmldom.sourceforge.net/
We are using this parser because of its docs.
<?php
require_once('./simple_html_dom.php');
// Load our static HTML file and parse it into a DOM object
$html = file_get_html('./distributors.html');
// Start an array to put all our companies in
$companies = array();
// Try this first and see how it works
print_r($html->find('div[class=one-state]', 0)->innertext);
// Start an array to put all our companies in
$companies = array();
// Go through each state
foreach($html->find('div[class=one-state]') as $row) {
    $state_name = $row->find('h2', 0)->innertext;
    // Do this just for fun and testing
    echo $state_name.'<br>';
}
// Go through each state
foreach($html->find('div[class=one-state]') as $row) {
    $state_name = $row->find('h2', 0)->innertext;
    // Loop over the .col-md-6 divs
    foreach($row->find('div[class=col-md-6]') as $comp) {
        // Place an array of the company data into an array of states
        // (will create $state_name array if needed)
        $companies[$state_name][] = array('a company');
    }
}
echo '<pre>';
print_r($companies);
echo '</pre>';
// Go through each state
foreach($html->find('div[class=one-state]') as $row) {
    $state_name = $row->find('h2', 0)->innertext;
    // Loop over the .col-md-6 divs
    foreach($row->find('div[class=col-md-6]') as $comp) {
        // Place an array of the company data into an array of states
        // (will create $state_name array if needed)
        $companies[$state_name][] = array(
            // The second argument to find() is an index: 0 returns the
            // first matching element instead of an array of matches
            'title' => $comp->find('h3', 0)->innertext,
            'rep_name' => $comp->find('strong[class="rep"]', 0)->innertext,
            'tel' => $comp->find('span[class="tel"]', 0)->innertext,
            'address' => $comp->find('span[class="addr"]', 0)->innertext,
        );
    }
}
// Count up the total
$total = 0;
foreach($companies as $row) $total += count($row);
// Show the output as JSON with stats on top
echo count($companies).' states were recorded with '.$total.' records.<br>';
Aren't you glad PHP handled that for us?
echo '<pre>';
echo json_encode($companies, JSON_PRETTY_PRINT); // optional flag for human display
echo '</pre>';
JSON is an easy format to use, store, and share.
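As a rough sketch of the "store and share" part, the snippet below writes the JSON to a file and reads it back. The one-record $companies array and the temp-file path are stand-ins invented for this example; in the real script you would use the array built from distributors.html and a path of your choosing.

```php
<?php
// Hypothetical stand-in for the $companies array built by the scraper
$companies = array(
    'Ohio' => array(
        array(
            'title' => 'Acme Co',
            'rep_name' => 'Pat Doe',
            'tel' => '555-0100',
            'address' => '1 Main St',
        ),
    ),
);

// Hypothetical output path; a temp file keeps the sketch self-contained
$path = sys_get_temp_dir().'/companies.json';

$json = json_encode($companies, JSON_PRETTY_PRINT);
if ($json === false) {
    // json_encode returns false on failure; report why and stop
    exit('Encoding failed: '.json_last_error_msg());
}
file_put_contents($path, $json);

// Passing true as the second argument returns associative arrays
// instead of stdClass objects
$restored = json_decode(file_get_contents($path), true);
```

Decoding with the second argument set to true gives back the same nested arrays, so the saved file can be handed to another PHP script without any custom parsing.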
Did you get the JSON object full of 500+ companies?
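If the script stopped with an error instead, one possible cause is a company block that is missing one of the expected elements: in Simple HTML DOM, find() with an index returns null when nothing matches, and reading innertext from null fails. Below is a hedged sketch of a more forgiving extraction. It swaps in PHP's built-in DOMDocument and DOMXPath so it runs without any extra files. The sample markup and the helper name first_text are invented for illustration, and only two of the fields are shown.

```php
<?php
// Hypothetical sample: the second company has no phone number
$sample = '<div class="one-state"><h2>Ohio</h2>'
    . '<div class="col-md-6"><h3>Acme Co</h3><span class="tel">555-0100</span></div>'
    . '<div class="col-md-6"><h3>Bolt Inc</h3></div>'
    . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// Hypothetical helper: text of the first match under $context, or null
function first_text(DOMXPath $xpath, string $query, DOMNode $context): ?string {
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$companies = array();
// Go through each state
foreach ($xpath->query('//div[@class="one-state"]') as $state) {
    $state_name = first_text($xpath, './/h2', $state) ?? 'Unknown';
    // Loop over the .col-md-6 divs inside this state
    foreach ($xpath->query('.//div[@class="col-md-6"]', $state) as $comp) {
        $companies[$state_name][] = array(
            'title' => first_text($xpath, './/h3', $comp),
            'tel' => first_text($xpath, './/span[@class="tel"]', $comp),
        );
    }
}
print_r($companies);
```

A missing field ends up as null in the array rather than stopping the whole run, which makes gaps easy to spot in the JSON afterwards.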
You will really be able to save someone's bacon one day with this skill.
You didn't think I wrote that all out by hand, did you?