Thursday, January 17, 2013

Google Reverse Image Search scraping without API in PHP

Some of you have probably used Google reverse image search: you drag an image from your computer into the search field, or paste an image URL after clicking the camera icon. But there is no API that returns the results neatly in JSON or XML without any hassle. There was an API for Google image search, which is now deprecated, but it didn't provide reverse image search anyway.
[Image: Google reverse image search by URL]
So I searched for other APIs. The first one I found, and the one recommended around the internet as an alternative to Google, is TinEye. I tried uploading some pictures on their website, but the results weren't as rich as those from Google Reverse Image Search.
The other alternative was the Bing Search API. I didn't find anything about reverse image search in its description, so I quickly set up the Bing Search API to test it. All it had was a normal search API, with no reverse search. So if you want an ordinary search API for images, consider using the Bing Search API.

Okay, let's jump into Google reverse image search.
I bet you're wondering how you could automate the part where you drag an image to the Google search box, or what the full URL is when you upload an image by URL. The full URL is
https://www.google.com/searchbyimage?&image_url=<YOUR URL>
For example
https://www.google.com/searchbyimage?&image_url=http://kaizern.com/blog/beautiful-landscapes-1.jpg
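The example URL above passes the image address through unescaped, which is fine for a simple address like this one. As a minimal sketch for addresses that carry their own query string or spaces, the image URL can be percent-encoded before it is appended. The helper name build_search_url is hypothetical and not part of the class shown later in this post.

```php
<?php
// Hypothetical helper (not part of the original class): builds the
// reverse image search URL with the image address percent-encoded.
function build_search_url($image_url)
{
    $base = 'https://www.google.com/searchbyimage?&image_url=';
    // urlencode() escapes characters such as ':', '/', '?' and '&'
    // so they cannot be mistaken for parts of the outer query string.
    return $base . urlencode($image_url);
}

echo build_search_url('http://kaizern.com/blog/beautiful-landscapes-1.jpg'), "\n";
```

Without the encoding, an image address such as http://example.com/pic.jpg?size=large&v=2 would have its own &v=2 read as a separate parameter of the search URL.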
If you open the above address in your browser, you will get the search results and see that the address has changed: the request was redirected. So if you use something like this in your code


file_get_contents('https://www.google.com/searchbyimage?&image_url=http://kaizern.com/blog/beautiful-landscapes-1.jpg');

you will only get the first page, with status 302. What you need is to follow the redirect chain. At this point cURL comes to the rescue. The cURL function below works like a charm and opens Google's search results page.
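function open_url($full_url)
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $full_url);
        curl_setopt($curl, CURLOPT_HEADER, 0);            // keep response headers out of the returned content
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);    // return the page as a string instead of printing it
        curl_setopt($curl, CURLOPT_REFERER, 'http://www.kaizern.com');
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // skips certificate verification; not safe beyond quick experiments
        curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11");
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow the 302 redirect chain
        // utf8_decode() converts UTF-8 to ISO-8859-1 (it is deprecated as of PHP 8.2)
        $content = utf8_decode(curl_exec($curl));
        curl_close($curl);
        return $content;
    }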
The $full_url variable is the full URL for the reverse image search, for example https://www.google.com/searchbyimage?&image_url=http://kaizern.com/blog/beautiful-landscapes-1.jpg
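When a scrape returns something unexpected, it helps to see where the redirect chain actually ended. The following is a minimal sketch of a variant of open_url that also reports the final status code and URL. The function name fetch_following_redirects, the redirect limit, and the timeouts are assumptions, not part of the original class.

```php
<?php
// Hypothetical variant of open_url() (an assumption, not the original
// code): also reports the final status code and URL after redirects.
function fetch_following_redirects($full_url)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $full_url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow 302 responses
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5);         // but give up after 5 hops
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);    // seconds
    curl_setopt($curl, CURLOPT_TIMEOUT, 15);          // seconds

    $body = curl_exec($curl);
    if ($body === false) {
        // Transfer failed before any usable response arrived.
        return null;
    }
    return array(
        'status'    => curl_getinfo($curl, CURLINFO_HTTP_CODE),
        'final_url' => curl_getinfo($curl, CURLINFO_EFFECTIVE_URL),
        'body'      => $body,
    );
}
```

A caller would check that the result is not null and that 'status' is 200 before handing 'body' to the parser. The referer and user agent options from open_url can be added in the same way.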
The open_url function returns the whole results page. The next step is probably to drop the HTML we don't need. Let's say we need everything inside the <body> tag, and the <head> tag will be dropped.

function get_tag_content_as_dom($img_res_url, $tag_name = 'body')
    {
        // Despite its name, $img_res_url holds the HTML returned by open_url(), not a URL
        $dom = new DOMDocument();
        $dom->strictErrorChecking = false;  // turn off warnings and errors when parsing
        @$dom->loadHTML($img_res_url);
        $body = $dom->getElementsByTagName($tag_name);
        $body = $body->item(0);             // first element with the requested tag name
        $new_dom = new DOMDocument();
        $node = $new_dom->importNode($body, true); // true = copy all child nodes recursively
        $new_dom->appendChild($node);
        return $new_dom;
    }
So let's go through that function. The first argument is the result of the open_url function, and the second is the HTML tag whose contents we need. We use PHP's DOMDocument class. We get the first element with the tag name 'body', and then build a new DOMDocument from it, importing all of its child nodes recursively, so that we can traverse it later with XPath.
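To see the effect of that copy in isolation, here is a minimal sketch that runs the same steps on a tiny, made-up page instead of the real results page.

```php
<?php
// A small, made-up page standing in for the real results page.
$html = '<html><head><title>ignored</title></head>'
      . '<body><div id="main"><p>Hello</p></div></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);   // @ hides warnings about sloppy markup

// Take the first <body> element of the parsed page...
$body = $dom->getElementsByTagName('body')->item(0);

// ...and deep-copy it (true = include all descendants) into a new,
// empty document, so that <body> becomes that document's root.
$new_dom = new DOMDocument();
$new_dom->appendChild($new_dom->importNode($body, true));

echo $new_dom->saveHTML();
```

Because <body> is now the root of the new document, XPath queries against it can start with /body, and nothing from <head> is left to match by accident.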

Now is the time to analyze the HTML structure of the Google search results page so that we can write a correct XPath query. Try playing around by uploading different pictures to Google Image Search. You will see that Google recognizes some images and writes a best guess for them. What I did next was open the Chrome debugger and write down the path to where the best guess is. Here's the path:

<div id="main">
  <div>
    <div id="cnt">
      <div id="rcnt">
        <div id="center_col">
          <div class="med" id="res" role="main">
            <div id="topstuff">
The xpath query for that is
/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='topstuff'] 
My function for that is

function get_xpath_result($dom, $xpath_query)
    {
        $dom_xpath = new DOMXPath($dom);
        return $dom_xpath->query($xpath_query);
    }
The first argument is the DOM document that you got from the previous function, get_tag_content_as_dom, and the second argument is the XPath query for the topstuff id in the HTML.
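As a minimal sketch of how such a query behaves, the snippet below runs it against made-up markup that mimics only the nesting written down above. It also tries a shorter query that looks the div up by id alone. That shorter query is an assumption for illustration and has not been checked against the live results page.

```php
<?php
// Made-up markup that mimics only the nesting described above.
$html = '<body><div id="main"><div><div id="cnt"><div id="rcnt">'
      . '<div id="center_col"><div class="med" id="res">'
      . '<div id="topstuff">Best guess for this image: example</div>'
      . '</div></div></div></div></div></div></body>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The absolute path from this post, prefixed with /html because
// here the parsed document still has <html> as its root element.
$absolute = "/html/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']"
          . "/div[@id='center_col']/div[@id='res']/div[@id='topstuff']";

// A shorter alternative (an assumption, not the query used in the
// class): find the div by id wherever it sits in the tree.
$relative = "//div[@id='topstuff']";

foreach (array($absolute, $relative) as $query) {
    $nodes = $xpath->query($query);
    echo $nodes->length, ' match: ', $nodes->item(0)->nodeValue, "\n";
}
```

The absolute path breaks as soon as any wrapper div changes, while the id lookup only depends on the id itself. The trade-off is that the short form says nothing about where on the page the match was found.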
Don't worry if you don't grasp the whole picture yet; my whole class is further down in this post.
I keep the XPath query in the class's scope as a static variable
static $topstuff_div_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='topstuff']";
In my class's constructor this part looks like
$topstuff_div = $this->get_xpath_result($body_dom, self::$topstuff_div_query);
Next we need to get the text that follows "Best guess for this image".
For this I chose a string parsing approach instead of XPath, with this method

function get_best_guess($topstuff_div)
    {
        $topstuff_result = '';
        foreach ($topstuff_div as $val) {
            $topstuff_result .= $val->nodeValue . " ";
        }
        $best_guess = $this->strstr_after($topstuff_result, 'Best guess for this image:');
        return trim($best_guess, ' ');
    }

As you can guess from the argument's name, this function takes the result of our previous function, get_xpath_result, and get_best_guess returns the text that follows "Best guess for this image:". In our example with the beautiful landscape image, it is sentieri del cuore.
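The string-parsing step can be tried on its own, without any DOM code. This is a minimal sketch with a standalone function, text_after (a hypothetical name that mirrors strstr_after from the class), and a made-up input string shaped like the text of the topstuff div.

```php
<?php
// Standalone copy of the string-parsing idea: return whatever follows
// $needle in $haystack (case-insensitive), or false if it is absent.
function text_after($haystack, $needle)
{
    $pos = stripos($haystack, $needle);
    if ($pos === false) {
        return false;
    }
    return substr($haystack, $pos + strlen($needle));
}

// Made-up node text in the shape the topstuff div is described to have.
$text = 'Image size: 1024 x 768 Best guess for this image: sentieri del cuore ';
$guess = text_after($text, 'Best guess for this image:');
echo $guess === false ? 'no best guess' : trim($guess), "\n";
```

Note that this approach depends on the exact English label, so it only works when the results page is served in English.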
What if there is no best guess? Then there is no div with id topstuff, and we have to go to the search results themselves. Again, dig into the HTML of the search results and look at how they are structured. Here is what I wrote down for myself
<div id="main">
  <div>
    <div id="cnt">
      <div id="rcnt">
        <div id="center_col">
          <div class="med" id="res" role="main">
            <div id="topstuff">
            <div id="search">
              <div id="ires">
                <ol id="rso">
                  <li class="g">
                  NOT<li id="imagebox_bigimages">
Each result is in a list element with class g. The similar images are also in a list element, which additionally has the id imagebox_bigimages. If you don't need the similar-image results, exclude it in your XPath query.
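A minimal sketch of that exclusion, run against a made-up result list with two ordinary results and one similar-images box:

```php
<?php
// Made-up result list: two ordinary results plus the similar-images box.
$html = '<ol id="rso">'
      . '<li class="g"><h3 class="r">First result</h3></li>'
      . '<li class="g" id="imagebox_bigimages"><h3 class="r">Similar images</h3></li>'
      . '<li class="g"><h3 class="r">Second result</h3></li>'
      . '</ol>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// All list items with class g, similar-images box included.
$all = $xpath->query("//li[@class='g']");

// The same items minus the one whose id is imagebox_bigimages.
$filtered = $xpath->query("//li[@class='g' and not(@id='imagebox_bigimages')]");

echo $all->length, ' vs ', $filtered->length, "\n";
foreach ($filtered as $li) {
    echo $li->nodeValue, "\n";
}
```

The predicate not(@id='imagebox_bigimages') is also true for list items that have no id at all, which is why the ordinary results pass through.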
Going deeper into the results HTML, we see that each result's title is in an <h3> tag with class r and the description is in a <span> element with class st. Here are my XPath queries for the title and description of each search result, with the similar-images list element excluded.

static $title_h3_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//h3[@class='r']";

static $span_text_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//span[@class='st']";

The trick here is the double slash: '//span' matches a descendant element at any depth, whereas a single slash, as in '/div', matches only a direct child element.
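The difference between the two separators is easy to check on a small, made-up fragment with one span directly under the div and one nested a level deeper:

```php
<?php
// Made-up markup: one span directly under the div, one nested deeper.
$html = '<div class="vsc">'
      . '<span class="st">direct child</span>'
      . '<p><span class="st">nested descendant</span></p>'
      . '</div>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Single slash: only spans that are immediate children of the div.
$children = $xpath->query("//div[@class='vsc']/span[@class='st']");

// Double slash: spans anywhere below the div, at any depth.
$descendants = $xpath->query("//div[@class='vsc']//span[@class='st']");

echo $children->length, "\n";
echo $descendants->length, "\n";
```

The single-slash query finds one span and the double-slash query finds both, so the double slash keeps working when extra wrapper elements appear between the div and the span.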

$titles = $this->get_xpath_result($body_dom, self::$title_h3_query);

$span_texts = $this->get_xpath_result($body_dom, self::$span_text_query);

Here you get the titles and descriptions as DOMNodeList objects. At this point you need to decide what you want to do with the results. For example, I have a WordPress plugin that, when I upload an image, scrapes the information about that image, removes unnecessary words, and suggests a title for it.
You can traverse the DOMNodeList with a simple foreach loop. Here is an example that concatenates all the titles together.

    function loop_xpath_res($xpath_res)
    {
        foreach ($xpath_res as $val) {
            echo $val->nodeValue . " | ";
        }
        echo "\n";
    }
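If you would rather have each title next to its description, the two node lists can be walked by index. This is a minimal sketch on made-up markup; the helper name pair_results is hypothetical, and it assumes that both lists come back in page order with one description per title.

```php
<?php
// Hypothetical helper (not part of the class): pairs the i-th title
// with the i-th description, assuming both lists are in page order.
function pair_results(DOMNodeList $titles, DOMNodeList $texts)
{
    $results = array();
    // Stop at the shorter list so item() never returns null.
    $count = min($titles->length, $texts->length);
    for ($i = 0; $i < $count; $i++) {
        $results[] = array(
            'title'       => trim($titles->item($i)->nodeValue),
            'description' => trim($texts->item($i)->nodeValue),
        );
    }
    return $results;
}

// Made-up markup standing in for two search results.
$html = '<ol><li class="g"><h3 class="r">Title A</h3><span class="st">Text A</span></li>'
      . '<li class="g"><h3 class="r">Title B</h3><span class="st">Text B</span></li></ol>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$pairs = pair_results(
    $xpath->query("//h3[@class='r']"),
    $xpath->query("//span[@class='st']")
);
print_r($pairs);
```

If a result has a title but no description, the pairs after it shift by one. Querying each list item separately and reading its title and description from there would avoid that, at the cost of more queries.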

Below is my whole class, consisting of the functions I've described above. I don't claim it is the neatest or most conventional code, because my primary coding language isn't PHP :)
require_once(ABSPATH . "wp-admin" . '/includes/image.php');
require_once(ABSPATH . "wp-admin" . '/includes/file.php');
require_once(ABSPATH . "wp-admin" . '/includes/media.php');

class Import_Script
{

    static $URL = 'https://www.google.com/searchbyimage?&image_url=';
    static $topstuff_div_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='topstuff']";
    static $title_h3_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//h3[@class='r']";
    static $span_text_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//span[@class='st']";

    static $domains = array(".com", ".org", ".net", ".hu");
    static $links = array("http://", "www");
    static $picture_file_names = array(".jpg", ".jpeg", ".png");

    public $img_file_name;

    function __construct($image_url, $order)
    {

        try {

            $full_url = $this->compose_url($image_url);
            $img_res_url = $this->open_url($full_url);
            $body_dom = $this->get_tag_content_as_dom($img_res_url);

            $topstuff_div = $this->get_xpath_result($body_dom, self::$topstuff_div_query);
            $best_guess = $this->get_best_guess($topstuff_div);

            $titles = $this->get_xpath_result($body_dom, self::$title_h3_query);

            $span_texts = $this->get_xpath_result($body_dom, self::$span_text_query);

            // if length is > 0 then search result isn't empty
            if ($titles->length > 0 && $span_texts->length > 0) {
                $best_guess = $this->sanitize_best_guess($best_guess);
                if ($best_guess) {
                    $this->img_file_name = strtolower($this->compose_img_file_name($best_guess));
                } else {
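                    // $img_name was never defined here; as an assumption, fall back to the file name taken from the image URL
                    $img_name = pathinfo((string) parse_url($image_url, PHP_URL_PATH), PATHINFO_FILENAME);
                    $this->img_file_name = strtolower($this->compose_img_file_name($img_name));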
                }
            } else {
                echo "Nothing found about the picture, url: " . $image_url;
            }
        } catch (Exception $e) {
            echo 'Exception caught: ',  $e->getMessage(), "<br />";
            echo 'Exception for url: '.$image_url."<br />";
            sleep(10);
        }
    }

    function loop_xpath_res($xpath_res)
    {
        foreach ($xpath_res as $val) {
            echo $val->nodeValue . " | ";
        }
        echo "<br />";
    }

    function compose_img_file_name($word)
    {
        return str_replace(" ", "-", $word);
    }

    function sanitize_best_guess($best_guess)
    {
        if ($best_guess) $best_guess = $this->filter_out_bad_words($best_guess);
        return $best_guess;
    }

    function filter_out_bad_words($string)
    {
        $string = $this->remove_containing_word($string, self::$links);
        $string = $this->remove_containing_word($string, self::$domains);
        $string = $this->remove_containing_word($string, self::$picture_file_names);
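        // hyphen escaped so it is a literal, not a range that ends in \pL (PCRE2 in PHP 7.3+ rejects such a range)
        $string = preg_replace('/[^ \-\pL]/', '', $string);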
        $string = preg_replace("#[^a-zA-Z0-9 -]#", "", $string);
        $string = trim($string, '-');
        $string = trim($string, ' ');
        $string = $this->remove_specific_word($string, '-');
        $string = trim($string, '-');
        $string = trim($string, ' ');
        return $string;
    }

    function contains_specific_word($string, $specific_word)
    {
        $string_array = explode(" ", $string);
        foreach ($string_array as $element) {
            if (strcasecmp($element, $specific_word) == 0) return true;
        }
        return false;
    }

    function remove_specific_word($string, $bad_words)
    {
        $string_array = explode(" ", $string);
        if (is_array($bad_words)) {
            foreach ($string_array as $index => $word) {
                foreach ($bad_words as $bad_word) {
                    if (strcasecmp($word, $bad_word) == 0) unset($string_array[$index]);
                }
            }
        } else {
            foreach ($string_array as $index => $word) {
                if (strcasecmp($word, $bad_words) == 0) unset($string_array[$index]);
            }
        }
        return implode(" ", $string_array);
    }

    function remove_containing_word($string, $word_peaces)
    {
        $new_string = $string;
        foreach ($word_peaces as $part_of_word) {
            $word_pos = strripos($new_string, $part_of_word);
            if ($word_pos !== false) {
                $words_array = explode(" ", $new_string);
                foreach ($words_array as $index => $word) {
                    if (stripos($word, $part_of_word) !== false) {
                        unset($words_array[$index]);
                    }
                }
                $new_string = implode(" ", $words_array);
            }
        }
        return $new_string;
    }

    function get_best_guess($topstuff_div)
    {
        $topstuff_result = '';
        foreach ($topstuff_div as $val) {
            $topstuff_result .= $val->nodeValue . " ";
        }
        $best_guess = $this->strstr_after($topstuff_result, 'Best guess for this image:');
        return trim($best_guess, ' ');
    }

    function strstr_after($haystack, $needle)
    {
        $pos = stripos($haystack, $needle);
        if (is_int($pos)) {
            return substr($haystack, $pos + strlen($needle));
        }
        // Most likely false or null
        return $pos;
    }

    function compose_url($request)
    {
        $full_url = self::$URL . $request;
        return $full_url;
    }

    function open_url($full_url)
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $full_url);
        curl_setopt($curl, CURLOPT_HEADER, 0);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_REFERER, 'http://localhost');
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11");
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        $content = utf8_decode(curl_exec($curl));
        curl_close($curl);
        return $content;
    }

    function get_tag_content_as_dom($img_res_url, $tag_name = 'body')
    {
        $dom = new DOMDocument();
        $dom->strictErrorChecking = false; // turn off warnings and errors when parsing
        @$dom->loadHTML($img_res_url);
        $body = $dom->getElementsByTagName($tag_name);
        $body = $body->item(0);
        $new_dom = new DOMDocument();
        $node = $new_dom->importNode($body, true);
        $new_dom->appendChild($node);
        return $new_dom;
    }

    function get_xpath_result($dom, $xpath_query)
    {
        $dom_xpath = new DOMXPath($dom);
        return $dom_xpath->query($xpath_query);
    }
}

EDIT: GITHUB CODE IS HERE

UPDATE:
Here is the updated version, using the SimpleHTMLDom parsing library.
https://gist.github.com/skyzer/24f80640e99070ec83bc
I included only the functions that need to be changed from XPath selection to SimpleHTMLDom-style selection.
$variable->find(...) is the SimpleHTMLDom library's selector, which selects the elements correctly.