Thursday, January 17, 2013

Google Reverse Image Search scraping without API in PHP

Some of you have probably used Google reverse image search: you drag an image from your computer onto the search field, or paste an image URL after clicking the camera icon. But there is no API that returns the results neatly as JSON or XML. There was an API for Google image search, which is now deprecated, but it didn't provide reverse image search anyway.
Google reverse image search by url
So I searched for other APIs. The first one I found, and the one recommended on the internet as an alternative to Google, was TinEye. I tried uploading some pictures on their website, but the results weren't as rich as Google's reverse image search.
The other alternative was the Bing Search API. I didn't find anything about reverse image search in its description, so I quickly set up the Bing Search API to test it. All it offered was normal search, no reverse search. So if you want an ordinary search API for images, consider the Bing Search API.

Okay, let's jump into Google reverse image search.
I bet you're wondering how you could automate the part where you drag an image to the Google search box, or what the full URL is when you search for an image by URL. The full URL is
https://www.google.com/searchbyimage?&image_url=<YOUR URL>
For example
https://www.google.com/searchbyimage?&image_url=http://kaizern.com/blog/beautiful-landscapes-1.jpg
If you open the above address in your browser, you will get the search results and see that the address has changed: you were redirected. So if you use something like this in your code


file_get_contents('https://www.google.com/searchbyimage?&image_url=http://kaizern.com/blog/beautiful-landscapes-1.jpg');

You will only get the first page, with status 302. What you need is to follow the redirect chain. At this point cURL comes to the rescue. The cURL function below works like a charm and opens Google's search results page.
function open_url($full_url)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $full_url);
    curl_setopt($curl, CURLOPT_HEADER, 0);             // keep response headers out of the returned content
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);     // return the page as a string instead of printing it
    curl_setopt($curl, CURLOPT_REFERER, 'http://www.kaizern.com');
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // skips certificate verification: handy for testing, insecure otherwise
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11");
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow the 302 redirect chain
    $content = utf8_decode(curl_exec($curl));          // convert the UTF-8 response to ISO-8859-1
    curl_close($curl);
    return $content;
}
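If you want to confirm that the redirect chain was really followed, cURL can report the final status code and address after the transfer. The sketch below is only an illustration: the names fetch_with_redirect_info and is_final_response are made up for this example and are not part of my class, and the limit of 10 redirects is an arbitrary choice.

```php
<?php
// Hypothetical helper (not part of the class in this post): fetches a URL
// and reports where the redirect chain ended.
function fetch_with_redirect_info($full_url)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $full_url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 10); // arbitrary cap on the chain length
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11");
    $body = curl_exec($curl);

    // The handle is released automatically when $curl goes out of scope.
    return array(
        'body'      => $body === false ? '' : $body,
        'status'    => curl_getinfo($curl, CURLINFO_HTTP_CODE),      // status of the last response
        'final_url' => curl_getinfo($curl, CURLINFO_EFFECTIVE_URL),  // address after all redirects
        'redirects' => curl_getinfo($curl, CURLINFO_REDIRECT_COUNT), // number of redirects followed
        'error'     => curl_error($curl),
    );
}

// A 3xx status at the end means the chain was not followed to a real page.
function is_final_response($status)
{
    return $status >= 200 && $status < 300;
}
```

If is_final_response returns false for the 'status' value, the body is most likely the short "302 Moved" stub rather than a results page, and there is no point in parsing it.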
The $full_url variable is the full URL for reverse image search like https://www.google.com/searchbyimage?&image_url=http://kaizern.com/blog/beautiful-landscapes-1.jpg
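Since the image address is itself a URL, it may contain its own "?" and "&" characters. A minimal sketch of composing the search URL with the image address percent-encoded is shown here; build_search_url is a name invented for this example, and it drops the stray "&" after the "?" that my original URL has.

```php
<?php
// Hypothetical helper: builds the search URL with the image address
// percent-encoded so its own "?" and "&" cannot leak into Google's query string.
function build_search_url($image_url)
{
    return 'https://www.google.com/searchbyimage?'
         . http_build_query(array('image_url' => $image_url));
}

echo build_search_url('http://kaizern.com/blog/beautiful-landscapes-1.jpg'), "\n";
// https://www.google.com/searchbyimage?image_url=http%3A%2F%2Fkaizern.com%2Fblog%2Fbeautiful-landscapes-1.jpg
```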
The open_url function returns the whole results page. The next step is to drop the HTML we don't need; let's say we keep everything inside the <body> tag and drop the <head>.

function get_tag_content_as_dom($img_res_url, $tag_name = 'body')
    {
        $dom = new DOMDocument();
        $dom->strictErrorChecking = false;  // turn off warnings and errors when parsing
        @$dom->loadHTML($img_res_url);
        $body = $dom->getElementsByTagName($tag_name);
        $body = $body->item(0);
        $new_dom = new DOMDocument();
        $node = $new_dom->importNode($body, true);
        $new_dom->appendChild($node);
        return $new_dom;

    }
So let's walk through that function. The first argument is the result of the open_url function (the page's HTML, despite the parameter name), and the second argument is the HTML tag whose contents we need. We use PHP's DOMDocument class. We get the element by tag name 'body', and then build a new DOMDocument containing that element and all its child nodes, imported recursively, so we can traverse it later with XPath.
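To try the body extraction without hitting Google, you can feed it a small hand-written page. The variant below is a sketch, not the function from my class: extract_tag_as_dom is a made-up name, it silences parser warnings through libxml instead of the @ operator, and it returns null when the tag is missing.

```php
<?php
// Sketch of the same idea as get_tag_content_as_dom, with a guard for a
// missing tag. The function name is hypothetical.
function extract_tag_as_dom($html, $tag_name = 'body')
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementsByTagName($tag_name)->item(0);
    if ($node === null) {
        return null; // tag not present in the page
    }
    $new_dom = new DOMDocument();
    // true = deep import: copy the element together with all its descendants.
    $new_dom->appendChild($new_dom->importNode($node, true));
    return $new_dom;
}

$sample = '<html><head><title>t</title></head><body><div id="main">hello</div></body></html>';
$body_dom = extract_tag_as_dom($sample);
echo $body_dom->documentElement->nodeName, "\n"; // body
```

Because the new document has <body> as its root element, XPath queries against it start with /body rather than /html/body.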

Now is the time to analyze the HTML structure of the Google search results in order to write a correct XPath query. Try playing around by uploading different pictures to Google Image search. You will see that Google recognizes some images and writes a "best guess for this image". What I did next was open the Chrome developer tools and write down the path to where the best guess is. Here's the path:

<div id="main">
  <div>
    <div id="cnt">
      <div id="rcnt">
        <div id="center_col">
          <div class="med" id="res" role="main">
            <div id="topstuff">
The xpath query for that is
/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='topstuff'] 
My function for that is

function get_xpath_result($dom, $xpath_query)
    {
        $dom_xpath = new DOMXPath($dom);
        return $dom_xpath->query($xpath_query);
    }
The first argument is the DOM document you got from the previous function, get_tag_content_as_dom, and the second argument is the XPath query for the topstuff id in the HTML.
Don't worry if you don't grasp the whole picture yet; my whole class is listed further down in this post.
I keep the XPath query in the class's scope as a static variable
static $topstuff_div_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='topstuff']";
In my class's constructor this part looks like
$topstuff_div = $this->get_xpath_result($body_dom, self::$topstuff_div_query);
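The absolute path breaks as soon as any wrapper div changes. As a sketch of an alternative that is not what my class does, the element can also be looked up by its id alone. The HTML here is a hand-written stand-in for a results page, and the query starts at /html because this example queries the full document instead of the extracted body.

```php
<?php
// Hand-written stand-in for a results page; the real markup is far larger.
$html = '<html><body><div id="main"><div><div id="cnt"><div id="rcnt">'
      . '<div id="center_col"><div id="res"><div id="topstuff">'
      . 'Best guess for this image: sentieri del cuore'
      . '</div></div></div></div></div></div></div></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Absolute path, as in the post, prefixed with /html for the full document.
$absolute = $xpath->query("/html/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='topstuff']");
// Lookup by id only: keeps working if the wrappers around the target change.
$relative = $xpath->query("//div[@id='topstuff']");

echo $absolute->length, ' ', $relative->length, "\n"; // 1 1
echo $relative->item(0)->nodeValue, "\n";
```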
Next we need to get the text that follows "Best guess for this image".
I chose a string parsing approach instead of XPath, with this method

function get_best_guess($topstuff_div)
    {
        $topstuff_result = '';
        foreach ($topstuff_div as $val) {
            $topstuff_result .= $val->nodeValue . " ";
        }
        $best_guess = $this->strstr_after($topstuff_result, 'Best guess for this image:');
        return trim($best_guess, ' ');
    }

This function's argument is the result of our previous function, get_xpath_result, as you can guess from the argument name. get_best_guess returns the text that follows "Best guess for this image:". In our example with the beautiful landscape image, it is "sentieri del cuore".
What if there is no best guess? Then there is no div with id topstuff, and we have to turn to the search results themselves. Again, look into the HTML of the results page and see how the results are structured. Here is what I wrote down for myself
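The string parsing step relies on the strstr_after helper, which only appears in the full class listing. A standalone sketch of the same idea follows; text_after is a name made up for this example, and unlike strstr_after it returns an empty string when the marker is absent.

```php
<?php
// Standalone sketch of the idea behind strstr_after/get_best_guess.
// Returns the text following the marker, or '' when the marker is absent.
function text_after($haystack, $needle)
{
    $pos = stripos($haystack, $needle); // case-insensitive search
    if ($pos === false) {
        return '';
    }
    return trim(substr($haystack, $pos + strlen($needle)));
}

$marker = 'Best guess for this image:';
echo text_after('Best guess for this image: sentieri del cuore ', $marker), "\n"; // sentieri del cuore
var_dump(text_after('Pages that include matching images', $marker));           // string(0) ""
```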
<div id="main">
  <div>
    <div id="cnt">
      <div id="rcnt">
        <div id="center_col">
          <div class="med" id="res" role="main">
            <div id="topstuff">
            <div id="search">
              <div id="ires">
                <ol id="rso">
                  <li class="g">
                  NOT<li id="imagebox_bigimages">
Each result is in a list element with class g. The similar images are also in a list element, which additionally has the id imagebox_bigimages. If you don't need the similar-images results, exclude it in your XPath query.
Going deeper into the results HTML, we see that each result's title is in an <h3> tag with class r and the description is in a <span> element with class st. Here are my XPath queries for the title and description of each search result, with the similar-images list element excluded.

static $title_h3_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//h3[@class='r']";

static $span_text_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//span[@class='st']";

The trick here is the double slash: '//span' matches a descendant element at any depth, whereas a single slash, as in '/div', matches only a direct child element.
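The difference between the two separators is easy to see on a tiny hand-written fragment. The markup below is invented for the illustration (the inner div with class s is an assumption, not copied from Google's page).

```php
<?php
// One span nested inside an extra div, one span directly under div.vsc.
$html = '<html><body><div class="vsc">'
      . '<div class="s"><span class="st">deep</span></div>'
      . '<span class="st">direct</span>'
      . '</div></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Single slash: only spans that are direct children of div.vsc.
$children = $xpath->query("/html/body/div[@class='vsc']/span[@class='st']");
// Double slash: spans anywhere below div.vsc.
$descendants = $xpath->query("/html/body/div[@class='vsc']//span[@class='st']");

echo $children->length, "\n";    // 1
echo $descendants->length, "\n"; // 2
```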

$titles = $this->get_xpath_result($body_dom, self::$title_h3_query);

$span_texts = $this->get_xpath_result($body_dom, self::$span_text_query);

Here you get the titles and descriptions as DOMNodeList objects. At this point you need to decide what you want to do with the results. For example, I have a WordPress plugin that, when I upload an image, scrapes the information about that image, removes unnecessary words and suggests a title for it.
You can traverse the DOMNodeList with a simple foreach loop. Here is an example that concatenates all the titles together.

    function loop_xpath_res($xpath_res)
    {
        foreach ($xpath_res as $val) {
            echo $val->nodeValue . " | ";
        }
        echo "\n";
    }
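If you would rather have the titles and descriptions together than print them, the two node lists can be combined by position. This is a sketch under the assumption that every result has both an <h3> and a <span>, so the lists line up; pair_results is a hypothetical helper and the sample HTML is hand-written.

```php
<?php
// Hypothetical helper: zips two DOMNodeLists into title/description pairs.
// Pairing by index assumes each result contributes exactly one node to each list.
function pair_results(DOMNodeList $titles, DOMNodeList $descriptions)
{
    $pairs = array();
    $count = min($titles->length, $descriptions->length);
    for ($i = 0; $i < $count; $i++) {
        $pairs[] = array(
            'title'       => trim($titles->item($i)->nodeValue),
            'description' => trim($descriptions->item($i)->nodeValue),
        );
    }
    return $pairs;
}

$html = '<html><body><ol id="rso">'
      . '<li class="g"><h3 class="r">First title</h3><span class="st">First text</span></li>'
      . '<li class="g"><h3 class="r">Second title</h3><span class="st">Second text</span></li>'
      . '</ol></body></html>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$pairs = pair_results(
    $xpath->query("//li[@class='g']//h3[@class='r']"),
    $xpath->query("//li[@class='g']//span[@class='st']")
);
print_r($pairs);
```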

Below is my whole class, consisting of the functions I've described above. I don't claim it is the neatest or most conventional code, because my primary coding language isn't PHP :)
require_once(ABSPATH . "wp-admin" . '/includes/image.php');
require_once(ABSPATH . "wp-admin" . '/includes/file.php');
require_once(ABSPATH . "wp-admin" . '/includes/media.php');

class Import_Script
{

    static $URL = 'https://www.google.com/searchbyimage?&image_url=';
    static $topstuff_div_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']/div[@id='topstuff']";
    static $title_h3_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']
        /div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//h3[@class='r']";
    static $span_text_query = "/body/div[@id='main']/div/div[@id='cnt']/div[@id='rcnt']/div[@id='center_col']/div[@id='res']
    /div[@id='search']/div[@id='ires']/ol[@id='rso']/li[not(@id='imagebox_bigimages') and @class='g']/div[@class='vsc']//span[@class='st']";
    
    static $domains = array(".com", ".org", ".net", ".hu");
    static $links = array("http://", "www");
    static $picture_file_names = array(".jpg", ".jpeg", ".png");

    public $img_file_name;

    function __construct($image_url, $order)
    {

        try {

            $full_url = $this->compose_url($image_url);
            $img_res_url = $this->open_url($full_url);
            $body_dom = $this->get_tag_content_as_dom($img_res_url);

            $topstuff_div = $this->get_xpath_result($body_dom, self::$topstuff_div_query);
            $best_guess = $this->get_best_guess($topstuff_div);

            $titles = $this->get_xpath_result($body_dom, self::$title_h3_query);

            $span_texts = $this->get_xpath_result($body_dom, self::$span_text_query);

            // if length is > 0 then search result isn't empty
            if ($titles->length > 0 && $span_texts->length > 0) {
                $best_guess = $this->sanitize_best_guess($best_guess);
                if ($best_guess) {
                    $this->img_file_name = strtolower($this->compose_img_file_name($best_guess));
                } else {
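                    // $img_name was never defined here; falling back to the base name of the image URL is an assumption
                    $this->img_file_name = strtolower($this->compose_img_file_name(pathinfo($image_url, PATHINFO_FILENAME)));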
                }
            } else {
                echo "Nothing found about the picture, url: " . $image_url;
            }
        } catch (Exception $e) {
            echo 'Exception caught: ',  $e->getMessage(), "<br />";
            echo 'Exception for url: '.$image_url."<br />";
            sleep(10);
        }
    }

    function loop_xpath_res($xpath_res)
    {
        foreach ($xpath_res as $val) {
            echo $val->nodeValue . " | ";
        }
        echo "<br />";
    }

    function compose_img_file_name($word)
    {
        return str_replace(" ", "-", $word);
    }

    function sanitize_best_guess($best_guess)
    {
        if ($best_guess) $best_guess = $this->filter_out_bad_words($best_guess);
        return $best_guess;
    }

    function filter_out_bad_words($string)
    {
        $string = $this->remove_containing_word($string, self::$links);
        $string = $this->remove_containing_word($string, self::$domains);
        $string = $this->remove_containing_word($string, self::$picture_file_names);
        // hyphen escaped: newer PCRE versions reject a range that ends in \pL
        $string = preg_replace('/[^ \-\pL]/', '', $string);
        $string = preg_replace("#[^a-zA-Z0-9 -]#", "", $string);
        $string = trim($string, '-');
        $string = trim($string, ' ');
        $string = $this->remove_specific_word($string, '-');
        $string = trim($string, '-');
        $string = trim($string, ' ');
        return $string;
    }

    function contains_specific_word($string, $specific_word)
    {
        $string_array = explode(" ", $string);
        foreach ($string_array as $element) {
            if (strcasecmp($element, $specific_word) == 0) return true;
        }
        return false;
    }

    function remove_specific_word($string, $bad_words)
    {
        $string_array = explode(" ", $string);
        if (is_array($bad_words)) {
            foreach ($string_array as $index => $word) {
                foreach ($bad_words as $bad_word) {
                    if (strcasecmp($word, $bad_word) == 0) unset($string_array[$index]);
                }
            }
        } else {
            foreach ($string_array as $index => $word) {
                if (strcasecmp($word, $bad_words) == 0) unset($string_array[$index]);
            }
        }
        return implode(" ", $string_array);
    }

    function remove_containing_word($string, $word_peaces)
    {
        $new_string = $string;
        foreach ($word_peaces as $part_of_word) {
            $word_pos = strripos($new_string, $part_of_word);
            if ($word_pos !== false) {
                $words_array = explode(" ", $new_string);
                foreach ($words_array as $index => $word) {
                    if (stripos($word, $part_of_word) !== false) {
                        unset($words_array[$index]);
                    }
                }
                $new_string = implode(" ", $words_array);
            }
        }
        return $new_string;
    }

    function get_best_guess($topstuff_div)
    {
        $topstuff_result = '';
        foreach ($topstuff_div as $val) {
            $topstuff_result .= $val->nodeValue . " ";
        }
        $best_guess = $this->strstr_after($topstuff_result, 'Best guess for this image:');
        return trim($best_guess, ' ');
    }

    function strstr_after($haystack, $needle)
    {
        $pos = stripos($haystack, $needle);
        if (is_int($pos)) {
            return substr($haystack, $pos + strlen($needle));
        }
        // Most likely false or null
        return $pos;
    }

    function compose_url($request)
    {
        $full_url = self::$URL . $request;
        return $full_url;
    }

    function open_url($full_url)
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $full_url);
        curl_setopt($curl, CURLOPT_HEADER, 0);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_REFERER, 'http://localhost');
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11");
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        $content = utf8_decode(curl_exec($curl));
        curl_close($curl);
        return $content;
    }

    function get_tag_content_as_dom($img_res_url, $tag_name = 'body')
    {
        $dom = new DOMDocument();
        $dom->strictErrorChecking = false; // turn off warnings and errors when parsing
        @$dom->loadHTML($img_res_url);
        $body = $dom->getElementsByTagName($tag_name);
        $body = $body->item(0);
        $new_dom = new DOMDocument();
        $node = $new_dom->importNode($body, true);
        $new_dom->appendChild($node);
        return $new_dom;
    }

    function get_xpath_result($dom, $xpath_query)
    {
        $dom_xpath = new DOMXPath($dom);
        return $dom_xpath->query($xpath_query);
    }
}
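
The file-name clean-up can be tried on its own, without WordPress or a network call. The function below is a simplified sketch of what filter_out_bad_words and compose_img_file_name do together, not a copy of them: suggest_file_name is a made-up name, and www.example.com and photo.jpg are placeholder inputs.

```php
<?php
// Simplified, standalone take on filter_out_bad_words + compose_img_file_name.
function suggest_file_name($text, array $fragments)
{
    $kept = array();
    foreach (preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $drop = false;
        foreach ($fragments as $fragment) {
            if (stripos($word, $fragment) !== false) {
                $drop = true; // word looks like a URL, domain or file name
                break;
            }
        }
        if (!$drop) {
            $kept[] = $word;
        }
    }
    // Keep ASCII letters, digits and spaces only, then collapse to a slug.
    $clean = preg_replace('/[^a-zA-Z0-9 ]/', '', implode(' ', $kept));
    $clean = preg_replace('/ +/', ' ', trim($clean));
    return strtolower(str_replace(' ', '-', $clean));
}

$fragments = array('http://', 'www', '.com', '.org', '.net', '.jpg', '.jpeg', '.png');
echo suggest_file_name('Sentieri del cuore - www.example.com photo.jpg', $fragments), "\n";
// sentieri-del-cuore
```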

EDIT: GITHUB CODE IS HERE

UPDATE:
Here is the updated version using the SimpleHTMLDom parsing library.
https://gist.github.com/skyzer/24f80640e99070ec83bc
I included only the functions that need to change from XPath selection to SimpleHTMLDom selection.
$variable->find(...) is the SimpleHTMLDom library's selector, which picks the elements correctly.

25 comments:

  1. Hi Artur, can you provide the full source code as a downloadable file?

    1. https://github.com/skyzer/google-reverse-image-search-scraper

  2. This comment has been removed by the author.

  3. Hey Arther,
    What's the output to be expected from this code? I get a white page when running it..

  4. I realized you require image.php, file.php, and media.php at the top of the source code. Are those files accessible to us anywhere? Maybe that is why the code I got from github didn't work..

    1. Hey, yes, these are WordPress files. With the code in this blog post I just showed the technique: what the proper cURL call is and what the XPath is. Since then I have improved it, and instead of XPath I use the SimpleHTMLDom library to traverse the DOM, because the XPath didn't always work.

    3. Hi. Is there a Java version of this returning XML? I am new to PHP. While I am able to understand most of it, I did not understand the top 3 lines that include image.php, file.php, media.php.

    4. Hey, these files are included from the WordPress API. I don't have a Java version, unfortunately. My PHP version now works like a charm.

  5. I would really appreciate a blog post or an explanation about how you improved it with simplehtmldom and what the problem was with xpath that it didn't always work. I am running into some issues that I am sure you would find minor being the expert in this...If there is a way I could contact you beyond this blog I would love to get your input..

    1. Hey, I can post snippets of my version with simplehtmldom with some explanations okay.

  6. Does this function still work? When I tried it for some picture I always get output "Nothing found about the picture..."
    When I tried to watch output of single functions step by step, I found that output of $this->open_url($full_url) is "302 Moved The document has moved here" which forward to $this->get_tag_content_as_dom($img_res_url) and next...

    1. It's about the cURL parameters you set; you need to follow the redirection.

  7. Hi,

    We implemented something similar to this using java. Right now we are stuck with the issue of google blocking the search as its not from a "human". Since this capability is not exposed as an API, is there a workaround to get around this?

    1. Hi, do you mean that it's asking for captcha? Then you must be really overusing it :)
      I did have this issue sometimes when I did too much reverse image searching, and just had to wait some hours. To avoid it, put some wait time between search queries, like 1-2 seconds, or do more advanced stuff like using proxies.

  8. Please post the complete code using the new supposedly working methods?

    1. Hi, you can find the updated functions here
      https://gist.github.com/skyzer/24f80640e99070ec83bc
      the ->find(...) is the SimpleHTMLDom library's selector, which picks the elements correctly

  9. Hi, Thank you for your explanation. I am new to PHP. I used your code, but I got the error : Parse error: syntax error, unexpected T_VARIABLE in "$pic_url ='http://kaizern.com/blog/beautiful-landscapes-1.jpg';"

    I don't know how to fix it.
    Could you please help me?

  10. Hi, thank you for your hardwork!!! i have a problem and hope you can advise me.
    $pic_url = 'http://img4.wikia.nocookie.net/__cb20130117033701/fairytail/images/thumb/0/00/Character_Slider_no_2.jpg/670px-Character_Slider_no_2.jpg';

    i also deleted the top 3 include line. this file was placed in my xampp folder. i keep getting nothing found. could you please advise?

  11. This code doesn't seem to work for me these days. It still returns a 302 page despite the CURLOPT_FOLLOWLOCATION being set to true. Any ideas on why?

  12. Reverse image search is mainly finding the reverse image or original image source from web. There are many bloggers and content writers who want to write the similar topic which is already written by someone else. In that case finding the similar images is very important. It's really very hard to find out the similar image on internet. Reverse image search tool will do this job for you.

  13. You happened to be one in all people who have worked day and night to make distinctive artistic otherwise you happened to transfer original and awing photos on Flickr, Pinterestand at some point you saw your exposure on one massive web site with no credit to your profile. however will that cause you to feel? have you ever ever thought, if you'll notice all the pages on the web, United Nations agency derived your pictures ? Welcome, to the globe of Reverse image search, which is able to allow you to search copy of a picture on entire net.
    Use: Reverse image search tool for free

  14. Looking for apartments in a new city? Make sure to Reverse image search any images to avoid scams. Just Upload your image or find through url.
