Tag: google

  • Scrape Google Search Results Page

Here’s a short script that will scrape the first 100 listings in the Google organic results.

You might want to use this to find the positions of your sites and track them for certain target keyword phrases over time. That could be a very good way to determine, for example, whether your SEO efforts are working. Or you could use the list of URLs as a starting point for some other web crawling activity.

As written, the script just dumps the list of URLs to a text file.

    It uses the BeautifulSoup library to help with parsing the HTML page.
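    For illustration, the same class-based link extraction can be done with nothing but the standard library. This is a sketch written for Python 3 (where the old BeautifulSoup package no longer installs), and the `class="l"` attribute is specific to Google's markup at the time:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a class="l"> tags, mimicking the
    BeautifulSoup query used in the script below (stdlib-only sketch)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == 'a' and d.get('class') == 'l' and 'href' in d:
            self.links.append(d['href'])

p = LinkExtractor()
p.feed('<a class="l" href="http://example.com/">hit</a>'
       '<a href="http://skip.example/">miss</a>')
print(p.links)  # ['http://example.com/']
```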

    Example Usage:

    $ python GoogleScrape.py
    $ cat links.txt
    http://www.halotis.com/
    http://www.halotis.com/2009/07/01/rss-twitter-bot-in-python/
    http://www.blogcatalog.com/blogs/halotis.html
    http://www.blogcatalog.com/topic/sqlite/
    http://ieeexplore.ieee.org/iel5/10358/32956/01543043.pdf?arnumber=1543043
    http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1543043
    http://doi.ieeecomputersociety.org/10.1109/DATE.2001.915065
    http://rapidlibrary.com/index.php?q=hal+otis
    http://www.tagza.com/Software/Video_tutorial_-_URL_re-directing_software-___HalOtis/
    http://portal.acm.org/citation.cfm?id=367328
    http://ag.arizona.edu/herbarium/db/get_taxon.php?id=20605&show_desc=1
    http://www.plantsystematics.org/taxpage/0/genus/Halotis.html
    http://www.mattwarren.name/
    http://www.mattwarren.name/2009/07/31/net-worth-update-3-5/
    http://newweightlossdiet.com/privacy.php
    http://www.ingentaconnect.com/content/nisc/sajms/1988/00000006/00000001/art00002?crawler=true
    http://www.ingentaconnect.com/content/nisc/sajms/2000/00000022/00000001/art00013?crawler=true
    

    ...... $

    Here’s the script:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # (C) 2009 HalOtis Marketing
    # written by Matt Warren
    # http://halotis.com/
    
    import urllib
    import urllib2
    
    from BeautifulSoup import BeautifulSoup
    
    def google_grab(query):
    
        address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
        request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
        urlfile = urllib2.urlopen(request)
        page = urlfile.read(200000)
        urlfile.close()
        
        soup = BeautifulSoup(page)
        links = [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]
        
        return links
    
    if __name__=='__main__':
        # Example: Search written to file
        links = google_grab('halotis')
        # write one URL per line to links.txt
        open('links.txt', 'w').write('\n'.join(links))
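    The rank-tracking idea mentioned above can be sketched on top of the list that `google_grab()` returns. The helper and the domain names here are my own examples, not part of the original script:

```python
def position_of(domain, links):
    """Return the 1-based position of the first result URL containing
    the given domain, or None if the domain isn't in the list."""
    for i, url in enumerate(links, 1):
        if domain in url:
            return i
    return None

# Example with a hand-made result list:
results = ['http://other.example/',
           'http://www.halotis.com/',
           'http://third.example/']
print(position_of('halotis.com', results))  # 2
```

    Logging that position with a timestamp each day would give you the tracking data described above.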
    
  • Google Page Rank Python Script

This isn’t my script, but I thought it would appeal to readers of this blog. It’s a script that will look up the Google PageRank for any website, using the same interface as the Google Toolbar. I’d like to thank Fred Cirera for writing it; you can check out his blog post about this script here.

I’m not exactly sure what I would use this for, but it might have applications for anyone doing really advanced SEO work who wants a concrete way to approach PageRank sculpting, perhaps by finding the best websites to put links on.

The reason it is such an involved bit of math is that it needs to compute a checksum in order to work. It should be pretty reliable since it doesn’t involve any scraping.
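    As I read the `check_hash()` function in the script below, the checksum is essentially a Luhn-style check digit computed over the decimal digits of the hash. A stripped-down illustration of that core step (my own simplification, not anything Google documented):

```python
def luhn_style_sum(digits):
    """Sum digits right-to-left, doubling every second one and adding
    the digits of each doubled value -- the core of the toolbar's
    check-digit math (a simplified sketch)."""
    total = 0
    for flag, d in enumerate(reversed(digits)):
        if flag % 2 == 1:
            d *= 2
            d = d // 10 + d % 10
        total += d
    return total

print(luhn_style_sum([1, 2, 3, 4]))  # 4 + (3*2) + 2 + (1*2) = 14
```

    The full script then reduces this sum modulo 10 and folds in a parity tweak to produce the final check digit.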

    Example usage:

    $ python pagerank.py http://www.google.com/
    PageRank: 10	URL: http://www.google.com/
    
    $ python pagerank.py http://www.mozilla.org/
    PageRank: 9	URL: http://www.mozilla.org/
    
    $ python pagerank.py http://halotis.com
    PageRank: 3	URL: http://www.halotis.com/
    

    And the script:

    #!/usr/bin/env python
    #
    #  Script for getting Google Page Rank of page
    #  Google Toolbar 3.0.x/4.0.x Pagerank Checksum Algorithm
    #
    #  original from http://pagerank.gamesaga.net/
    #  this version was adapted from http://www.djangosnippets.org/snippets/221/
    #  by Corey Goldberg - 2010
    #
    #  Licensed under the MIT license: http://www.opensource.org/licenses/mit-license.php
    
    
    
    import sys
    import urllib
    
    
    def get_pagerank(url):
        hsh = check_hash(hash_url(url))
        gurl = 'http://www.google.com/search?client=navclient-auto&features=Rank:&q=info:%s&ch=%s' % (urllib.quote(url), hsh)
        try:
            f = urllib.urlopen(gurl)
            rank = f.read().strip()[9:]
        except Exception:
            rank = 'N/A'
        if rank == '':
            rank = '0'
        return rank
        
        
    def int_str(string, integer, factor):
        for c in string:
            integer *= factor
            integer &= 0xFFFFFFFF
            integer += ord(c)
        return integer
    
    
    def hash_url(string):
        c1 = int_str(string, 0x1505, 0x21)
        c2 = int_str(string, 0, 0x1003F)
    
        c1 >>= 2
        c1 = ((c1 >> 4) & 0x3FFFFC0) | (c1 & 0x3F)
        c1 = ((c1 >> 4) & 0x3FFC00) | (c1 & 0x3FF)
        c1 = ((c1 >> 4) & 0x3C000) | (c1 & 0x3FFF)
    
        t1 = (c1 & 0x3C0) << 4
        t1 |= c1 & 0x3C
        t1 = (t1 << 2) | (c2 & 0xF0F)
    
        t2 = (c1 & 0xFFFFC000) << 4
        t2 |= c1 & 0x3C00
        t2 = (t2 << 0xA) | (c2 & 0xF0F0000)
    
        return (t1 | t2)
    
    
    def check_hash(hash_int):
        hash_str = '%u' % (hash_int)
        flag = 0
        check_byte = 0
    
        i = len(hash_str) - 1
        while i >= 0:
            byte = int(hash_str[i])
            if 1 == (flag % 2):
                byte *= 2
                byte = byte // 10 + byte % 10
            check_byte += byte
            flag += 1
            i -= 1
    
        check_byte %= 10
        if 0 != check_byte:
            check_byte = 10 - check_byte
            if 1 == flag % 2:
                if 1 == check_byte % 2:
                    check_byte += 9
                check_byte >>= 1
    
        return '7' + str(check_byte) + hash_str
    
    
    
    if __name__ == '__main__':
        if len(sys.argv) != 2:
            url = 'http://www.google.com/'
        else:
            url = sys.argv[1]
    
        print 'PageRank: %s\tURL: %s' % (get_pagerank(url), url)