Tag: Python

Python Imap Gmail
Connecting to a Google Gmail account is easy with Python using the built in imaplib library. It’s possible to download, read, mark and delete messages in your gmail account by scripting it.

Here’s a very simple script that prints out the latest email received:
```
#!/usr/bin/env python

import imaplib
M=imaplib.IMAP4_SSL('imap.gmail.com', 993)
M.login('myemailaddress@gmail.com','password')
status, count = M.select('Inbox')
status, data = M.fetch(count[0], '(UID BODY[TEXT])')

print data[0][1]
M.close()
M.logout()
```
As you can see. Not a lot of code required to login and check and email. However, imaplib provides just a very thin layer on the imap protocol and you’ll have to refer to the documentation on how imap works and the commands available to really use imaplib. As you can see in the fetch command the “(UID BODY[TEXT])” bit is a raw imap instruction. In this case I’m calling fetch with the size of the Inbox folder because the most recent email is listed last (uid of most recent message is count) and telling it to return the body text of the email. There are many more complex ways to navigate an imap inbox. I recommend playing with it in the interpreter and connecting directly to the server with telnet to understand exactly what is happening.

Here’s a good resource for quickly getting up to speed with IMAP Accessing IMAP email accounts using telnet
September 23, 2010

Python Web Crawler Script

Here’s a simple web crawling script that will go from one url and find all the pages it links to up to a pre-defined depth. Web crawling is of course the lowest level tool used by Google to create its multi-billion dollar business. You may not be able to compete with Google’s search technology but being able to crawl your own sites, or that of your competitors can be very valuable.

You could for instance routinely check your websites to make sure that it is live and all the links are working. it could notify you of any 404 errors. By adding in a page rank check you could identify better linking strategies to boost your page rank scores. And you could identify possible leaks – paths a user could take that takes them away from where you want them to go.

Here’s the script:

# -*- coding: utf-8 -*-
from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
    def __init__(self, starting_url, depth, max_span):
        HTMLParser.__init__(self)
        self.url = starting_url
        self.db = {self.url: 1}
        self.node = [self.url]

        self.depth = depth # recursion depth max
        self.max_span = max_span # max links obtained per url
        self.links_found = 0

    def handle_starttag(self, tag, attrs):
        if self.links_found < self.max_span and tag == 'a' and attrs:
            link = attrs[0][1]
            if link[:4] != "http":
                link = '/'.join(self.url.split('/')[:3])+('/'+link).replace('//','/')

            if link not in self.db:
                print "new link ---> %s" % link
                self.links_found += 1
                self.node.append(link)
            self.db[link] = (self.db.get(link) or 0) + 1

    def crawl(self):
        for depth in xrange(self.depth):
            print "*"*70+("\nScanning depth %d web\n" % (depth+1))+"*"*70
            context_node = self.node[:]
            self.node = []
            for self.url in context_node:
                self.links_found = 0
                try:
                    req = urlopen(self.url)
                    res = req.read()
                    self.feed(res)
                except:
                    self.reset()
        print "*"*40 + "\nRESULTS\n" + "*"*40
        zorted = [(v,k) for (k,v) in self.db.items()]
        zorted.sort(reverse = True)
        return zorted

if __name__ == "__main__":
    spidey = Spider(starting_url = 'http://www.7cerebros.com.ar', depth = 5, max_span = 10)
    result = spidey.crawl()
    for (n,link) in result:
        print "%s was found %d time%s." %(link,n, "s" if n is not 1 else "")

September 16, 2009

Amazon Product Advertising API From Python

Amazon has a very comprehensive associate program that allows you to promote just about anything imaginable for any niche and earn commission for anything you refer. The size of the catalog is what makes Amazon such a great program. People make some good money promoting Amazon products.

There is a great Python library out there for accessing the other Amazon web services such as S3, and EC2 called boto. However it doesn’t support the Product Advertising API.

With the Product Advertising API you have access to everything that you can read on the Amazon site about each product. This includes the product description, images, editor reviews, customer reviews and ratings. This is a lot of great information that you could easily find a good use for with your websites.

So how do you get at this information from within a Python program? Well the complicated part is dealing with the authentication that Amazon has put in place. To make that a bit easier I used the connection component from boto.

Here’s a demonstration snippet of code that will print out the top 10 best selling books on Amazon right now.

Example Usage:

$ python AmazonSample.py
Glenn Becks Common Sense: The Case Against an Out-of-Control Government, Inspired by Thomas Paine by Glenn Beck
Culture of Corruption: Obama and His Team of Tax Cheats, Crooks, and Cronies by Michelle Malkin
The Angel Experiment (Maximum Ride, Book 1) by James Patterson
The Time Travelers Wife by Audrey Niffenegger
The Help by Kathryn Stockett
South of Broad by Pat Conroy
Paranoia by Joseph Finder
The Girl Who Played with Fire by Stieg Larsson
The Shack [With Headphones] (Playaway Adult Nonfiction) by William P. Young
The Girl with the Dragon Tattoo by Stieg Larsson

To use this code you’ll need an Amazon associate account and fill out the keys and tag needed for authentication.

Product Advertising API Python code:

#!/usr/bin/env python
# encoding: utf-8
"""
AmazonExample.py

Created by Matt Warren on 2009-08-17.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""

import urllib
try:
    from xml.etree import ET
except ImportError:
    from elementtree import ET
    
from boto.connection import AWSQueryConnection

AWS_ACCESS_KEY_ID = 'YOUR ACCESS KEY'
AWS_ASSOCIATE_TAG = 'YOUR TAG'
AWS_SECRET_ACCESS_KEY = 'YOUR SECRET KEY'

def amazon_top_for_category(browseNodeId):
    aws_conn = AWSQueryConnection(
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY, is_secure=False,
        host='ecs.amazonaws.com')
    aws_conn.SignatureVersion = '2'
    params = dict(
        Service='AWSECommerceService',
        Version='2009-07-01',
        SignatureVersion=aws_conn.SignatureVersion,
        AWSAccessKeyId=AWS_ACCESS_KEY_ID,
        AssociateTag=AWS_ASSOCIATE_TAG,
        Operation='ItemSearch',
        BrowseNode=browseNodeId,
        SearchIndex='Books',
        ResponseGroup='ItemAttributes,EditorialReview',
        Order='salesrank',
        Timestamp=time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()))
    verb = 'GET'
    path = '/onca/xml'
    qs, signature = aws_conn.get_signature(params, verb, path)
    qs = path + '?' + qs + '&Signature=' + urllib.quote(signature)
    response = aws_conn._mexe(verb, qs, None, headers={})
    tree = ET.fromstring(response.read())
    
    NS = tree.tag.split('}')[0][1:]

    for item in tree.find('{%s}Items'%NS).findall('{%s}Item'%NS):
        title = item.find('{%s}ItemAttributes'%NS).find('{%s}Title'%NS).text
        author = item.find('{%s}ItemAttributes'%NS).find('{%s}Author'%NS).text
        print title, 'by', author

if __name__ == '__main__':
    amazon_top_for_category(1000) #Amazon category number for US Books

August 17, 2009

Get Your ClickBank Transactions Into Sqlite With Python

Clickbank is an amazing service that allows anyone to easily to either as a publisher create and sell information products or as an advertiser sell other peoples products for a commission. Clickbank handles the credit card transactions, and refunds while affiliates can earn as much as 90% of the price of the products as commission. It’s a pretty easy to use system and I have used it both as a publisher and as an affiliate to make significant amounts of money online.

The script I have today is a Python program that uses Clickbank’s REST API to download the latest transactions for your affiliate IDs and stuffs the data into a database.

The reason for doing this is that it keeps the data in your control and allows you to more easily see all of the transactions for all your accounts in one place without having to go to clickbank.com and log in to your accounts constantly. I’m going to be including this data in my Business Intelligence Dashboard Application

One of the new things I did while writing this script was made use of SQLAlchemy to abstract the database. This means that it should be trivial to convert it over to use MySQL – just change the connection string.

Also you should note that to use this script you’ll need to get the “Clerk API Key” and the “Developer API Key” from your Clickbank account. To generate those keys go to the Account Settings tab from the account dashboard. If you have more than one affiliate ID then you’ll need one Clerk API Key per affiliate ID.

This is the biggest script I have shared on this site yet. I hope someone finds it useful.

Here’s the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# https://www.mattwarren.co/

import csv
import httplib
import logging

from sqlalchemy import Table, Column, Integer, String, MetaData, Date, DateTime, Float
from sqlalchemy.schema import UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

LOG_FILENAME = 'ClickbankLoader.log'
logging.basicConfig(filename=LOG_FILENAME,level=logging.DEBUG,filemode='w')

#generate these keys in the Account Settings area of ClickBank when you log in.
ACCOUNTS = [{'account':'YOUR_AFFILIATE_ID',  'API_key': 'YOUR_API_KEY' },]
DEV_API_KEY = 'YOUR_DEV_KEY'

CONNSTRING='sqlite:///clickbank_stats.sqlite'

Base = declarative_base()
class ClickBankList(Base):
    __tablename__ = 'clickbanklist'
    __table_args__ = (UniqueConstraint('date','receipt','item'),{})

    id                 = Column(Integer, primary_key=True)
    account            = Column(String)
    processedPayments  = Column(Integer)
    status             = Column(String)
    futurePayments     = Column(Integer)
    firstName          = Column(String)
    state              = Column(String)
    promo              = Column(String)
    country            = Column(String)
    receipt            = Column(String)
    pmtType            = Column(String)
    site               = Column(String)
    currency           = Column(String)
    item               = Column(String)
    amount             = Column(Float)
    txnType            = Column(String)
    affi               = Column(String)
    lastName           = Column(String)
    date               = Column(DateTime)
    rebillAmount       = Column(Float)
    nextPaymentDate    = Column(DateTime)
    email              = Column(String)
    
    format = '%Y-%m-%dT%H:%M:%S'
    
    def __init__(self, account, processedPayments, status, futurePayments, firstName, state, promo, country, receipt, pmtType, site, currency, item, amount , txnType, affi, lastName, date, rebillAmount, nextPaymentDate, email):
        self.account            = account
        if processedPayments != '':
        	self.processedPayments  = processedPayments
        self.status             = status
        if futurePayments != '':
            self.futurePayments     = futurePayments
        self.firstName          = firstName
        self.state              = state
        self.promo              = promo
        self.country            = country
        self.receipt            = receipt
        self.pmtType            = pmtType
        self.site               = site
        self.currency           = currency
        self.item               = item
        if amount != '':
        	self.amount             = amount 
        self.txnType            = txnType
        self.affi               = affi
        self.lastName           = lastName
        self.date               = datetime.strptime(date[:19], self.format)
        if rebillAmount != '':
        	self.rebillAmount       = rebillAmount
        if nextPaymentDate != '':
        	self.nextPaymentDate    = datetime.strptime(nextPaymentDate[:19], self.format)
        self.email              = email

    def __repr__(self):
        return "" % (self.account, self.date, self.receipt, self.item)

def get_clickbank_list(API_key, DEV_key):
    conn = httplib.HTTPSConnection('api.clickbank.com')
    conn.putrequest('GET', '/rest/1.0/orders/list')
    conn.putheader("Accept", 'text/csv')
    conn.putheader("Authorization", DEV_key+':'+API_key)
    conn.endheaders()
    response = conn.getresponse()
    
    if response.status != 200:
        logging.error('HTTP error %s' % response)
        raise Exception(response)
    
    csv_data = response.read()
    
    return csv_data

def load_clickbanklist(csv_data, account, dbconnection=CONNSTRING, echo=False):
    engine = create_engine(dbconnection, echo=echo)

    metadata = Base.metadata
    metadata.create_all(engine) 

    Session = sessionmaker(bind=engine)
    session = Session()

    data = csv.DictReader(iter(csv_data.split('\n')))

    for d in data:
        item = ClickBankList(account, **d)
        #check for duplicates before inserting
        checkitem = session.query(ClickBankList).filter_by(date=item.date, receipt=item.receipt, item=item.item).all()
    
        if not checkitem:
            logging.info('inserting new transaction %s' % item)
            session.add(item)

    session.commit()
    
if  __name__=='__main__':
    try:
        for account in ACCOUNTS:
            csv_data = get_clickbank_list(account['API_key'], DEV_API_KEY)
            load_clickbanklist(csv_data, account['account'])
    except:
        logging.exception('Crashed')

August 14, 2009

Scrape Google Search Results Page

Here’s a short script that will scrape the first 100 listings in the Google Organic results.

You might want to use this to find the position of your sites and track their position for certain target keyword phrases over time. That could be a very good way to determine, for example, if your SEO efforts are working. Or you could use the list of URLs as a starting point for some other web crawling activity

As the script is written it will just dump the list of URLs to a txt file.

It uses the BeautifulSoup library to help with parsing the HTML page.

Example Usage:

$ python GoogleScrape.py $ cat links.txt

Home

https://www.mattwarren.co/2009/07/01/rss-twitter-bot-in-python/ http://www.blogcatalog.com/blogs/halotis.html http://www.blogcatalog.com/topic/sqlite/ http://ieeexplore.ieee.org/iel5/10358/32956/01543043.pdf?arnumber=1543043 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1543043 http://doi.ieeecomputersociety.org/10.1109/DATE.2001.915065 http://rapidlibrary.com/index.php?q=hal+otis http://www.tagza.com/Software/Video_tutorial_-_URL_re-directing_software-___HalOtis/ http://portal.acm.org/citation.cfm?id=367328 http://ag.arizona.edu/herbarium/db/get_taxon.php?id=20605&show_desc=1 http://www.plantsystematics.org/taxpage/0/genus/Halotis.html http://www.mattwarren.name/ http://www.mattwarren.name/2009/07/31/net-worth-update-3-5/ http://newweightlossdiet.com/privacy.php http://www.ingentaconnect.com/content/nisc/sajms/1988/00000006/00000001/art00002?crawler=true http://www.ingentaconnect.com/content/nisc/sajms/2000/00000022/00000001/art00013?crawler=true

Click to access etm69yghjva13xlh.pdf

Click to access b7fytc095bc57x59.pdf

...... $

Here’s the script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# https://www.mattwarren.co/

import urllib,urllib2

from BeautifulSoup import BeautifulSoup

def google_grab(query):

    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
    urlfile = urllib2.urlopen(request)
    page = urlfile.read(200000)
    urlfile.close()
    
    soup = BeautifulSoup(page)
    links =   [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]
    
    return links

if __name__=='__main__':
    # Example: Search written to file
    links = google_grab('halotis')
    open("links.txt","w+b").write("\n".join(links))

August 8, 2009

Google Page Rank Python Script

This isn’t my script but I thought it would appeal to the reader of this blog.Â It’s a script thatÂ will lookup the Google Page Rank for any website and uses the same interface as the Google Toolbar to do it. I’d like to thank Fred Cirera for writing it and you can checkout his blog about this script here.

I’m not exactly sure what I would use this for but it might have applications for anyone who wants to do some really advanced SEO work and find a real way to accomplish Page Rank sculpting. Perhaps finding the best websites to put links on.

The reason it is such an involved bit of math is that it need to compute a checksum in order to work. It should be pretty reliable since it doesn’t involve and scraping.

Example usage:

$ python pagerank.py http://www.google.com/
PageRank: 10	URL: http://www.google.com/

$ python pagerank.py http://www.mozilla.org/
PageRank: 9	URL: http://www.mozilla.org/

$ python pagerank.py https://www.mattwarren.co/
PageRange: 3   URL: https://www.mattwarren.co/

And the script:

#!/usr/bin/env python
#
#  Script for getting Google Page Rank of page
#  Google Toolbar 3.0.x/4.0.x Pagerank Checksum Algorithm
#
#  original from http://pagerank.gamesaga.net/
#  this version was adapted from http://www.djangosnippets.org/snippets/221/
#  by Corey Goldberg - 2010
#
#  Licensed under the MIT license: http://www.opensource.org/licenses/mit-license.php



import urllib


def get_pagerank(url):
    hsh = check_hash(hash_url(url))
    gurl = 'http://www.google.com/search?client=navclient-auto&features=Rank:&q=info:%s&ch=%s' % (urllib.quote(url), hsh)
    try:
        f = urllib.urlopen(gurl)
        rank = f.read().strip()[9:]
    except Exception:
        rank = 'N/A'
    if rank == '':
        rank = '0'
    return rank
    
    
def  int_str(string, integer, factor):
    for i in range(len(string)) :
        integer *= factor
        integer &= 0xFFFFFFFF
        integer += ord(string[i])
    return integer


def hash_url(string):
    c1 = int_str(string, 0x1505, 0x21)
    c2 = int_str(string, 0, 0x1003F)

    c1 >>= 2
    c1 = ((c1 >> 4) & 0x3FFFFC0) | (c1 & 0x3F)
    c1 = ((c1 >> 4) & 0x3FFC00) | (c1 & 0x3FF)
    c1 = ((c1 >> 4) & 0x3C000) | (c1 & 0x3FFF)

    t1 = (c1 & 0x3C0) < < 4
    t1 |= c1 & 0x3C
    t1 = (t1 << 2) | (c2 & 0xF0F)

    t2 = (c1 & 0xFFFFC000) << 4
    t2 |= c1 & 0x3C00
    t2 = (t2 << 0xA) | (c2 & 0xF0F0000)

    return (t1 | t2)


def check_hash(hash_int):
    hash_str = '%u' % (hash_int)
    flag = 0
    check_byte = 0

    i = len(hash_str) - 1
    while i >= 0:
        byte = int(hash_str[i])
        if 1 == (flag % 2):
            byte *= 2;
            byte = byte / 10 + byte % 10
        check_byte += byte
        flag += 1
        i -= 1

    check_byte %= 10
    if 0 != check_byte:
        check_byte = 10 - check_byte
        if 1 == flag % 2:
            if 1 == check_byte % 2:
                check_byte += 9
            check_byte >>= 1

    return '7' + str(check_byte) + hash_str



if __name__ == '__main__':
    if len(sys.argv) != 2:
        url = 'http://www.google.com/'
    else:
        url = sys.argv[1]

    print get_pagerank(url)

August 2, 2009

Targeting Twitter Trends Script
I noticed that several accounts are spamming the twitter trends. Go to twitter.com and select one of the trends in the right column. You’ll undoubtedly see some tweets that are blatantly inserting words from the trending topics list into unrelated ads.

I was curious just how easy it would be to get the trending topics to target them with tweets. Turns out it is amazingly simple and shows off some of the beauty of Python.

This script doesn’t actually do anything with the trend information. It just simply downloads and prints out the list. But combine this code with the sample code from
RSS Twitter Bot in Python and you’ll have a recipe for some seriously powerful promotion.
```
import simplejson  # http://undefined.org/python/#simplejson
import urllib

result = simplejson.load(urllib.urlopen('http://search.twitter.com/trends.json'))

print [trend['name'] for trend in result['trends']]
```
July 25, 2009