Friday, 28 February 2014

Recovering from HTTP errors using URL Handlers

This article shows how URL handlers, as defined by urllib2, can be employed in practice to work around problems we commonly run into when writing robots that collect information from the Internet.

First things first (and usually a source of confusion): there are two sister libraries in Python which address retrieving information from URLs: urllib and urllib2. Conceptually, urllib2 behaves like a derived class of urllib. Just conceptually, because the actual implementation does not employ classes as a conventional object-oriented paradigm would dictate.

If you are seeking detailed documentation about these libraries, I'm afraid your only option is to spend a couple of hours studying the source code of urllib.py and urllib2.py.

Setting a User Agent


OK. Now that you have the full documentation at hand, we can start. The first thing our robot needs to do is hide its presence from the server side. One simple measure is to employ an innocent-looking user agent. We need to define a class derived from urllib2.BaseHandler which is responsible for setting the user agent before a request is sent to the server. This is shown below:

import urllib2

class UserAgentProcessor(urllib2.BaseHandler):
    """A handler to add a custom UA string to urllib2 requests
    """
    def __init__(self, uastring):
        self.handler_order = 100
        self.ua = uastring

    def http_request(self, request):
        request.add_header("User-Agent", self.ua)
        return request

    https_request = http_request


(credits: This code was shamelessly copied from this article by Andrew Rowls)
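Before moving on, here is a minimal usage sketch of this handler (the user agent string and the URL are just examples):

import urllib2

# build an opener which stamps our custom User-Agent onto every request
opener = urllib2.build_opener(
    UserAgentProcessor('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) '
                       'Gecko/20100101 Firefox/24.0'))

# either use the opener directly...
f = opener.open('http://example.com/')

# ...or install it globally, so that plain urllib2.urlopen() uses it too
urllib2.install_opener(opener)
f = urllib2.urlopen('http://example.com/')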

Handling HTTP ERROR 404 (Not Found)


There are other things we need to do, such as throttling our requests; otherwise the server side will easily guess that there's a robot on our end sending dozens of requests per second. But throttling is a subject I'm not going to cover here. You can create your own throttling handler later, once you are better acquainted with the techniques covered in this article.

Some webservers are really busy, which may cause our requests to fail. Other webservers deliberately reject requests under certain circumstances: for example, the server side may detect that we are sending dozens of requests per second and decide to punish us for 10 minutes. Again we are back to the subject of throttling, which we are not going to cover here. But let's address this sort of issue partially, in a way that should be of practical use in the majority of situations.

Let's say the webserver occasionally (or even regularly) responds with HTTP error 404 (Not Found), even when the resource actually exists. We just need to be a little skeptical and send another request after waiting a couple of seconds. Sometimes we need to be far more skeptical (or a little stubborn, if you will) and send several additional requests before we become sure enough that the resource truly does not exist.

What we basically need to do is stamp requests so that we have a means to determine whether a request needs to be sent again to the server side, possibly after waiting some time. Also, requests to different webservers may require different values for the number of retries and for the delay to be employed. See below how we implemented these things:


import urllib2

class HTTPNotFoundHandler(urllib2.BaseHandler):
    """A handler which retries access to resources when 404 (NotFound) is received
    """

    handler_order = 600 # before HTTPDigestAuthHandler and ProxyDigestAuthHandler

    def __init__(self, retries=5, delay=2):
        self.retries = int(retries)
        self.delay   = float(delay)
        assert(self.retries >= 1)
        assert(self.delay >= 0.0)

    def http_request(self, req):
        if hasattr(req, 'headers') and 'Error_404' in req.headers:
            Error_404 = req.headers['Error_404']
            assert(int(Error_404['retries']) >= 1)
            assert(float(Error_404['delay']) >= 0.0)
        return req

    def http_error_404(self, req, fp, code, msg, headers):
        if hasattr(req, 'headers') and 'Error_404' in req.headers:
            Error_404 = req.headers['Error_404']
        else:
            Error_404 = dict()
            Error_404['delay']   = self.delay
            Error_404['retries'] = self.retries

        # normalise values, so that they also work when supplied via a request header
        count   = int(Error_404.get('count', 1))
        retries = int(Error_404['retries'])
        delay   = float(Error_404['delay'])
        if count >= retries:
            raise urllib2.HTTPError(req.get_full_url(),
                                    code,
                                    msg,
                                    headers,
                                    fp)
        else:
            # Don't close the fp until we are sure that
            # we won't use it with HTTPError.
            fp.read()
            fp.close()
            # sleep a little while
            from time import sleep
            sleep(delay)
            # send another request
            Error_404['count'] = count + 1
            req.add_header('Error_404', Error_404)
            return self.parent.open(req)

    https_error_404 = http_error_404


Now, let's add two utility functions:

def install_opener(opener=None):
    """Install the given opener (or a default one) globally and
    return the urllib2 module, ready to be used."""
    import urllib2
    if opener is None:
        urllib2.install_opener(build_opener())
    else:
        urllib2.install_opener(opener)
    return urllib2

def build_opener(user_agent='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0',
                 http_404_retries=3,
                 http_404_delay=2.0):
    return urllib2.build_opener(
        UserAgentProcessor(user_agent),
        HTTPNotFoundHandler(http_404_retries, http_404_delay) )


Just put all the code shown up to this point into a single file, say: api.py.
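Here is a rough usage sketch of the resulting module, assuming it can be imported simply as api (the URLs are placeholders):

import api

# install a global opener with our two handlers; install_opener()
# returns the urllib2 module, ready to be used
urllib2 = api.install_opener()

# a plain request: on a 404 it is retried with the default parameters
# (3 attempts, 2 seconds of delay between them)
f = urllib2.urlopen('http://example.com/some/resource')

# per-request retry parameters, via the special Error_404 stamp
req = urllib2.Request('http://example.com/flaky/resource',
                      headers={'Error_404': {'retries': 10, 'delay': 5.0}})
f = urllib2.urlopen(req)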

Test cases


Now, let's create some test cases for it, using pytest. The first step is to create the conftest.py file, as shown below:

from __future__ import print_function

from pytest import fixture

@fixture
def opener():
    from mypackage.api import api
    return api.build_opener()

@fixture
def urllib2(opener):
    from mypackage.api import api
    return api.install_opener(opener)


If you are not acquainted with pytest, a very brief explanation of the code above: we are defining functions opener and urllib2 which we will later use as parameters of other functions. In a nutshell, pytest replaces each such parameter with the result of calling the special function (marked with @fixture) of the same name.

Now, let's create a file for the test cases called test_urllib.py, as shown below:

import pytest

class TestOpeners(object):

    # note: the methods prefixed with "xtest_" are not collected by pytest
    # (only names starting with "test" are), so they are effectively disabled
    def xtest_build_opener(self, opener):
        pass

    def xtest_existing(self, urllib2):
        url = 'http://google.com'
        f = urllib2.urlopen(url)
        assert(f.code == 200)

    def xtest_existing_but_faulty(self, urllib2):
        url = 'http://biz.yahoo.com/p/'
        f = urllib2.urlopen(url)
        assert(f.code == 200)

    def xtest_non_existing(self, urllib2):
        from urllib2 import HTTPError
        url = 'http://google.com/this_url_does_not_exist'
        with pytest.raises(HTTPError):
            f = urllib2.urlopen(url)

    def test_non_existing_with_header(self, urllib2):
        from urllib2 import HTTPError
        url = 'http://google.com/this_url_does_not_exist'
        req = urllib2.Request(url, headers = {
            'Error_404'  : { 'retries': 5,
                             'delay'  : 2.0 }})
        with pytest.raises(HTTPError):
            f = urllib2.urlopen(req)

    def test_wrong_header_retries_1(self, urllib2):
        from urllib2 import HTTPError
        url = 'http://google.com'
        req = urllib2.Request(url, headers = {
            'Error_404' : { 'retries': 'rubbish',
                            'delay'  : 2.0 }})
        with pytest.raises(ValueError):
            f = urllib2.urlopen(req)

    def test_wrong_header_retries_2(self, urllib2):
        from urllib2 import HTTPError
        url = 'http://google.com/this_url_does_not_exist'
        req = urllib2.Request(url, headers = {
            'Error_404' : { 'retries': 0,
                            'delay'  : 2.0 }})
        with pytest.raises(AssertionError):
            f = urllib2.urlopen(req)


Conclusion

You can get better and more robust control of your requests, without even touching your application code, by installing a custom opener into urllib2.

Thursday, 13 February 2014

Strong type checking in Python

This article describes a Python decorator which combines documentation with type checking, helping Python developers gain a better understanding and control of their code, whilst allowing them to catch mistakes on the spot, as soon as they occur.

Having previously been a Java developer, but extradited to Python by my own choice, I sometimes feel some nostalgia for the old times, when the Java compiler used to tell me all sorts of stupidities I used to do.

In the Python world no one is stupid, obviously, except probably me, who often finds myself passing the wrong types of arguments by accident or by pure stupidity, in case you accept the hypothesis that there's any difference between the two situations.

When you are coding your own stuff, chances are that you know very well what is going on. In general, you have the entire bloody API alive and kicking inside your head. But when you are learning some third-party software, in particular large frameworks, chances are that your code is called by something you don't understand very well, which passes arguments to your code that you have no clue about.


Documentation

Documentation is a good way of sorting out this difficulty. Up-to-date documentation, in particular, is the sort of thing that makes me extremely happy whenever I have the chance to find it. My mood is being constantly crunched these days, if you understand what I mean.

Outdated documentation is not only useless but also undesirable. Possibly for this reason some (or many?) people prefer no documentation at all, since the absence of information is better than misinformation, they argue.

It's very difficult to keep documentation up to date, unless you are somehow forced to do so. Maybe at gunpoint?


Strong type checking

I'm not on a quest to convince anyone that strong type checking is good or useful or desirable. Like everything in life, there are pros and cons.

On the other hand, I'd like to present a couple of benefits which keep strong type checking on my wishlist:

* I'd like to have the ability to stop the application as soon as a wrong type is received by a function, or returned by a function to its caller. Stop early, and catch mistakes easily, immediately, on the spot.

* I'd like to identify and document the argument types being passed by frameworks to my code, easily, quickly, effectively, without having to turn the Internet upside down every time I want to learn what argument x is about.


Introducing sphinx_typesafe

Doing a bit of research, I found an interesting library called IcanHasTypeCheck (or ICHTC for short), which I ended up rewriting almost from scratch in its last revision, and which I've renamed to sphinx_typesafe.

Let me explain the idea:

In the docstring of a function or method, you employ Sphinx-style documentation patterns to declare the types associated with its arguments.

If your documentation is pristine, the number of arguments in the documentation matches the number of arguments in the function or method definition.

If your logic is pristine, the types of arguments you documented match the types of the arguments actually passed to the function or method at runtime, or returned by the function or method to its caller, at runtime.

You just need to add the @typesafe decorator before the function or method, and sphinx_typesafe checks whether the documentation matches the definition.

If you don't have a clue about the type of an argument, simply guess some unlikely type, say: None. Then run the application and sphinx_typesafe will interrupt its execution and report that the actual type does not match None. The next step, obviously, is to substitute None with the actual type.


Benefits

A small example tells more than several paragraphs.
Imagine that you see some code like this:

    import math
   
    def d(p1, p2):
        x = p1.x - p2.x
        y = p1.y - p2.y
        return math.sqrt(x*x + y*y)



Imagine that you had type information about it, like this:


    import math
    from sphinx_typesafe import typesafe

    @typesafe
    def d(p1, p2):
        """
        :type p1: shapes.Point
        :type p2: shapes.Point
        :rtype  : float
        """
        x = p1.x - p2.x
        y = p1.y - p2.y
        return math.sqrt(x*x + y*y)


Now you are able to understand what this code is about, quickly!
In particular, you are able to tell the domain of types this code is intended to operate on.

When you run this code, if this function receives a shapes.Square instead of a shapes.Point, it stops immediately. Notice that a shapes.Square may well have x and y attributes, which would make the function silently return wrong results. Imagine your test cases catching this situation!
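The shapes module itself is not shown in this article; a minimal sketch of what it could look like is given below, just to make the example above self-contained (the class and attribute names are assumptions):

    # shapes.py -- a hypothetical module, sketched only for this example
    class Point(object):
        def __init__(self, x, y):
            self.x = x
            self.y = y

    class Square(object):
        def __init__(self, x, y, side):
            # a Square may also expose x and y (its corner), which is
            # exactly why duck typing would let d() return nonsense
            self.x = x
            self.y = y
            self.side = side

With @typesafe in place, calling d(Point(0, 0), Square(3, 4, 2)) is rejected at call time, whereas the undecorated version would happily return 5.0.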

So, I hope I have demonstrated the two benefits I was interested in.



Missing Features


Polymorphism

Sometimes I would like to say that an argument can be either a file or a str. At the moment, I can say that the argument is of type types.NotImplementedType, meaning "any type". But I would like something more precise, like this:

    :type f: [file, str]
   
This is not difficult to implement, actually, but we are not there yet.


Non-intrusive

I would like to have a non-intrusive way to turn on type checking, and a very cheap way of turning it off, if possible without any code change.

Thinking more about the use cases, I guess that type checking is very useful when you are developing and, in particular, when you are running your test suite. You are probably not interested in having the overhead of type checking in production code which has, in theory, been exhaustively tested.

Long story short, I would like to integrate sphinx_typesafe with pytest, so that the decoration of functions and methods would happen automagically, without any code change.

If pytest found a docstring containing a Sphinx-style type specification, it would apply @typesafe to the function or method. That would be really nice! You could also run your code in production without type checking, since type checking was never turned on in the first place.

The idea seems great, but my ignorance of pytest internals and my limited time prevent me from going ahead. Maybe in the future!


Python3 support

The sources of sphinx_typesafe itself are ready for Python3, but sphinx_typesafe does not yet properly handle your sources written in Python3. It's not difficult to implement, actually: it's just a matter of adjusting one function, but we are not there yet. Maybe you feel compelled to contribute?


More Information

https://pypi.python.org/pypi/sphinx_typesafe


Credits

Thanks to Klaas for the inspiration and for his IcanHasTypeCheck (or ICHTC for short).