urlBorg: why build yet another URL shortening service?

urlborg - Panayotis @ 19:05

So, urlBorg has been rewritten in Python and is now hosted on Google AppEngine (make a note, the new address is urlborg.com).

But why build “yet another URL shortening service”, when it’s so easy to build one? Any web developer could build one in less than an hour, couldn’t they?

The truth is that building a URL shortening service is a trivial task. Building one that can scale is not. I designed urlBorg with one question in mind: “will it still work if it makes it to TechCrunch, or if CNN.com starts using it extensively?” Building such a service is not trivial, believe me. (I don’t know yet whether urlBorg will hold up either, but I think it will.)

But scale wasn’t my only motivation. I believe there’s a lot of room to add value to such a simple service. A quick look at the API will reveal some of my ideas: urlBorg goes beyond returning a short URL.

More details to come soon :-)

a suggestion for efficient and scalable counters in Datastore

code samples, urlborg - Panayotis @ 13:04

As I’ve mentioned before, I’m trying to migrate urlBorg to Google AppEngine. urlBorg needs to count many things, like clicks on a short URL, etc, so I really need a scalable and efficient way to implement counters. This is not as trivial as it sounds in the Google AppEngine environment.

This post is actually the result of a good discussion done here

Here is the code I’ve come up with.
An example usage would be as simple as adding a line like (where page_id is a unique string identifying each page)

Acc(page_id).inc()

in each one of your pages. Getting the total count is as simple as

Acc(page_id).val()

(Due to the way the total count is calculated, this may not give accurate results if you are in the middle of a traffic spike, but it’s good enough for web analytics usage)

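import datetime
import random

from google.appengine.ext import db


class AccVals(db.Model):
    # One row per counter instance; all rows sharing a cluster name add up to one counter.
    cluster = db.StringProperty(required=True)
    count = db.IntegerProperty(required=True)
    updated = db.DateTimeProperty(auto_now=True)
    rand = db.FloatProperty()


class Acc(object):
    def __init__(self, name, init=0):
        self.__sec = 0.1  # minimum age (in seconds) before an instance is reused
        self.__name = name
        self.__init = init

    def inc(self):
        def trans(key):
            # Runs inside a transaction, so the read-modify-write is atomic.
            obj = AccVals.get(key)
            obj.count += 1
            obj.put()
            self.__val = obj.count

        # Pick one instance of this counter at random.
        q = db.Query(AccVals).filter('cluster =', self.__name).filter('rand >', random.random()).get()
        if q:
            if datetime.datetime.now() - q.updated < datetime.timedelta(0, self.__sec):
                # Updated too recently: create a new instance instead of contending for this one.
                obj = AccVals(cluster=self.__name, count=self.__init, rand=random.random())
                key = obj.put()
            else:
                key = q.key()
        else:
            # First instance of this counter; rand=1.0 means the query above will always match it.
            obj = AccVals(cluster=self.__name, count=self.__init, rand=1.0)
            key = obj.put()

        db.run_in_transaction(trans, key)
        # Note: this is the count of the instance that was updated, not the total.
        return self.__val

    def val(self):
        # The total is the sum over all instances of this counter.
        total = 0
        q = AccVals.all()
        q.filter('cluster =', self.__name)
        for r in q:
            total += r.count
        return total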

It behaves relatively well and looks like it can scale no matter how
much traffic or how many traffic spikes you have.

If you look into it, you will see that a “counter instance” is chosen
at random. You may be tempted to use the instance that was updated
longest ago ( order('updated').get() ), but it turns out that
when you have a traffic spike (or a spike in whatever it is your
counters count) the indexes are not updated soon enough, and the query
returns the records that were updated most recently :-) Selecting a
random instance seems to cost little in low traffic and works much
better in high traffic. I’ve also seen that after a while you end up
with roughly the number of counter instances required to handle the
traffic of the specific counter with few transaction collisions.

There is one interesting point: the value of self.__sec. I set it to
0.1 seconds, but this is just a value that looked good after some
tests. I have the impression that this value is *related* to some kind
of “global AppEngine constant”, measuring the time it takes for a
transaction to complete and safely propagate to the rest of the
infrastructure. I guess this varies, depending on the resource
allocation done for a specific app. Could someone from the AppEngine
development team give us some insight on this?

As I’ve mentioned before, I’m a Python newbie, so use the code above
at your own risk :-)

Please post your comments here, so that they are all in one place.

unique integer IDs in Google datastore

code samples - Panayotis @ 09:04

update: A good discussion on the topics mentioned in this article can be found here, please read it before using the code :-)

newbie code ahead! Use at your own risk :-)

One of the first problems I faced when trying to build an application in Google AppEngine was the lack of something like a “unique, auto_increment” column type in the datastore. How do I maintain a unique numeric id in a way that is guaranteed to work even under heavy use and concurrent requests?

Here is some code I came up with that seems to work. I’m a Python newbie, so please don’t hesitate to point out any mistakes!

What’s more, I’m just going through the Google AppEngine quirks, so I’m not aware of how to optimize the code or of any performance considerations implied by it. Once again, any comments are more than welcome!

from google.appengine.ext import db


class Idx(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(required=True)


class Counter(object):
    """Unique counters for Google Datastore.
    Usage: c = Counter('hits').inc() will increase the counter 'hits' by 1 and return the new value.
    When your application is run for the first time, you should call the create(start_value) method."""

    def __init__(self, name):
        self.__name = name
        res = db.GqlQuery("SELECT * FROM Idx WHERE name = :1 LIMIT 1", self.__name).fetch(1)
        if len(res) == 0:
            self.__status = 0
        else:
            self.__status = 1
            self.__key = res[0].key()

    def create(self, start_value=0):
        """This method is NOT 'thread safe'. Even though some testing is done,
        the developer is responsible for making sure it is only called once for each counter.
        This should not be a problem, since it should only be used during application installation.
        """
        res = db.GqlQuery("SELECT * FROM Idx WHERE name = :1 LIMIT 1", self.__name).fetch(1)
        if len(res) == 0:
            C = Idx(name=self.__name, count=start_value)
            self.__key = C.put()
            self.__status = 1
        else:
            raise ValueError('Counter: ' + self.__name + ' already exists')

    def get(self):
        self.__check_sanity__()
        return db.get(self.__key).count

    def inc(self):
        self.__check_sanity__()
        # Return the value read inside the transaction. A separate get() afterwards
        # could see another request's increment and hand out the same id twice.
        return db.run_in_transaction(self.__inc1__)

    def __check_sanity__(self):
        if self.__status == 0:
            raise ValueError('Counter: ' + self.__name + ' does not exist in Idx')

    def __inc1__(self):
        # Runs inside the transaction: read, increment, write, report the new value.
        obj = db.get(self.__key)
        obj.count += 1
        obj.put()
        return obj.count

Suppose you have a Product class that looks like this

class Product(db.Model):
        Serial_ID = db.IntegerProperty(required=True)
        Name = db.TextProperty(required=True)

You should have an “installation page” that is only called once during your application installation and does something like this to create the counter Product_Serial_ID with initial value 0.

Counter('Product_Serial_ID').create(0)

Calling the above code a second time will raise an exception, but concurrent calls may have unexpected results.

Inserting a new product in the datastore:

P = Product(Serial_ID=Counter('Product_Serial_ID').inc(), Name='Product Name')
P.put()

Please note that if put() fails, the next time you try to insert the product you will get a new Product_Serial_ID. But at least you can be sure it’s unique and incremental :-)
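The uniqueness guarantee depends on where the new value is read. The sketch below is plain Python rather than Datastore code: a threading.Lock stands in for the transaction, and IdAllocator and allocate_many are hypothetical names used only for illustration. It suggests that when the incremented value is captured inside the critical section, concurrent callers never receive the same id.

```python
import threading


class IdAllocator(object):
    """In-memory stand-in for a transactional counter (hypothetical)."""

    def __init__(self, start=0):
        self._lock = threading.Lock()  # stands in for the Datastore transaction
        self._count = start

    def next_id(self):
        with self._lock:
            self._count += 1
            # Captured inside the critical section, so no other caller
            # can change it between the increment and the read.
            return self._count


def allocate_many(allocator, workers=8, per_worker=500):
    """Request ids from several threads at once and collect them all."""
    results = []
    results_lock = threading.Lock()

    def work():
        local = [allocator.next_id() for _ in range(per_worker)]
        with results_lock:
            results.extend(local)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every id handed out by allocate_many is distinct. If a caller drops an id (for example because a later put() fails), the sequence simply has a gap; the remaining ids are still unique and increasing.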

AppEngine Datastore limitations

urlborg - Panayotis @ 00:04

I’ve been trying to decide if moving urlBorg to Google App Engine is a good idea. The pros are obvious: scalability. There are many features I’ve wanted to implement for urlBorg but never did because I’m afraid that if it turns into a hit, my server will go down.

I mean, creating short URLs is a trivial thing. If you want to make a service that stands out, it has to take care of the little details much better than the rest. And you have to be sure that your service will be able to scale.

So, moving urlBorg to Google App Engine should be a no brainer, right? Wrong.

My main issue is AppEngine Datastore.

The App Engine datastore is not a relational database. While the datastore interface has many of the same features of traditional databases, the datastore’s unique characteristics imply a different way of designing and managing data to take advantage of the ability to scale automatically.

So, forget about queries involving group functions like count(*), min(), max()… :-(
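One possible workaround, offered as an assumption rather than anything taken from the App Engine documentation, is to maintain the aggregates yourself at write time instead of computing them at query time. The sketch below is plain Python with no Datastore calls; RunningStats is a hypothetical name, and in a real application its fields would presumably live in an entity that is updated inside a transaction.

```python
class RunningStats(object):
    """Keeps count, min and max up to date as values are recorded."""

    def __init__(self):
        self.count = 0
        self.minimum = None
        self.maximum = None

    def record(self, value):
        # Update every aggregate on each write, so reading them later
        # needs no scan over the stored values.
        self.count += 1
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value


stats = RunningStats()
for clicks in (3, 1, 2):
    stats.record(clicks)
```

The trade-off is that each write becomes slightly more expensive and only the aggregates chosen in advance are available.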

I wish they had some good examples on how to use the AppEngine Datastore to do data mining. How should/would a “web analytics” application be implemented using AppEngine for example?

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2008 vrypan|net|log