a suggestion for efficient and scalable counters in Datastore

code samples, urlborg — Panayotis @ 13:04

As I’ve mentioned before, I’m trying to migrate urlBorg to Google AppEngine. urlBorg needs to count many things, such as clicks on a short URL, so I really need a scalable and efficient way to implement counters. This is not as trivial as it sounds in the Google AppEngine environment.

This post is actually the result of a good discussion held here.

Here is the code I’ve come up with.
An example usage would be as simple as adding a line like this to each one of your pages (where page_id is a unique string identifying each page):

Acc(page_id).inc()

Getting the total count is as simple as

Acc(page_id).val()

(Due to the way the total count is calculated, this may not give accurate results if you are in the middle of a traffic spike, but it’s good enough for web analytics usage)

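import datetime
import random

from google.appengine.ext import db


class AccVals(db.Model):
    """One "counter instance"; a counter's total is the sum over its cluster."""
    cluster = db.StringProperty(required=True)
    count = db.IntegerProperty(required=True)
    updated = db.DateTimeProperty(auto_now=True)
    rand = db.FloatProperty()


class Acc(object):
    def __init__(self, name, init=0):
        # An instance updated less than this many seconds ago is treated as busy.
        self.__sec = 0.1
        self.__name = name
        self.__init = init

    def inc(self):
        def trans(key):
            # Read, increment and write a single instance atomically.
            obj = AccVals.get(key)
            obj.count += 1
            obj.put()
            return obj.count

        # Pick one existing instance of this counter at random.
        q = db.Query(AccVals).filter('cluster =', self.__name).filter('rand >', random.random()).get()
        if q:
            if datetime.datetime.now() - q.updated < datetime.timedelta(0, self.__sec):
                # The instance was updated very recently: add a new one to avoid collisions.
                obj = AccVals(cluster=self.__name, count=self.__init, rand=random.random())
                key = obj.put()
            else:
                key = q.key()
        else:
            # Nothing matched (e.g. first call): create an instance every later query can match.
            obj = AccVals(cluster=self.__name, count=self.__init, rand=1.0)
            key = obj.put()

        # run_in_transaction returns whatever trans() returns.
        return db.run_in_transaction(trans, key)

    def val(self):
        # Sum over all instances of this counter.
        total = 0
        q = AccVals.all()
        q.filter('cluster =', self.__name)
        for r in q:
            total += r.count
        return total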

It behaves relatively well and looks like it can scale no matter how
much traffic or how many traffic spikes you have.

If you look into it, you will see that a “counter instance” is chosen
at random. You may be tempted to use the “instance” that was updated
longest ago ( order('updated').get() ), but it turns out that
when you have a traffic spike (or a spike in whatever it is your
counters count) the indexes are not updated soon enough, so this will
return records that were in fact just updated :-) It looks like
selecting a random instance is no big deal in low traffic and works
much better in high traffic. I’ve also seen that after a while, you end
up with the number of counter instances that are required to handle the
traffic of the specific counter with few transaction collisions.
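
To make the instance selection easier to follow, here is a minimal in-memory sketch of the same logic. It is only an illustration, not the datastore code: Shard and InMemoryAcc are hypothetical names, the current time is passed in explicitly as now so the behaviour is repeatable, and it assumes that the datastore returns matches for the 'rand >' filter in ascending order of rand.

```python
import random


class Shard(object):
    """Hypothetical in-memory stand-in for one AccVals entity."""

    def __init__(self, rand, count, updated):
        self.rand = rand
        self.count = count
        self.updated = updated


class InMemoryAcc(object):
    """Mimics the inc()/val() logic against a plain list instead of the datastore."""

    def __init__(self, sec=0.1, rng=None):
        self.sec = sec
        self.rng = rng or random.Random()
        self.shards = []

    def inc(self, now):
        threshold = self.rng.random()
        # Assumption: the query yields the smallest rand above the threshold.
        candidates = [s for s in self.shards if s.rand > threshold]
        shard = min(candidates, key=lambda s: s.rand) if candidates else None
        if shard is None:
            # Nothing matched: create the instance pinned at rand=1.0.
            shard = Shard(1.0, 0, now)
            self.shards.append(shard)
        elif now - shard.updated < self.sec:
            # The chosen instance is busy: spread the load over a new one.
            shard = Shard(self.rng.random(), 0, now)
            self.shards.append(shard)
        shard.count += 1
        shard.updated = now
        return shard.count

    def val(self):
        # The total is the sum over all instances.
        return sum(s.count for s in self.shards)


# Slow traffic: one hit per second never finds a busy instance.
slow = InMemoryAcc(sec=0.1, rng=random.Random(1))
for second in range(50):
    slow.inc(now=float(second))

# A burst: every hit arrives at the same instant.
burst = InMemoryAcc(sec=0.1, rng=random.Random(1))
for _ in range(50):
    burst.inc(now=0.0)
```

In this toy model the slow counter keeps a single instance while the burst keeps adding instances, and val() matches the number of hits in both cases. The real datastore behaves less neatly, since index updates lag behind writes.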

There is one interesting point: the value of self.__sec. I set it to
0.1 seconds, but this is just a value that looked good after some
tests. I have the impression that this value is *related* to some kind
of “global AppEngine constant”, measuring the time it takes for a
transaction to complete and safely propagate to the rest of the
infrastructure. I guess this varies, depending on the resource
allocation done for a specific app. Could someone from the AppEngine
development team give us some insight on this?

As I’ve mentioned before, I’m a Python newbie, so use the code above
at your own risk :-)

Please post your comments here, so that they are all in one place.

unique integer IDs in Google datastore

code samples — Panayotis @ 09:04

update: A good discussion on the topics mentioned in this article can be found here, please read it before using the code :-)

newbie code ahead! Use at your own risk :-)

One of the first problems I faced when trying to build an application in Google AppEngine was the lack of something like a “unique, auto_increment” column type in the datastore. How do I maintain a unique numeric id in a way that is guaranteed to work even under heavy use and concurrent requests?

Here is some code I came up with that seems to work. I’m a Python newbie, so please don’t hesitate to point out any mistakes!

What’s more, I’m just going through the Google AppEngine quirks, so I’m not aware of how to optimize the code or of any performance considerations implied by it. Once again, any comments are more than welcome!

from google.appengine.ext import db


class Idx(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(required=True)


class Counter(object):
    """Unique counters for Google Datastore.
    Usage: c = Counter('hits').inc() will increase the counter 'hits' by 1 and return the new value.
    When your application is run for the first time, you should call the create(start_value) method."""

    def __init__(self, name):
        self.__name = name
        # Look up the entity backing this counter, if it has been created.
        res = db.GqlQuery("SELECT * FROM Idx WHERE name = :1 LIMIT 1", self.__name).fetch(1)
        if len(res) == 0:
            self.__status = 0
        else:
            self.__status = 1
            self.__key = res[0].key()

    def create(self, start_value=0):
        """This method is NOT "thread safe". Even though some testing is done,
        the developer is responsible for making sure it is only called once for each counter.
        This should not be a problem, since it should only be used during application installation.
        """
        res = db.GqlQuery("SELECT * FROM Idx WHERE name = :1 LIMIT 1", self.__name).fetch(1)
        if len(res) == 0:
            C = Idx(name=self.__name, count=start_value)
            self.__key = C.put()
            self.__status = 1
        else:
            raise ValueError('Counter: ' + self.__name + ' already exists')

    def get(self):
        self.__check_sanity__()
        return db.get(self.__key).count

    def inc(self):
        self.__check_sanity__()
        # Return the value read inside the transaction. A separate get() afterwards
        # could see increments from concurrent requests and hand out duplicates.
        return db.run_in_transaction(self.__inc1__)

    def __check_sanity__(self):
        if self.__status == 0:
            raise ValueError('Counter: ' + self.__name + ' does not exist in Idx')

    def __inc1__(self):
        # Runs inside a transaction: read, increment, write, return the new value.
        obj = db.get(self.__key)
        obj.count += 1
        obj.put()
        return obj.count

Suppose you have a Product class that looks like this

class Product(db.Model):
        Serial_ID = db.IntegerProperty(required=True)
        Name = db.TextProperty(required=True)

You should have an “installation page” that is called only once, during your application installation, and does something like this to create the counter Product_Serial_ID with initial value 0.

Counter('Product_Serial_ID').create(0)

Calling the above code for a second time will raise an exception, but concurrent calls may have unexpected results.

Inserting a new product in the datastore:

P = Product(Serial_ID=Counter('Product_Serial_ID').inc(), Name='Product Name')
P.put()

Please note that if put() fails, the next time you try to insert the product you will get a new Product_Serial_ID. But at least you can be sure it’s unique and incremental :-)
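
The uniqueness claim depends on the increment and the read of the new value happening as one atomic step. Here is a minimal in-memory sketch of that idea, with a lock standing in for the datastore transaction. InMemoryCounter and allocate_ids are hypothetical names used only for this illustration, and threads are only a rough analogue of concurrent requests.

```python
import threading


class InMemoryCounter(object):
    """Hypothetical in-memory analogue of Counter; the lock plays the transaction."""

    def __init__(self, start_value=0):
        self._count = start_value
        self._lock = threading.Lock()

    def inc(self):
        # Increment and read inside the same critical section, so no other
        # caller can slip in between the update and the returned value.
        with self._lock:
            self._count += 1
            return self._count


def allocate_ids(counter, how_many, sink):
    """Ask the counter for how_many ids and record each one in sink."""
    for _ in range(how_many):
        sink.append(counter.inc())


counter = InMemoryCounter(0)
ids = []
threads = [
    threading.Thread(target=allocate_ids, args=(counter, 1000, ids))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every id in ids is handed out exactly once. If the value were read in a separate step after the increment, two callers could end up with the same number.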

AppEngine Datastore limitations

urlborg — Panayotis @ 00:04

I’ve been trying to decide if moving urlBorg to Google App Engine is a good idea. The pros are obvious: scalability. There are many features I’ve wanted to implement for urlBorg but never did because I’m afraid that if it turns into a hit, my server will go down.

I mean, creating short URLs is a trivial thing. If you want to make a service that stands out, it has to be that it takes care of the little details in a much better way than the rest. And you have to be sure that your service will be able to scale.

So, moving urlBorg to Google App Engine should be a no brainer, right? Wrong.

My main issue is AppEngine Datastore.

The App Engine datastore is not a relational database. While the datastore interface has many of the same features of traditional databases, the datastore’s unique characteristics imply a different way of designing and managing data to take advantage of the ability to scale automatically.

So, forget about queries involving group functions like count(*), min(), max()… :-(

I wish they had some good examples on how to use the AppEngine Datastore to do data mining. How should/would a “web analytics” application be implemented using AppEngine for example?
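
One possible workaround for the missing group functions is to keep the aggregates up to date at write time instead of computing them at query time. This is my own assumption, not something the documentation quoted above prescribes. A minimal in-memory sketch follows; RunningStats is a hypothetical helper, and in the datastore each add() would presumably have to run inside a transaction on a stored entity.

```python
class RunningStats(object):
    """Hypothetical helper: keeps count, min and max current as values arrive."""

    def __init__(self):
        self.count = 0
        self.minimum = None
        self.maximum = None

    def add(self, value):
        # Update every aggregate once per write, so reads never scan the data.
        self.count += 1
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value


stats = RunningStats()
for response_ms in [120, 85, 310, 97]:
    stats.add(response_ms)
```

The trade-off is that you have to decide in advance which aggregates you need, because anything not maintained at write time cannot be asked for later without reading every record.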

Google CSE for Wordpress plugin v0.2

Google CSE plugin for WP — Panayotis @ 20:11

I just released v0.2 of Google CSE for Wordpress.

The new version includes a widget that makes it very easy to integrate the search box in your sidebar.

Other changes include:
- cse_search_box_tags() that if used in single.php will pre-fill the search box with the post’s tags.
- a <link rel="cse"> tag advertising the CSE XML file is automatically inserted in page headers. Of no use so far, but who knows? :-)

Google Reader stopped reporting subscriber numbers?

misc — Panayotis @ 16:11

Suddenly, today I noticed that my Feedburner stats don’t include Google Feedfetcher subscribers!!!

Is it a Feedburner (now Google) problem or a Google Reader problem? Is it a hiccup, or a new policy?

UPDATE: others report the same problem too.

UPDATE #2: It was Google Feedfetcher’s fault :-)

advertising a CSE XML file?

Google CSE plugin for WP — Panayotis @ 21:10

Would you consider this
<link rel="cse" type="application/xml" href="http://vrypan.net/log/wp-content/plugins/google_cse/wp_cse.xml" title="Google CSE XML" />
to be the right way to advertise the existence of a Google CSE XML file? Any suggestions?

Google CSE v0.1 Wordpress plugin

Google CSE plugin for WP — Panayotis @ 17:09

This is version 0.1 alpha of “Google_CSE”, a WordPress plugin that creates a Google Custom Search Engine using a WordPress blog and its blogroll. (Read “my slice of the web” to get the idea behind it.)

After you install it (read the included readme.txt file!), use <?php cse_search_box(); ?> in your templates to display the search box. Please note that it may take a couple of minutes before Google updates its caches and your CSE starts working. The same may apply to any later changes, like adding or removing blogs from your blogroll.

As I said, this is still alpha. I’m looking forward to your comments.

Download google_cse v0.1.

Google CSE in blogger.com

blogging, search engines — Panayotis @ 20:09

Well, it turns out that my idea is old news :-)

Google Custom Search Blog explains how to add a CSE box in blogger.com.

I’m working on the WP plugin.

my slice of the web

search engines — Panayotis @ 18:09

It’s not a novel idea, and it has been in my mind for years: there is a special set of URLs that define what I call my “personal slice of the web”. These are the sites that form my blogroll, the feeds I’m subscribed to, the links I tag using del.icio.us, the URLs included in my browser history, etc.

For each one of us, this “personal slice of the web” is much more important and much more familiar than the rest of the web. I have always thought that we should have better tools to manage this “slice”. We should be able to view and visualize it better, search it, share it, etc.

During the last couple of days I have been fooling around with Google Custom Search Engine. A WordPress plugin that creates a CSE using your blogroll is almost ready; it just needs some polishing (you can see it in action here).

I have also written some code to create a Google CSE based on my del.icio.us links (a demo is here), but this needs more work: it’s just a couple of quick’n’dirty scripts.

Google Public Policy Blog

Uncategorized — Panayotis @ 18:07

Google Public Policy Blog: Google’s views on government, policy and politics, is good reading.

I especially like their “open broadband manifesto” (“manifesto” is my addition).

However, what I would expect from Google is some kind of commitment to user data accessibility. Are users free to move their data from service to service? Do they know exactly what kind of data a service is collecting and using? Should/Do users have the right to delete their own data? Who owns what? What is the equivalent of Open Source Software in Applications? What is the minimum functionality a Web API should expose to the world?

Come on Google, we expect you to be able to deal with difficult problems…

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2008 vrypan|net|log | powered by WordPress with Barecity