Standalone cache according to http requests

July 27th, 2008

Maybe this already exists, but since my searches didn’t turn up anything, I thought I’d post this.

Say you have an app coded more or less REST style. Every POST request implies a data change (-> the cache becomes stale); every GET request implies no change (-> the cache stays fresh). So if a POST was made to domain.com/admin/news, you know the news cache is stale. I won’t go really deep here: if you change item 8 of the news table, only two caches may actually be stale, the one that lists the news and the one that shows item 8 (ie. domain.com/news and domain.com/news/8 or domain.com/news/title-of-article), rather than every cache in the news group. But let’s keep it simple.

I would like to know if there’s anything out there that parses the Apache logs for write requests (POST/PUT/DELETE) and, according to a few configurable rules, automatically does a GET to the corresponding URL. For example, if a POST was made to domain.com/admin/news/8, it would, upon parsing the logs, do a GET request to domain.com/news/8, generating the cache before the next user comes along instead of making that user wait while a fresh cache is built. That would raise the cache hit ratio users actually see. It would of course run as a cron job.
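To make the idea concrete, here is a rough sketch in Python of what such a script might look like. It is not an existing tool: the rule table, the function names and the URL layout are all assumptions made up for illustration, and it only understands the request part of a common/combined format Apache log line.

```python
import re
from urllib.request import urlopen

# Matches the request part of an Apache access log line,
# e.g. "POST /admin/news/8 HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')

# Methods that imply a data change (-> stale cache)
WRITE_METHODS = {"POST", "PUT", "DELETE"}

# Hypothetical rules: (pattern for the written URL, public URLs to re-fetch).
# \1 in a template is replaced by the first captured group.
RULES = [
    (re.compile(r"^/admin/news/(\d+)$"), ["/news", r"/news/\1"]),
    (re.compile(r"^/admin/news$"), ["/news"]),
]


def urls_to_refresh(log_lines, rules=RULES):
    """Return the public paths whose cache went stale, in first-seen order."""
    stale = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match or match.group("method") not in WRITE_METHODS:
            continue
        path = match.group("path").split("?", 1)[0]  # ignore the query string
        for pattern, templates in rules:
            hit = pattern.match(path)
            if hit:
                for template in templates:
                    target = hit.expand(template)
                    if target not in stale:
                        stale.append(target)
                break  # first matching rule wins
    return stale


def warm(base_url, paths):
    """Issue a GET for each path so the app regenerates its cache."""
    for path in paths:
        with urlopen(base_url + path) as response:
            response.read()
```

A cron job would call `urls_to_refresh()` on the log lines written since its last run (it would have to remember where it stopped) and pass the result to `warm("http://domain.com", ...)`.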

I like this solution because it really keeps the caching code (if cache exists, expire cache, use cache, etc) outside the app, becoming simply another layer, where it really should be.

I really think this makes sense from a REST perspective, so I suspect it’s already out there.

So does anyone know of anything? Preferably in PHP, but Python or Ruby is OK too.

Thanx :=)

data visualization for everyone

July 14th, 2008

Just found out about Many Eyes, a wonderful project by IBM that allows anyone to create data visualizations without any programming knowledge whatsoever (the usual suspects being Flash/ActionScript and Processing).

In their own words:

Many Eyes is a bet on the power of human visual intelligence to find patterns. Our goal is to ‘democratize’ visualization and to enable a new social kind of data analysis.

On the usability side, the creators are to be congratulated on the simplicity of the whole interface. The user chooses a dataset - it is also possible to upload datasets - and then chooses a visualization type, eg. tagcloud, line graph, etc, previews, and publishes it! Really simple :=)

I actually created two visualizations, one on Obama’s speeches and another on Alice in Wonderland. Since the applets are interactive you can change the words and a new visualization will pop up.

note: may take a little while to load

Obama’s ‘We’

Alice in Wonderland - you, won’t

playing with alice’s playful dialogue

google’s keeping it simple

July 10th, 2008

It’s just sweet and revealing of google’s dedication to usability that google keeps a tab on the number of words the classic homepage has. If one goes in another goes out. So with an eye on privacy the current count is still 28 :=)

Just read it: 13,33,53..

htaccess - display cached html version

July 9th, 2008

There are several caching strategies in web application development:

  • database caching
  • fragment caching - parts of the rendered output are cached.
  • page caching - the whole output is cached. Useful when the page requires no authentication and has no personalization (eg. no Hello John if he’s logged in).

Page caching is the fastest method, since the webserver can serve the HTML page directly and the web app is bypassed entirely.

The idea is this: if a page is cached, the webserver detects that it exists in the cache folder and serves it. If the page doesn’t exist, the request continues to the web application, which renders the page and saves it to the cache.

So how can the webserver detect this? Well, mod_rewrite to the rescue :=)

Here’s the snippet that makes it all possible:
[code]

IndexIgnore *
DirectoryIndex index.php index.html
Options +FollowSymLinks

RewriteEngine On

#if the request is domain.com/about, it will check if domain.com/cachehtml/about.html exists, if so displays it
RewriteCond %{DOCUMENT_ROOT}/cachehtml/$1.html -f
RewriteRule ^([a-z_-]+)$ cachehtml/$1.html [NC,QSA,L]

#goes one level deep
#if the request is domain.com/blog/hello-world, it will check if domain.com/cachehtml/blog/hello-world.html exists, if so displays it
RewriteCond %{DOCUMENT_ROOT}/cachehtml/$1/$2.html -f
RewriteRule ^([a-z_-]+)/([a-z_-]+)$ cachehtml/$1/$2.html [NC,QSA,L]

[/code]

The code above assumes the site lives at the document root. If it lives in a subfolder, just add the folder name so that
RewriteCond %{DOCUMENT_ROOT}/cachehtml/$1.html -f
becomes
RewriteCond %{DOCUMENT_ROOT}/mysite/cachehtml/$1.html -f
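The other half of the setup is the web application writing the rendered page to the place the rewrite rules look in. As a rough sketch only, here is how that could look in Python; the function names are made up for this example, and the path check simply mirrors the one- and two-segment patterns used in the rules.

```python
import os
import re
import tempfile

# Same shape the rewrite rules accept: one or two segments of letters, _ and -
CACHEABLE = re.compile(r"^[a-z_-]+(/[a-z_-]+)?$", re.IGNORECASE)


def cache_file_for(cache_root, request_path):
    """Map a request path such as /blog/hello-world to its cache file, or None."""
    relative = request_path.strip("/")
    if not CACHEABLE.match(relative):
        return None  # the rewrite rules would never look for this page
    return os.path.join(cache_root, *relative.split("/")) + ".html"


def save_page(cache_root, request_path, html):
    """Write the rendered HTML where the webserver will look for it."""
    target = cache_file_for(cache_root, request_path)
    if target is None:
        return None
    folder = os.path.dirname(target)
    os.makedirs(folder, exist_ok=True)
    # Write to a temp file and rename it, so a half-written page is never served
    fd, tmp_name = tempfile.mkstemp(dir=folder)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(html)
    os.chmod(tmp_name, 0o644)  # mkstemp creates the file readable by its owner only
    os.replace(tmp_name, target)
    return target


def expire_page(cache_root, request_path):
    """Delete the cached copy so the next request reaches the app again."""
    target = cache_file_for(cache_root, request_path)
    if target is not None and os.path.exists(target):
        os.remove(target)
```

With `cache_root` pointing at the cachehtml folder, `save_page(cache_root, "/blog/hello-world", html)` writes cachehtml/blog/hello-world.html, and `expire_page()` removes it again when the data changes.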

http_load - another webserver performance tester

June 7th, 2008

Http_load is another cool webserver performance tester that gives simple stats on how your webapp is performing.

How to install in OS X

  1. Download from http://www.acme.com/software/http_load/
  2. Open terminal, cd to the directory where the archive is and unpack it
    $ tar xvzf http_load-12mar2006.tar.gz
  3. Move to that directory
    $ cd http_load-12mar2006
  4. Run
    $ make
  5. Run
    $ sudo make install

You’re ready! Open up a text editor, write down the URL of the website you want to test (preferably your own), one URL per line, and save it as a .txt file. Then cd to the directory where the .txt is and run
$ http_load -parallel 5 -fetches 100 name_of_file.txt
which means: open 5 concurrent connections and fetch the webpage 100 times.

You’ll get something like this:

100 fetches, 5 max parallel, 1.34237e+07 bytes, in 15.842 seconds
134237 mean bytes/connection
6.31234 fetches/sec, 847351 bytes/sec
msecs/connect: 28.9069 mean, 75.011 max, 14.865 min
msecs/first-response: 435.84 mean, 2484.28 max, 96.082 min
93 bad byte counts
HTTP response codes:
code 200 — 100

The important bits are the fetches/sec and msecs/first-response lines. At the moment the webserver handles about 6 requests per second, with a mean initial latency of about 436 milliseconds.
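If you run the test often, you may want to pull those two numbers out automatically. Here’s a small sketch in Python that does so for the report shown above; the function name and the dictionary keys are my own choice, and it assumes the output format stays exactly as in this example.

```python
import re

# The report from the run above, trimmed to the lines that matter here
SAMPLE = """\
100 fetches, 5 max parallel, 1.34237e+07 bytes, in 15.842 seconds
134237 mean bytes/connection
6.31234 fetches/sec, 847351 bytes/sec
msecs/connect: 28.9069 mean, 75.011 max, 14.865 min
msecs/first-response: 435.84 mean, 2484.28 max, 96.082 min
"""


def summarize(report):
    """Extract throughput and mean initial latency from an http_load report."""
    fetches = re.search(r"^([\d.eE+]+) fetches/sec", report, re.MULTILINE)
    latency = re.search(
        r"^msecs/first-response: ([\d.eE+]+) mean", report, re.MULTILINE
    )
    if not fetches or not latency:
        raise ValueError("this does not look like an http_load report")
    return {
        "fetches_per_sec": float(fetches.group(1)),
        "first_response_ms": float(latency.group(1)),
    }


summary = summarize(SAMPLE)
print(summary["fetches_per_sec"], "fetches/sec")
print(summary["first_response_ms"], "ms mean first response")
```

The fetches/sec figure is simply the number of fetches divided by the elapsed time: 100 / 15.842 is roughly 6.31.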

Http_load tells you how your webapp is currently performing, allowing you to test it under different conditions. Basically it’s a benchmarking tool, just like httperf, which I covered in another post. The next step is optimization. Have a look at the 1st part of Getting Rich with PHP 5 (what a crappy title) by Rasmus Lerdorf for tools you can use to profile your code and some tips on optimization. In the example shown he goes from 17 reqs/sec to 1100 reqs/sec.

measuring webserver performance - httperf

June 7th, 2008

Httperf is a webserver performance tester. There are loads of performance testers out there (take a look here) but I was up and running with httperf in no time. So here’s a quick getting-started guide:

  1. Download the latest version from ftp://ftp.hpl.hp.com/pub/httperf/
  2. Install
    • $ tar xvzf httperf-0.9.0.tar.gz
    • $ cd httperf-0.9.0
    • $ ./configure
    • $ make
    • $ sudo make install

    Httperf is installed by default in /usr/local/bin/httperf. You then invoke httperf from the command line.

  3. Have a website to test (lol)
  4. Here’s a sample command
    $ httperf --server hostname --port 80 --uri /test.html --rate 150 --num-conns 27000 --num-calls 1 --timeout 5
    Example: You have your site on localhost and for now just want to test that.

    • $ httperf --server localhost --uri /about.html --num-conns 1000
      - test the page about.html on the localhost server, making 1000 connections in total (one after another, since no --rate is given)
    • $ httperf --server=localhost --wsess=12,8,2 --rate=1 --timeout=5
      • The --wsess switch sets the total number of sessions to generate, the number of calls per session, and the time (in seconds) that separates consecutive calls. With --wsess=12,8,2 we’re setting 12 sessions at eight calls per session with two seconds between each call.
      • The --rate switch sets how many new connections are created per second (with one call per connection, that is also the request rate). [Update] When used together with --wsess it specifies the rate of sessions and not of requests -> see comment by John Wilkinson below
      • The --timeout switch sets the maximum number of seconds to wait for a server response before httperf gives up. The default is forever, so it’s good practice to set it just in case the server hangs (tying up your own resources as well). If this timeout expires, httperf considers the corresponding call to have failed.
      • The --num-conns switch sets how many HTTP connections will be made in total during the test run - this is a cumulative number, so the higher it is, the longer the test runs
  5. Analyze the statistics printed to the console.
    There are six groups of statistics: overall results, results pertaining to the TCP connections, results for the requests that were sent, results for the replies that were received, CPU and network utilization figures, as well as a summary of the errors that occurred.
    Example printout:
    Maximum connect burst length: 1
    Total: connections 100 requests 100 replies 100 test-duration 16.385 s

    Connection rate: 6.1 conn/s (163.8 ms/conn, <=1 concurrent connections)
    Connection time [ms]: min 135.5 avg 163.8 max 406.4 median 159.5 stddev 37.4
    Connection time [ms]: connect 19.0
    Connection length [replies/conn]: 1.000

    Request rate: 6.1 req/s (163.8 ms/req)
    Request size [B]: 64.0

    Reply rate [replies/s]: min 5.8 avg 6.1 max 6.2 stddev 0.2 (3 samples)
    Reply time [ms]: response 74.1 transfer 70.8
    Reply size [B]: header 514.0 content 15405.0 footer 1.0 (total 15920.0)
    Reply status: 1xx=0 2xx=100 3xx=0 4xx=0 5xx=0

    CPU time [s]: user 3.52 system 12.78 (user 21.5% system 78.0% total 99.5%)
    Net I/O: 95.3 KB/s (0.8*10^6 bps)

    Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
    Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

The connection rate, the request rate and the reply rate are the ones to look at. The better a website is coping with the requested load, the closer the connection and reply rates will be to the rate specified in the initial command (--rate). Normally you run a series of tests, increasing the rate each time, until the reply and connection rates no longer keep up - that’s when you’ve hit your limit, ie. how many requests per second your webapp is able to handle.
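Finding that limit by eye gets tedious after a few runs, so here’s a minimal sketch in Python of how the comparison could be automated. The function names and the 5% tolerance are assumptions of mine, not something httperf provides, and the numbers in the usage example are invented just to show the idea.

```python
import re

# Picks the "avg" figure out of httperf's "Reply rate [replies/s]" line
REPLY_RATE_RE = re.compile(
    r"^\s*Reply rate \[replies/s\]: min [\d.]+ avg ([\d.]+)", re.MULTILINE
)


def avg_reply_rate(httperf_output):
    """Return the average reply rate reported in one httperf run."""
    match = REPLY_RATE_RE.search(httperf_output)
    if match is None:
        raise ValueError("no 'Reply rate' line found")
    return float(match.group(1))


def saturation_point(results, tolerance=0.05):
    """Return the highest requested rate the server kept up with, or None.

    results is a list of (requested_rate, avg_reply_rate) pairs, one per run.
    A run counts as "keeping up" while the reply rate stays within
    `tolerance` (a fraction) of the requested rate.
    """
    sustained = None
    for requested, replied in sorted(results):
        if replied >= requested * (1 - tolerance):
            sustained = requested
        else:
            break  # the server fell behind; higher rates won't do better
    return sustained


# Invented numbers, for illustration only (not real measurements)
runs = [(50, 49.9), (100, 99.8), (150, 121.3), (200, 118.0)]
print(saturation_point(runs))
```

In practice you would fill `runs` by calling `avg_reply_rate()` on the output of each httperf run, one per --rate value.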

Also check autobench for automation of the testing process, here for an example of how httperf was used to benchmark the evolution of a project, an article from the source httperf—A Tool for Measuring Web Server Performance and finally this peepcode looks interesting.

Anyway, if i’ve missed any important information please say so in the comments.

[Update] Ted Bullock, one of the developers of httperf, was kind enough to point me to his quickstart guide, a six page long doc which has much more detailed information :=)

the illuminatti

June 1st, 2008

The illuminatti is an art project that explores the mesmerizing power everyday technology has on us. Technology, then, is the modern deity. In the words of its author, Evan Baden:

In Westernized cultures today, there is a generation that is growing up without the knowledge of what it is to be disconnected. The world in which we are growing up is always on. We are continuously plugged in, and linked up. We take this technology for granted. Not because we are ungrateful, but because we simply don’t know a world without it.

From our earliest memories, there has always been a way to connect with others, whether it is Myspace, Facebook, cell phones, e-mail, or instant messenger. And now, with the Internet, instant messaging, and e-mail in our pocket, right there with our phones, we can always feel as if we are part of a greater whole. These devices grace us with the ability to instantly connect to others, and at the same time, they isolate us from those with whom we are connected. They allow for great freedom, yet so often, we are chained to them. They have become part of who we are and how we identify ourselves. These devices ordain us with a wealth of knowledge and communication that would have been unbelievable a generation ago. More and more, we are bathed in a silent, soft, and heavenly blue glow. It is as if we carry divinity in our pockets and purses.

Links on scalability

May 28th, 2008

This post is simply a collection of articles I particularly liked on scalability:

Scalability Best practices: Lessons From Ebay
ChasingSparks: Sharding with Cookie-Based Sessions
The Hitchhikers Guide to PHP Load Balancing - Port:EightyEight
The fav.or.it Blog » Fixing Twitter
Hueniverse: Scaling a Microblogging Service - Part I

update

May 26th, 2008

I started this blog at blog.whythelongroads.com and ported the few posts there, along with their comments, to here. That blog will be going down in the near future.

Data portability and data access

April 6th, 2008

There’s a growing trend for the web to become a (the) platform consumed by web applications themselves. Users can use one login for different sites, sync information between different web applications, and import/export information to/from a site. All this is related to identity management and data portability and, on a second level, to digital communities/social networks.

Skype proposes that community building applications must have a defined set of areas/scenarios upon which they can interact to facilitate the desired interoperability between them. Here they are (skype journal):

Social Stack’s Six Zones of Interoperability

* ID (Account lifecycles, Login)
* Sync (Profile, Contacts, Objects)
* Permission (Policy, Licensing)
* Find (People Search, Discovery, Gatekeepers)
* Action (Group Actions, Relationship Actions)
* Now (Alerting, Presence)

The idea is that there can be one single sign-in (OpenID) and that there must be a standard way to sync information between applications. Eg. if I export my contacts/friends from facebook to hi5 and then remove a friend in facebook, does that friend get deleted in hi5? How about the other way round? There must also be a way to find people between apps. If I have a friend in facebook and I also have a myspace account, could myspace alert me that my friend is in the network as well? Should myspace do it? Maybe the friend wouldn’t like that, because he keeps his work colleagues in facebook and his closer friends in myspace. These are questions that must be solved for dataportability to become a reality. Also check Robert Scoble’s post on this topic.

Dataportability also raises the question of the unnecessary duplication of content around the web. If I have a blog in wordpress, is it really necessary for myspace to store my blog posts and sync them with wordpress? Isn’t that just making things difficult? Two servers now hold the same data and must keep it in sync. And if one user changes a post in the wordpress blog and another user changes the same post in myspace, which version wins when syncing? Ouch.. version control management. These are real roadblocks to dataportability.

In some cases it may be simpler to just allow for data access. In the example above, why not just let myspace access the wordpress blog through an rss feed and every change made in wordpress immediately gets reflected in myspace? So instead of dataportability - taking my data from one place (exporting) to another place (importing) -, why not just simple data access?

Or maybe we need both.

Maybe dataportability and data access have different use cases? You want your data to be portable between competing/equal services and you also want to share it with different services. Eg. i want to export my photos from flickr and import them to imageshack and i also want to show my photos in my wordpress blog whether they’re stored in flickr or imageshack.

Ultimately it will be up to the users to decide if they want to take their data with them or just share it. Developers just need to work out a way to make this possible.
