Notes on Web Development
From Ggl's wiki
Synchronous vs Asynchronous
HTTP is a synchronous, stateless request/response protocol. Historically, most web sites were developed as synchronous applications, with sessions built on top of HTTP thanks to cookies.
However, synchronous means that all the work must be done before the server replies to the client. Scheduled tasks such as cron jobs are one way to work in the background, but they mostly run without knowledge of the request context and independently of user interaction.
Behind the curtain, an application can work asynchronously in a producer/consumer fashion. Depending on the data and the operations, different patterns may be used: a simple producer/consumer with queues, or map/reduce, for example.
A simple producer/consumer with a queue works by having the producer push requests onto the queue. The consumer then pops a request, processes it, and finally replies to the producer or to another process. We can look at this simple architecture at a high level or dig lower.
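As a rough illustration of the queue-based flow described above, here is a minimal sketch using only the Python standard library. The names `consumer` and `run`, the `None` sentinel and the upper-casing "work" are assumptions made for the example; a real setup would more likely put the consumer in a separate process or on another machine.

```python
import queue
import threading

def consumer(requests: queue.Queue, replies: queue.Queue) -> None:
    """Pop requests, process them, and push one reply per request."""
    while True:
        item = requests.get()
        if item is None:          # sentinel: the producer has nothing more to send
            break
        request_id, payload = item
        replies.put((request_id, payload.upper()))  # stand-in for the real work

def run(payloads):
    """Push every payload onto the queue, then collect the replies by id."""
    requests, replies = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=consumer, args=(requests, replies))
    worker.start()
    for request_id, payload in enumerate(payloads):
        requests.put((request_id, payload))
    requests.put(None)            # tell the consumer to stop
    worker.join()
    return dict(replies.get() for _ in payloads)
```

The producer never calls the processing code directly: it only knows about the two queues, which is what lets the consumer be moved elsewhere later.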
IPC
How will the processes communicate?
There are plenty of serialization formats and RPC protocols:
- Thrift
- Protocol Buffers
- XML-RPC
- JSON-RPC
- Bert-RPC
- Avro
On the transport side, I like ZeroMQ because it is simple and emphasizes efficiency. Bert-RPC and Avro are the best candidates for the serialization/RPC part because they are simple and flexible.
We can split the two:
- Serialization formats
- Transport protocol (wire format and communication)
Then we can serialize in Bert or JSON and send messages over ZeroMQ. A convenient library would provide an API to easily serialize and build messages.
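To make the split between serialization and transport concrete, here is a small sketch of what such a message-building API might look like, using JSON from the standard library. The envelope fields (`id`, `method`, `params`) and the function names are assumptions for illustration, not part of any existing library. The transport is deliberately left out: the functions only produce and consume bytes, which any transport (a ZeroMQ socket, for instance) could carry.

```python
import json
import uuid

def build_message(method, params):
    """Serialize a request envelope to bytes, ready for any transport."""
    envelope = {
        "id": uuid.uuid4().hex,   # lets the receiver match a reply to its request
        "method": method,
        "params": params,
    }
    return json.dumps(envelope).encode("utf-8")

def parse_message(raw):
    """Decode bytes produced by build_message back into a dict."""
    return json.loads(raw.decode("utf-8"))
```

Swapping JSON for another format would only change these two functions, not the code that sends or receives the bytes.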
Templating
This topic may feed lots of debates. What features should templating provide? The most basic one is variables: it is the core role of templating :). Then what kind of variables? Lists, arrays, associative maps... Does it provide operations on data?
We might think that templating should only deal with data representation. In this case, it does not provide control structures (like loops or conditional statements). Take the example of a page that displays a different menu depending on whether the user is authenticated or not. How do we do this?
A straightforward technique is to use a conditional statement:
<% if user.authenticated %>
  <li>user</li>
  <li>signout</li>
<% else %>
  <li>signin</li>
<% endif %>
Without a conditional statement, we include the corresponding template instead: default_menu.template or userauthenticated_menu.template. In other words, we associate each state with a template.
menu_template user
  | isAuthenticated user = Template "userauthenticated_menu.template"
  | otherwise            = Template "default_menu.template"
And finally we render a template that is the composition of its sub-templates.
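The same idea can be sketched in Python with `string.Template`, which only supports variable substitution and therefore has no control structures at all. The template contents, the `page.template` name and the dictionary-based user are assumptions for the example; the two menu template names come from the text above.

```python
from string import Template

# In-memory stand-ins for template files.
TEMPLATES = {
    "userauthenticated_menu.template": Template("<li>$name</li><li>signout</li>"),
    "default_menu.template": Template("<li>signin</li>"),
    "page.template": Template("<ul>$menu</ul>"),
}

def menu_template(user):
    """Associate the user's state with a template; the logic lives here, not in the template."""
    if user.get("authenticated"):
        return TEMPLATES["userauthenticated_menu.template"]
    return TEMPLATES["default_menu.template"]

def render_page(user):
    """Render the page as the composition of its sub-templates."""
    menu = menu_template(user).substitute(name=user.get("name", ""))
    return TEMPLATES["page.template"].substitute(menu=menu)
```

The decision is made in ordinary code, and each template stays a plain description of the output.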
MVC vs Components
The Model-View-Controller pattern is one of the most widely adopted designs in web frameworks. Another approach, implemented by the Seaside framework, is component-based with continuations.
In the MVC pattern:
- the model represents the data. It is commonly an abstraction of the database layer. However, it does not impose a database type: the database might be relational, object-oriented, simple key/value, etc.
- the view presents the data to the client. It takes the data and a description of how they should be presented, then translates the data into the expected representation. For example, data may be represented by a tree of objects, and the view transforms the tree into an HTML file.
- the controller is the central part that gets the data from the models and processes them. Once the data are processed, they can be passed to the view.
A component-based framework with continuations manipulates components that have a local state and actions that the component applies to itself. You can think of a component as a widget. The continuation makes it possible to run through the different states of the component despite the stateless nature of HTTP.
Both approaches have their advantages. They also have in common the need to access data and process them efficiently. One of the main issues comes from blocking and sequential processing.
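A very rough approximation of this idea can be written with a Python generator, which keeps its local state while suspended between two requests. This is only a sketch: a generator can be resumed once per suspension point, so it is weaker than the continuations Seaside relies on (it cannot, for instance, go back to an earlier state). The `counter_component`, `SESSIONS` and `handle_request` names are hypothetical.

```python
def counter_component():
    """A component whose local state (count) survives between requests."""
    count = 0
    action = yield f"<p>{count}</p>"      # suspend until the next request arrives
    while True:
        if action == "increment":
            count += 1
        elif action == "reset":
            count = 0
        action = yield f"<p>{count}</p>"

# Hypothetical session store: session id -> suspended component.
SESSIONS = {}

def handle_request(session_id, action=None):
    """Resume the component bound to this session and return its rendering."""
    if session_id not in SESSIONS:
        component = counter_component()
        SESSIONS[session_id] = component
        return next(component)            # run up to the first rendering
    return SESSIONS[session_id].send(action)
```

Each HTTP request is stateless, but the session id is enough to find the suspended component and continue where it stopped.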
Parallel processing
Why?
Imagine you have a page with multiple widgets. As a basic example, let's take a blog. We want to display the articles and their summaries, the list of article titles, RSS feeds, a tag cloud and tweets.
First, some data are shared. The articles with their summaries and the list of article titles both show the articles, and the RSS feeds are almost the same as the articles in HTML. The tag cloud may require a heavy query on a database. Finally, tweets come from an external site, and we control neither their availability nor the time it takes to get the result.
Computing unit
We would then like to separate the widgets, group those that share data and process each group separately.
Each widget needs a simple computing unit. A coroutine seems better than a thread because, with cooperative multitasking, we choose when tasks are scheduled. Furthermore, we could see a widget's processing as a continuation and choose to interrupt it when needed.
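To show what "we choose when we schedule the tasks" can mean in practice, here is a minimal sketch of cooperative multitasking with generators and a round-robin scheduler. The `widget` and `run_round_robin` names, and the idea of counting work in "steps", are assumptions for the example.

```python
from collections import deque

def widget(name, steps):
    """A widget task that hands control back after each unit of work."""
    for step in range(steps):
        yield f"{name}:{step}"       # cooperative yield point

def run_round_robin(tasks):
    """Run each task up to its next yield, in turn, until all are finished."""
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))
            ready.append(task)       # not finished: schedule it again
        except StopIteration:
            pass                     # finished: drop it
    return trace
```

For example, `run_round_robin([widget("articles", 2), widget("tags", 1)])` interleaves the two widgets and returns `["articles:0", "tags:0", "articles:1"]`. A task is only ever interrupted at a `yield`, which is the point of the cooperative model.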
How to group the data?
If we have to wait before data are actually retrieved, we must not use the data or, if we need them, use them only symbolically. This part is really interesting. Like a mathematical proof written with letters only, we may not need the values. We build a computation with symbolic data and execute it at the end.
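One possible reading of "symbolic data" is a deferred value: an object that records what to compute and is only evaluated once the real data are available. The `Deferred` class below is a hypothetical sketch of that reading, not an existing library.

```python
class Deferred:
    """A symbolic value: a recipe for a computation, not its result."""

    def __init__(self, compute):
        self._compute = compute      # function from an environment to a value

    @classmethod
    def var(cls, name):
        """A named placeholder, looked up in the environment at run time."""
        return cls(lambda env: env[name])

    def map(self, func):
        """Describe applying func to this value, without doing it yet."""
        return Deferred(lambda env: func(self._compute(env)))

    def zip_with(self, other, func):
        """Describe combining two symbolic values."""
        return Deferred(lambda env: func(self._compute(env), other._compute(env)))

    def run(self, env):
        """Execute the whole computation once the data are known."""
        return self._compute(env)

# Build the computation before any article has been fetched.
articles = Deferred.var("articles")
titles = articles.map(lambda rows: [row["title"] for row in rows])
count = articles.map(len)
summary = titles.zip_with(count, lambda t, n: f"{n} articles: {', '.join(t)}")
```

Nothing is computed until `summary.run({"articles": [...]})` is called, so the page logic can be assembled while the data are still being retrieved.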
How to retrieve data in parallel?
A normalized data model helps to ensure ACID properties. However, we don't need them all the time. Sometimes it is just more convenient to be able to pass JOIN queries to the database, but that means relying on only one data source.
We should find the data that don't need to be normalized, i.e. the data that don't need consistency. ACID properties were designed in an OLTP context (at the end of the 60s and in the 70s, to answer financial and industrial needs). Do you really need them in your web application? To continue with our simple blog application, do we need consistent data for our tag cloud? At least we should ensure the tags exist, but is it critical to have an up-to-date count (or another criterion for the tag size)? It's the same for the tweets: there is no problem in missing a tweet sent 10 s ago.
Multitasking was implemented to allow tasks to run while other tasks are waiting for I/O (input/output) to complete. Here it's exactly the same. It takes time to get the data, and we don't want to lose that time. So, while we wait, let another process keep the CPU busy! And if it's possible, share the data that can be shared!
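For the blog example, this could look like the following `asyncio` sketch. The data sources are simulated with `asyncio.sleep`; the delays, the sample values, the timeout and the function names are all assumptions. The articles are fetched once and shared between the HTML list and the RSS feed, and the tweets fall back to an empty list when the external site is too slow.

```python
import asyncio

async def fetch(delay, value):
    """Simulate a slow data source (database query, external site...)."""
    await asyncio.sleep(delay)
    return value

async def with_fallback(coro, timeout, fallback):
    """Give up on a source we do not control and use a default instead."""
    try:
        return await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        return fallback

async def render_blog():
    """Retrieve the data for all widgets concurrently instead of one by one."""
    articles, tags, tweets = await asyncio.gather(
        fetch(0.02, ["Post A", "Post B"]),                       # fast query
        fetch(0.03, {"python": 3, "web": 5}),                    # heavier query
        with_fallback(fetch(0.5, ["a tweet"]), 0.05, fallback=[]),  # slow external site
    )
    return {
        "articles": articles,
        "rss": articles,          # shared: same data, different representation
        "tag_cloud": tags,
        "tweets": tweets,
    }
```

Calling `asyncio.run(render_blog())` waits roughly as long as the slowest tolerated source rather than the sum of all of them, because each wait leaves the event loop free to progress on the other fetches.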

