Monday 8 February 2010

Liquida - Share a blog with us

How Liquida works

The backbone of Liquida is a wise mix of software technologies, hardware resources and a big dose of human creativity.

From the beginning we chose open source platforms and programming languages, because we believe that freely licensed software and languages developed by open communities guarantee efficiency and reliability.

For these reasons, we opted for a LAMP platform: the Linux operating system, the Apache web server, the MySQL database, and Python and PHP for scripting.

Below is a simple overview of Liquida’s core processes. Our IT team is always happy to share technical suggestions or receive feedback from you, so please don’t hesitate to contact us.

1. Blogs enter our directory

Blogs are added to Liquida either because they have been submitted online or because they have been selected by our staff. In both cases, our editorial staff assigns every blog a score. Among other things, this score affects how the blog and its posts rank in our search results.
To manage the blog directory, our editorial staff uses a back-office interface written in PHP, served by Apache and backed by a MySQL database.

2. New blog content is retrieved via RSS feed

Our proprietary “patrol” algorithm - written in Python and executed on several machines - periodically goes through the full Liquida blog database, looking for new content.

The RSS retrieval system looks for changes in the XML content while acting in the least intrusive and most bandwidth-friendly way, for example by analyzing HTTP response headers instead of downloading redundant, duplicate information.

The “patrol” retrieves the entire RSS file and stores it in the backend MySQL database only when it finds that the XML content of a given blog has changed.
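One common way to detect changes from HTTP headers alone is a conditional request. The Python sketch below illustrates that general technique and is not Liquida's actual “patrol” code: the `FeedState` record and both function names are hypothetical, and it assumes the feed server sends `ETag` or `Last-Modified` headers. When the server answers 304 Not Modified, no feed body is transferred.

```python
from dataclasses import dataclass
from typing import Optional
import urllib.error
import urllib.request


@dataclass
class FeedState:
    """Validators remembered from the last successful fetch (hypothetical record)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(state: FeedState) -> dict:
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    if state.etag:
        headers["If-None-Match"] = state.etag
    if state.last_modified:
        headers["If-Modified-Since"] = state.last_modified
    return headers


def fetch_if_changed(url: str, state: FeedState, timeout: float = 10.0):
    """Return (body, new_state) if the feed changed, or (None, state) if not."""
    request = urllib.request.Request(url, headers=conditional_headers(state))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            new_state = FeedState(
                etag=response.headers.get("ETag"),
                last_modified=response.headers.get("Last-Modified"),
            )
            return body, new_state
    except urllib.error.HTTPError as error:
        if error.code == 304:  # unchanged: the server sent no feed body
            return None, state
        raise
```

Servers that send neither validator would always return the full feed, so a real crawler would need a fallback such as comparing a hash of the body.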

3. Post optimization

The most complex part of the job is handled by another proprietary system, also written in Python, which processes the new posts that the “patrol” software has inserted into the backend MySQL database.

This service, highly parallelized to avoid bottlenecks, is organized into several dozen stages, each of which carries out a specific processing operation.

The stages include parsing the XML structure of the RSS feeds, checking which posts are already present in the archive, retrieving the complete page with its images (by sending an HTTP request to the appropriate server), heuristically analyzing the page template, automatically extracting tags using open source natural language processing libraries, and excluding obscene content.

The post is then transformed into a hash table and stored in the backend database.
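To make the idea of chained stages concrete, here is a minimal Python sketch of the first two stages only: parsing the feed and skipping posts already in the archive. It is an illustration under assumptions, not the real system: the function names, the dictionary keys and the sample feed are all invented, and the later stages (page retrieval, template analysis, tagging, filtering) are left out.

```python
import xml.etree.ElementTree as ET

# Invented sample feed, used only to exercise the sketch.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example blog</title>
<item><title>First post</title><link>http://example.com/1</link><description>Hello</description></item>
<item><title>Second post</title><link>http://example.com/2</link><description>World</description></item>
</channel></rss>"""


def parse_items(rss_text):
    """Stage 1: parse the RSS XML and yield one hash table (dict) per <item>."""
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        yield {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        }


def drop_known(posts, known_links):
    """Stage 2: skip posts whose link is already present in the archive."""
    for post in posts:
        if post["link"] not in known_links:
            yield post


def run_pipeline(rss_text, known_links):
    """Chain the stages; each one consumes the output of the previous one."""
    return list(drop_known(parse_items(rss_text), known_links))
```

Because every stage is a generator that takes posts in and yields posts out, stages can be added, reordered or spread over several workers without changing their neighbours.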

4. Data indexing

Liquida’s indexing and search engine is based on the open source engine Lucene, accessed through Solr (running on Apache Tomcat).

A Python application named mysql2solr extracts the information for each post, processes the data, and passes it to Solr/Lucene for indexing.
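The hand-off to Solr can be pictured with Solr's XML update format, in which documents are wrapped in an `<add>` message. The Python sketch below is an assumption about what such a step might look like, not the actual mysql2solr code: the function name and the field names in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET


def post_to_solr_add(post):
    """Serialize one post dict as a Solr XML <add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in post.items():
        # A list or tuple becomes several <field> elements with the same name,
        # which is how the XML format expresses multi-valued fields.
        values = value if isinstance(value, (list, tuple)) else [value]
        for single in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(single)
    return ET.tostring(add, encoding="unicode")


# Hypothetical post with invented field names.
message = post_to_solr_add({"id": 1, "tags": ["python", "solr"]})
```

The resulting string would be sent to Solr's update handler over HTTP, followed by a commit so that the new documents become searchable.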

Liquida has three core indexes: one for posts, which returns the posts and images that match the search criteria; one for blogs; and one for tags, which provides tags related to the search terms. Indexed content can be searched chronologically and, most importantly, with custom-built query functions based on content relevance.

5. The web site

Liquida’s front end is developed in PHP, using CodeIgniter as the PHP framework and MooTools as the JavaScript framework.

The site talks directly to Solr/Lucene, which returns results as a serialized array that is easy to process in PHP. The front end never goes through MySQL, which is used only in the backend.
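For illustration, a search request of this kind could be assembled as follows. The sketch is written in Python for consistency with the other examples here, although the real front end is PHP. The function name, the base URL and the `published` date field are assumptions. The `wt=phps` parameter asks Solr for its serialized PHP response format, which PHP can turn into an array with `unserialize()`.

```python
from urllib.parse import urlencode


def build_select_url(solr_base, text, newest_first=False, rows=10):
    """Build a Solr select URL; Solr sorts by relevance score unless told otherwise."""
    params = {"q": text, "rows": rows, "wt": "phps"}
    if newest_first:
        # Hypothetical date field name, used for chronological ordering.
        params["sort"] = "published desc"
    return solr_base.rstrip("/") + "/select?" + urlencode(params)


url = build_select_url("http://localhost:8983/solr", "open source", newest_first=True)
```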

The PHP front end and some of the Python applications use Memcached as a caching system to improve performance.
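A typical way to use a cache like this is the cache-aside pattern: look in the cache first, and only run the expensive query on a miss. The Python sketch below shows that pattern under stated assumptions. `InMemoryCache` is a stand-in written for this example so that it runs without a Memcached server, and `cached_search` and `run_search` are hypothetical names. A real Memcached client would replace the stand-in.

```python
import hashlib
import time


class InMemoryCache:
    """Stand-in for a Memcached client, so the sketch runs without a server."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._store[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key, value, expire=0):
        # expire=0 means "never expires" in this stand-in.
        expires_at = time.monotonic() + expire if expire else None
        self._store[key] = (value, expires_at)


def cached_search(cache, query, run_search, ttl=60):
    """Cache-aside: return a cached result, or compute it and store it."""
    # Memcached keys cannot contain whitespace, so the query text is hashed.
    key = "search:" + hashlib.sha1(query.encode("utf-8")).hexdigest()
    result = cache.get(key)
    if result is None:
        result = run_search(query)
        cache.set(key, result, expire=ttl)
    return result
```

The time-to-live bounds how stale a cached search result can become, which matters for a site whose index is updated continuously.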

The Liquida blog runs on WordPress.

To streamline our development team’s work, the code is versioned with Subversion, and Eclipse is our main editor.

