Search engines such as Google and Yahoo! are the prevalent way to search the Web. Search in dynamic pages, however, is either nonexistent or far from perfect. AJAX and Rich Internet Applications are examples of such dynamic pages. They are increasingly frequent on the Web (YouTube, Amazon, GMail, Yahoo!Mail) and on mobile devices, and they offer the user a high degree of interactivity by seamlessly loading content from the server without refreshing the page. Current search engines cannot correctly index AJAX applications: because they do not understand the application logic that loads content dynamically, they produce false positives and false negatives. Crawling an AJAX application is a difficult problem. Since content changes when the user invokes events on the page, the crawler must identify the different application states generated by the client-side logic.

This demo sets the stage for this new type of search and shows that a search engine for AJAX can be built. Compared with traditional search engines, the challenges include: automatically identifying states by triggering events, efficiently crawling application states, avoiding the invocation of a potentially very large number of events, scaling with the number of events, eliminating duplicate states, presenting and aggregating results, and ranking. The demo presents the AJAX search engine (crawler, indexer and query processor) applied to a real application, and showcases the challenges and solutions.

Currently, Google and other search engines are the usual way to search the World Wide Web. A large part of Web pages can be indexed and retrieved with good quality. However, the Web is changing. More and more applications are dynamic by nature and include a lot of client-side and client-server interactivity: JavaScript applications, AJAX applications and Rich Internet Applications already handle much of the data on the Web and on portable devices, providing a high degree of interactivity to the user. Current search engines fail to index these applications correctly. AJAX applications run partly on the client, embedding a lot of functionality in a single page, under the same URL. Current search engines do not index these pages since they do not understand the application logic: the application has states, and events cause transitions between states. Furthermore, all states are identified by the single URL of the page, which is incompatible with the traditional search model of the Web. Current search engines produce false positives by treating an entire application as a single page, or false negatives by ignoring the parts exposed only through client-side scripting. Currently, search in AJAX applications is done either by custom search engines developed by the application provider or by exposing the data to traditional search engines, based on agreements. Where search is possible, it is hard-coded and expensive to implement. Small providers cannot afford this luxury.

We address this problem by implementing AJAXSearch, an AJAX-aware search engine. Like a traditional search engine, it contains a crawler, an indexer and a query processor, as shown in Figure 1, but the components are adapted to AJAX. Our first implementation focuses on AJAX sites without user input, thus avoiding the already studied domain of the “hidden web”.
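To make the difference from URL-level indexing concrete, the following is a minimal sketch of an index whose entries point to application states rather than to whole pages. It is an illustration under stated assumptions, not the AJAXSearch implementation: the `(url, state_id)` key, the function names `build_index` and `search`, the whitespace tokenization and the example URL are all hypothetical.

```python
from collections import defaultdict


def build_index(states):
    """Build an inverted index over application states.

    `states` maps a hypothetical (url, state_id) pair to the text visible
    in that state. The result maps each term to the set of states that
    contain it, so a hit identifies a state and not only a URL.
    """
    index = defaultdict(set)
    for key, text in states.items():
        for term in text.lower().split():
            index[term].add(key)
    return index


def search(index, query):
    """Return the states that contain every query term (conjunctive query)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Two states of one toy application, both served under the same URL.
URL = "http://example.org/news"
toy_states = {
    (URL, "s0"): "Elections results today",
    (URL, "s1"): "Sports results today",
}
toy_index = build_index(toy_states)
```

In this toy example, `search(toy_index, "sports results")` returns only the state `s1`. An index keyed by URL alone would report the whole page as a match, including for a query such as "elections sports" whose terms never appear together in any single state, which is the kind of false positive described above.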

Technical challenges: The challenges to address when building an AJAX search engine are:
(i) generating application states by invoking numerous events.
(ii) identifying duplicate states.
(iii) maintaining context information; reconstructing application states.
(iv) ranking results.
(v) state explosion when there are too many events.
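Challenges (i), (ii) and (v) can be illustrated with a small sketch of a state crawler. It is a simplified model under assumptions, not the crawler used in AJAXSearch: the application is simulated by two hypothetical callbacks (`events_of` and `fire`), duplicate states are detected by hashing whitespace-normalized content, and the `max_events` budget is a crude stand-in for whatever strategy limits state explosion in practice.

```python
import hashlib
from collections import deque


def fingerprint(content):
    """Hash whitespace-normalized content so identical states compare equal."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def crawl(initial, events_of, fire, max_events=100):
    """Explore application states breadth-first by firing events.

    initial   -- content of the initial state
    events_of -- hypothetical callback: content -> list of event names
    fire      -- hypothetical callback: (content, event) -> new content
    Returns (states, transitions), where states maps fingerprint -> content
    and transitions is a list of (source, event, target) fingerprints.
    """
    start = fingerprint(initial)
    states = {start: initial}
    transitions = []
    queue = deque([start])
    fired = 0
    while queue and fired < max_events:
        src = queue.popleft()
        for event in events_of(states[src]):
            if fired >= max_events:  # event budget exhausted
                break
            fired += 1
            content = fire(states[src], event)
            dst = fingerprint(content)
            transitions.append((src, event, dst))
            if dst not in states:  # only new states are explored further
                states[dst] = content
                queue.append(dst)
    return states, transitions


# Toy news reader: three pages linked by "next" and "prev" events.
PAGES = {
    "page 1: elections": {"next": "page 2: sports"},
    "page 2: sports": {"next": "page 3: weather", "prev": "page 1: elections"},
    "page 3: weather": {"prev": "page 2: sports"},
}


def toy_events(content):
    return sorted(PAGES[content])


def toy_fire(content, event):
    return PAGES[content][event]


states, transitions = crawl("page 1: elections", toy_events, toy_fire)
```

On the toy news reader, four events are fired but only three distinct states are stored, because the "prev" events lead back to states that were already seen. The recorded transitions also keep the event path that leads to each state, which relates to challenge (iii), reconstructing a state when presenting a result.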

The demo shows each component of the AJAX search architecture. The main aspects are highlighted for two applications: a news reader and Yahoo!Mail, a well-known AJAX application.

AJAXSearch: Crawling, Indexing and Searching Web 2.0 Applications