Developers of web search engines and data mining tools expend vast
sums attempting to find the most efficient ways for users to search. Mighty
algorithms travail indexes with great speed and seemingly great efficacy,
narrowing a keyword search to a "short" list of results within seconds. However,
the "No Free Lunch Theorem" holds that there simply is no algorithm that can
perform better than a random selection on the set of all problems. In other
words, no matter how hard those developers try they will simply never beat a
randomly selected set of "results".
In the present paper, Palotai, Farkas, and Lörincz seek to understand how the
topological structure of the environment influences algorithmic efficiency and
whether or not there might be a "free" lunch, after all. They compare the
performance of algorithms in unearthing news from a large news website, using
basic learning techniques - selective learning, reinforcement learning, and
their combinations, in random, scale-free, and scale-free small world (SFSW)
environments. Their empirical results suggest that selective learning is the
most efficient in SFSW topology, but in non-small world topologies, combining
selective and reinforcement learning algorithms gave the best results.
Evolving systems, both natural and artificial (like the Web), exhibit scale-free
or scale-free small world properties. Previous researchers have shown that there
is no performance difference between optimization or search algorithms if the
algorithms are tested on every possible problem. This implies that differences
in the performance of specific algorithms are simply a result of specific
properties of the problem being looked at. By uncovering these properties, it
should then be possible to develop optimized search approaches, despite the no
free lunch theorem. The structure of a database or index is an important
property and others have already demonstrated that an evolutionary algorithm
that adapts to the results it obtains performs better in artificial scale-free
environments than in lattice, small world, or random environments. Knowing this
could be exploited usefully in developing more effective data mining and web
search tools.
Everyday millions of users carry out millions of searches on the Web, looking,
on the whole, for novel information. The search itself therefore bears its own
natural reward function, explains Lörincz, the number of novel web pages or
documents that the algorithms can find. The fact that the researchers did not,
and could not, specify the temporal and structural details of this reward system
is perhaps the most important aspect of the current paper. With this in mind the
team set about designing an experiment to compare algorithms approaches.
Their controllable experiment was designed so that the complexity of experiments
and the number of designer specifiable parameters were minimal and looked at the
structure of the search environment, the algorithms, and the fitness values.
They chose the web as the most appropriate data set as it is a large source of
rapidly changing data and because of its organic evolution has an almost
entirely scale-free small world (SFSW) structure. They chose a large news
website containing time stamped entries and downloaded a portion of this with
which to experiment. They could thus preserve the temporal structure of the
rewards, and at the same time modify the underlying connectivity structure so
that the way in which novel information can be found could be changed to test
different search methods. They used a group of computer agents - Web crawlers.
These travel from link to link and forage for new, not-yet-seen information. The
agents used the simplest possible algorithms: selective learning concerned the
selection and memorization of good links. They also used a simple version of
reinforcement learning (RL). RL was used alone and in conjunction with selective
learning in separate experiments.
There are two types of agents - foragers and reinforcing agents (RA). Foragers
crawl the web and send back the addresses (URLs, uniform resource locators) of
the selected documents to the reinforcing agent. The RA is a simple machine that
sends out foragers one after the other an equal number of times. It acts as a
central agent that can determine which forager finds new information first and
then reinforces that forager. The RA sends reinforcements to the foragers based
on the received URLs. Lörincz and his colleagues have employed a fleet of
foragers to study the competition among individual foragers. The foragers then
compete with each other to find the most relevant documents and so efficiently
collect new relevant documents without interacting directly with each other.
Foragers adapt based on reinforcements using their learning algorithms. The
central agent has been used to model real life, where there is no actual agent,
but where food that is eaten in any given area by one forager is obviously then
unavailable to foragers that follow, so they are not reinforced by searching the
same region.
The researchers then applied a set of constraints on finding the minimal set of
algorithms. As such, the algorithms should allow the identification of
unimportant parameters, support the specialization of individual foragers, allow
the melding of evolutionary learning and individual learning, and minimize
communication as much as possible. The team deployed sixteen foragers to search
the CNN website for relevant documents; that is pages that are less than 24
hours old and that have not been previously marked as seen. In this the download
part of the experiment, the goal was to assimilate a database that is unbiased
from the point of view of later modifications to its structure. The first eight
forages used both selection based and reinforcement learning. The other eight
foragers used only selection based learning. The selection based algorithm,
known as the weblog update algorithm, learns the best starting URLs of searches
from which the forager can periodically restart its search. The reinforcement
learning based URL ordering update algorithm modifies the crawling directions of
the foragers, i.e., the type of food (here, topic) they like. Once foraging was
complete the team investigated the link structure of returned documents and
found it to obey a power law function, in other words it was scale free.
The next step was to repeat the experiment but in a simulation mode rather than
on the web. For this, the forager architecture was implemented in Matlab as if
it were running on the same computer used in the "real" experiments. They
conducted simulations with four different kinds of foragers in each environment:
WR foragers used both the weblog update and the reinforcement learning based URL
ordering update algorithms. WL foragers used only the weblog update algorithm
without URL ordering update; these foragers each had a different variable weight
factor. RL foragers used only the reinforcement learning based URL ordering
update algorithm without the weblog update algorithm. Finally, fix foragers did
not use the weblog update and the reinforcement learning based URL ordering
update algorithms. These foragers had fixed starting URLs and fixed weights.
The researchers found that the efficiency of the algorithms depends strongly on
the weight vectors associated with each forager in the simulation. In contrast,
the number of starting points has little impact on efficiency. Two foragers with
twenty different starting points were comparable in efficiency to sixteen
foragers using 160 different starting points. The number of starting points and
the number of foragers are less important than the weight vectors in the
simulations. Larger numbers of crawlers and having more than one computer can
only change the situation weakly, which gives the researchers confidence that
their studies of structural dependencies would be valid.
They found that freshness and age values of different foragers change in the
SFSW environment while for the other three environments the values are almost
the same for the different type of foragers. RL and Fix foragers can get trapped
in clustered environments as they cannot find outbound links in less relevant
documents. The weblog algorithm provides a way to reach the best clusters that
have been found and so escape the worst cluster. As such, the RL and Fix
foragers in the SFSW environment perform worse than WL and WR foragers.
Finding new documents is hardest in the SFRandom environment for all foragers as
incoming link degree distribution is exponential which means that there are
relatively more small degree pages than in the other three scale-free
environments. It is thus more difficult for foragers to reach those pages that
have lots of links to many pages. Overall, finding relevant documents is easiest
and download efficiency is best in the SFSW environment for all foragers. This
is because of the high number of found relevant documents compared to the other
three environments. However, finding new URLs is hardest in the SFSW environment
because of its clustered nature. It seems that selection fits the SFSW structure
and vice versa, which the researchers suggest could explain the abundance of
emergent SFSW structures in nature and perhaps even the organic nature of the
web itself.
This finding can be understood in terms of the no free lunch theorem by
considering it in reverse. If we find that a particular algorithm is more
efficient than random search or than any other algorithm, then this winning
algorithm "knows" (or fits) the underlying problem better than the others. The
researchers conclude from this that highly clustered SFSW structures and simple
selective learning algorithms intrinsically match each other.
Consider data mining on the web. If the goal is to harvest the largest number of
novel documents within the shortest time interval then, according to this study,
the best approach is to use the selective weblog algorithm, provided that the
domain of interest is SFSW. The scenario may be very different, however, if the
goal is simply to download all novel documents. This is an intriguing problem
that calls for the examination of the graphs of the not-downloaded documents for
the different algorithms and, possibly, for the examination of the virtual graph
between the foragers, explains Lörincz. In this virtual graph, two foragers are
connected to each other if they have sent back the same document. The connection
is strong (they are close to each other) if they send back a large number of
identical documents. Foraging may be optimized if the foragers learn which
foragers are close to them and share the knowledge about their recent paths.
Distributed communication between close foragers may improve data mining without
too much communication overhead. This issue is the subject of the current
research efforts of Lörincz and colleagues.
This article originally accompanied the paper by
Lörincz
et al. in Karger's journal Complexus for which David Bradley is an editorial
board member.