Faster Better Cheaper Search Engines...

John...
Posted: Mon Oct 26, 2009 12:44 am
Guest
Searching for documents and other items on the Web or computers is
often tedious and time-consuming. Time is money. Highly paid
professionals spend hours, days, and even longer searching for
information on the Web or computers. Most search today is done using
keyword and phrase matching, often combined with various ranking
schemes for the search results. Occasionally more advanced methods
such as logical queries, e.g. search for “rocket scientist” and NOT
“space”, and regular expressions are used. All of these methods have
significant limitations and often require lengthy human review and
further manual searching of the search results.
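
As a rough illustration of the keyword, logical-query and regular-expression
methods mentioned above, here is a minimal Python sketch. The three documents
and the function names are made up for the example; real engines work against
an index rather than scanning raw text like this.

```python
import re

# Hypothetical toy documents, invented for illustration only.
DOCS = {
    "doc1": "She worked as a rocket scientist on space launch systems.",
    "doc2": "You do not need to be a rocket scientist to cook rice.",
    "doc3": "The space agency hired a new director.",
}

def boolean_search(docs, must_have, must_not_have):
    """Return ids of documents containing must_have AND NOT must_not_have."""
    hits = []
    for doc_id, text in docs.items():
        lowered = text.lower()
        if must_have.lower() in lowered and must_not_have.lower() not in lowered:
            hits.append(doc_id)
    return sorted(hits)

def regex_search(docs, pattern):
    """Return ids of documents matching a case-insensitive regular expression."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted(doc_id for doc_id, text in docs.items() if rx.search(text))

if __name__ == "__main__":
    print(boolean_search(DOCS, "rocket scientist", "space"))  # ['doc2']
    print(regex_search(DOCS, r"rocket\s+scien\w+"))           # ['doc1', 'doc2']
```

Note that doc2 matches the logical query even though it is not about rocket
science at all, which is the kind of result that still needs human review.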

The dream search engine would search by topic, by the detailed content
of the items searched, ideally finding the desired information
immediately. Actual understanding of text remains an unfulfilled
promise of artificial intelligence. Statistical language processing
can achieve a degree of searching by topic. This article introduces
the basic concepts and mathematics of statistical language processing
and its applications to search. It gives a brief introduction and
overview of more advanced techniques in statistical language
processing as applied to search. It also includes sample Ruby code
illustrating some simple statistical language processing methods.

http://math-blog.com/2009/10/25/faster-better-cheaper-search-engines/
 
Ian Parker...
Posted: Mon Oct 26, 2009 11:29 am
Guest
It is easy to say these things. In fact the modern search engine is
extremely sophisticated in what it is trying to do.
Most search engines these days use LSI (Latent Semantic Indexing).
This represents each web page as a vector. There are some remarkable
associations between websites that have similar vectors.

http://chris.ikit.org/ksv2.pdf

is very impressive. It should be pointed out that LSI scanning is CPU
intensive. OK once a page has been done it has been done.
One thing I would like Google to do is to use vectors when searching
from within a document you are writing. It does not appear to do this.
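
To make the "page as a vector" idea concrete, here is a minimal sketch in
Python. It only builds raw term-count vectors and compares them with cosine
similarity; LSI proper would additionally project these vectors into a
lower-dimensional space with a truncated SVD, which is not shown. The page
texts are invented for the example.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a page as a sparse vector of term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical page texts, invented for illustration only.
PAGES = {
    "astro": "blue giant stars are massive hot stars",
    "music": "blue giant is a music group with albums",
    "stars": "massive stars burn hot and die young",
}

if __name__ == "__main__":
    vectors = {name: term_vector(text) for name, text in PAGES.items()}
    for name in ("astro", "music"):
        print(name, "vs stars:", round(cosine(vectors[name], vectors["stars"]), 3))
```

With raw counts, pages that share no words score zero however related they
are; reducing the dimensionality is what is supposed to let related pages
with different vocabulary end up close together.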

- Ian Parker


 
Ted Dunning...
Posted: Wed Nov 04, 2009 7:57 pm
Guest
I hate to be negative, but ...

On Oct 26, 3:29 am, Ian Parker <ianpark... at (no spam) gmail.com> wrote:
Quote:
It is easy to say these things. In fact the modern search engine is
extremely sophisticated in what it is trying to do.
Most search engines these days use LSI (Latent Semantic Indexing).

This is just plain silly. In fact, very few search engines use LSI
outside of research. Even fewer search engines in production use LSI
directly. A very few engines use some form of random indexing (which
is similar). Off-hand, I can only think of non-search production
applications that use this form of comparison (essay scoring, a (very)
little bit of fraud modeling, some recommendation engines, perhaps one
or two other applications).

Quote:
This represents each web page as a vector. There are some remarkable
associations between websites that have similar vectors.

Moderately interesting is what I would say rather than "remarkable".

Quote:
.... It should be pointed out that LSI scanning is CPU
intensive. OK once a page has been done it has been done.

The first is true, the second is not.

Searching with LSI is exactly proportional to corpus size and is
usually bottle-necked by memory bandwidth and secondarily by.
Searching using conventional techniques is sub-linear in corpus size
when you start getting really large corpora. The cost of LSI is
prohibitive for most large search engines on several axes.
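
A toy sketch of the scaling difference, in Python. It assumes a dense
reduced-dimension vector per document for the LSI-style scan and a plain
inverted index for the conventional case; the function names and the returned
work counters are made up for the illustration, and real engines add
compression, skipping and early termination on top of this.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

def index_lookup(index, query):
    """Return (ids matching all query terms, number of posting entries read)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    touched = sum(len(p) for p in postings)
    hits = set.intersection(*postings) if postings else set()
    return hits, touched

def dense_scan(doc_vectors, query_vector):
    """Score every document against the query; work grows with corpus size."""
    scores = {}
    for doc_id, vec in doc_vectors.items():
        scores[doc_id] = sum(q * d for q, d in zip(query_vector, vec))
    return scores, len(doc_vectors)

if __name__ == "__main__":
    # Hypothetical corpus: one document of interest among many unrelated ones.
    docs = {"d%d" % i: "filler text" for i in range(1000)}
    docs["d0"] = "latent semantic indexing"
    hits, touched = index_lookup(build_inverted_index(docs), "latent semantic")
    print(hits, "posting entries read:", touched)

    vectors = {doc_id: [0.0, 1.0] for doc_id in docs}
    _, compared = dense_scan(vectors, [1.0, 0.0])
    print("documents scored by dense scan:", compared)
```

The index lookup reads only the postings for the query terms, while the dense
scan has to read one vector per document no matter what the query is.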

In addition, LSI is typically best in recall while modern search
applications are (mostly) dominated by considerations of first page
precision. This makes LSI a very bad match to (most) modern needs.
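
For readers who want the recall versus first-page-precision distinction
spelled out, here is a minimal Python sketch. The two ranked lists and the set
of relevant documents are invented; they are not measurements of any real
system.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k

def recall(ranked, relevant):
    """Fraction of all relevant documents that appear anywhere in the results."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)

if __name__ == "__main__":
    relevant = {"a", "b", "c", "d"}          # hypothetical relevance judgments
    broad = ["x", "a", "y", "b", "z", "c", "w", "d"]   # finds everything, noisily
    narrow = ["a", "b"]                      # finds little, but all of it is good

    print(precision_at_k(broad, relevant, 2), recall(broad, relevant))    # 0.5 1.0
    print(precision_at_k(narrow, relevant, 2), recall(narrow, relevant))  # 1.0 0.5
```

A user who only looks at the first page is better served by the second list,
even though the first one has perfect recall.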
 
Ian Parker...
Posted: Thu Nov 05, 2009 2:47 pm
Guest
I have a Google alert for "Latent Semantic Analysis", and loads of
articles come up describing how to optimise your search for Google's
new techniques. It would seem that they are all under an illusion.

It is hard to see how Web 3.0 is ever going to work without some form
of LSA being used to produce precise word meanings.


- Ian Parker
 
Ted Dunning...
Posted: Fri Nov 06, 2009 5:32 pm
Guest
On Nov 5, 6:47 am, Ian Parker <ianpark... at (no spam) gmail.com> wrote:
Quote:
In addition, LSI is typically best in recall while modern search
applications are (mostly) dominated by considerations of first page
precision.  This makes LSI a very bad match to (most) modern needs.

I have a Google alert for "Latent Semantic Analysis", and loads of
articles come up describing how to optimise your search for Google's
new techniques. It would seem that they are all under an illusion.

Well, that would not be the first time that the SEO community have
gone all a-twitter about rumors that have nothing to do with the
reality of how search engines work.

My own working approximation is that SEO "expert" knowledge of search
technology is zero. I have only very rarely seen any
counter-evidence.

LSI and LSA are very particular technical terms that refer to spectral
decompositions of occurrence patterns. Google has done work in
probabilistic LSI (I believe that Hofmann works there now), but I am
pretty sure (without direct knowledge of the code) that the techniques
actually in production use term expansion instead of dimensionality
reduction, and that the term expansion is done primarily at index
time.
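
To show what index-time term expansion means in general, here is a minimal
Python sketch. It is not a description of any production system: the
related-term table, the documents and the function names are all made up for
the example.

```python
from collections import defaultdict

# Hypothetical related-term table, invented for illustration only.
RELATED = {
    "car": ["automobile"],
    "automobile": ["car"],
}

def index_with_expansion(docs, related):
    """Build an inverted index, also filing each document under related terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
            for extra in related.get(term, []):
                index[extra].add(doc_id)   # expansion happens here, at index time
    return index

def search(index, term):
    """Plain single-term lookup; no extra work is needed at query time."""
    return sorted(index.get(term.lower(), set()))

if __name__ == "__main__":
    docs = {
        "d1": "used car prices",
        "d2": "automobile repair manual",
        "d3": "bicycle repair",
    }
    index = index_with_expansion(docs, RELATED)
    print(search(index, "automobile"))  # ['d1', 'd2']
```

The query path stays an ordinary index lookup, so the cost of relating terms
is paid once per document rather than once per search.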

Quote:

It is hard to see how Web 3.0 is ever going to work without some form
of LSA being used to produce precise word meanings.

The point of LSA is to get just the opposite of "precise word
meanings".
 
Ian Parker...
Posted: Fri Nov 06, 2009 8:58 pm
Guest
On 6 Nov, 17:32, Ted Dunning <ted.dunn... at (no spam) gmail.com> wrote:

Quote:
The point of LSA is to get just the opposite of "precise word
meanings".

I disagree, although we may be talking at cross purposes. There is a
set of precise words that is a (possibly quite small) subset of total
words, but which refer to a unique precise concept. Sometimes, as is
the case with Blue Giant, we have other meanings. As I said "Blue
Giant" is a music group. We need a way of distinguishing between music
groups and massive main sequence stars. LSA is the "kneejerk" way of
doing this.
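
A very small Python sketch of the disambiguation problem, using cue-word
overlap with the surrounding text as a simplified stand-in. This is not LSA;
the cue-word lists are invented by hand, whereas an LSA-style approach would
derive the associations from a corpus.

```python
# Hypothetical cue words for each sense of "Blue Giant", invented for illustration.
SENSE_CUES = {
    "star": {"star", "stars", "sequence", "luminosity", "stellar", "mass", "spectral"},
    "music group": {"band", "album", "concert", "music", "tour", "song"},
}

def guess_sense(context, sense_cues):
    """Pick the sense whose cue words overlap most with the context words."""
    words = set(context.lower().split())
    overlaps = {sense: len(words & cues) for sense, cues in sense_cues.items()}
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None

if __name__ == "__main__":
    print(guess_sense("the blue giant is a main sequence star of high luminosity",
                      SENSE_CUES))   # star
    print(guess_sense("blue giant released a new album before the concert tour",
                      SENSE_CUES))   # music group
```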


- Ian Parker
 
 
All times are GMT