There are now search engine and internet marketing services; in fact, a new industry has materialised to exploit the fear of low search rankings.

This is not a new trend. Back when simply resubmitting your website to the engines kept your site at the top of the index, there was an accompanying boom in resubmitting "companies"; as we know, these were just men in back bedrooms with a host of CGI and Perl submission scripts and a timetable.

Search engine optimisation, or "SEO", is the latest incarnation of this bedroom profiteering. The important difference is that now the webmasters are not just passively involved but are being forced to adopt totally artificial and unsocial practices that ultimately serve only to damage the Internet!

SEO is supposedly the methodology and processes related to designing search engine "friendly" web content. The basic premise is something like "If I follow all the engines' formatting and connectivity criteria, then my website will rank higher than a comparable website that does not".

All other things being equal, this seems quite positive. Given that the quality of a search engine's database (index) directly affects its output, webmasters optimising their content so that search engines can correctly categorise the internet should logically improve the speed and quality of "the crawl".

SEO then, logically, should be good for the search providers: being able to maintain an efficient index should use less raw processing power, require less equipment and thus less energy. It must also be good for the users, who can quickly and intuitively find what they want from a reliable source. Sounds reasonable, right?

Well, that's the happy version. The fact is that initially this may be true, and you may gain a short-term advantage. But once we have all optimised our content for analysis and (in so doing) ignored our users, we will be back where we started, and the search providers will just think up some even more ridiculous "laws" by which to "judge" us, and like sheep we will all follow those as well. Thus the causal paradox is perpetuated and the users feel abused!

Even this is a vast oversimplification; the true nature of SEO is a lot more complicated. The heart of the problem, and the real issue here, is the search providers' task, which is to strip-mine the information junk yard otherwise known as the Internet. It may be full of interesting stuff, but there is also plenty of garbage, and they need to devise intelligent techniques to mine the interesting stuff!

The current "solution" is literally for the search engines to use their hegemonic standing to bully the webmasters into organising their work in ways that have the primary effect of allowing quick "analysis", so that they can categorise the website. This has the secondary effect of requiring content to be designed "for" analysis, which typically translates to highly distributed connectivity, i.e. the website being effectively divided into "micro sites", which makes the maintenance of links and content more troublesome!

This is not necessarily a bad thing; most of these imposed linking and design methodologies are positive and beneficial for a lot of subjects. My problem is that this is unilaterally enforced, and it is this type of issue that is generating all the money for the SEO boys.

However, this will soon be of no consequence. To understand the problem with this type of SEO operation, it is necessary to think about how we can approximate and simulate the human process of mining information and knowledge.

Let us assume we have set our crawlers to work, automatically indexing pages (at random, looking at previous indexing, and guided by user requests). We then format the resulting text: ASCII is usually used, and validation follows this; search engines tend to ignore some tags and make use of good ones that help identify the content. At this point we would have reduced the Internet to a corpus, i.e. the collection of all HTML documents about no particular subject.

We would then set about item normalisation, i.e. identification of tokens (words), characterisation of tokens (tagging meaning to words), and finally running stemming algorithms to remove suffixes (and/or prefixes) to derive the final database of terms. This can be efficiently and compactly represented in lower-dimensional term spaces (Google are still essentially using inverted file structures).

Imagine each document of a corpus as a point in an N-dimensional space, with one dimension per term. Here the literal word-matching type of search is lost, but we acquire more of a semantic flavour, where closely related information can be grouped into clusters of documents bearing similarities. However, N-dimensional vector spaces are of no help to the users.

After applying our algorithms to the corpora, we get a term-by-document matrix, where terms and documents are represented by vectors; a query can also be represented by a vector. So we have a query and our corpora (represented as vectors, both having the same dimensions), and we can now start matching the query against all the available documents using the cosine of the angle between the two vectors.

But we now have a new artificial "problem". We know the general answer to the question "which websites best match my search terms"; this information now exists in our mathematical object, at a high level of abstraction, i.e. the cosine of the angle between the query vector and every column of the matrix. The answer is expressed as a vector corresponding to the sought column, and therefore to the document we are after. All we need do is present this to the user, right? Well...

The issue is that a search engine needs to generate a linear index, i.e. convert the vectors corresponding to the smallest angles into a human-readable format, and until such time as someone thinks of a better way to do it, all engines output lists. Like your shopping list, it has a start, a middle and an end. Therein lies the problem: how to order the list!

The hypothesis seems simple: order information that might look chaotic at first, using the fact that closely associated documents tend to be relevant to similar requests. However, the internet (being a scale-free network) is so vast that it is not possible to present everything; what can be presented is a chosen feature space that represents the x closest documents to the convergence point of a given cluster, measured by the common Euclidean distance. This is what should then be presented to the user in a more intelligible (semantic) display.

The engines could just present the returns as produced by the matching algorithms after decomposition. A document grouped using probabilistic or fuzzy patterns taken directly from the cluster might belong to more than one class, but a strength (degree of membership) value, measured as a probability on the [0,1] interval, is quite adequate for ordering.

The reason singular value decomposition works for ordering is that when the co-occurrence of two terms (say tomato and potato) is very high, this is reflected in the term-by-document matrix, which shows that only x of the n terms are used very frequently.

The idea is that since a term such as pepper is mentioned very little, its axis does not much affect the search space, making the space flat and relevant only in the other two dimensions.

However, the engines' demonic creators can't do this, because they are still essentially using an inverted file structure, yet they still want absolute correctness in their indexes and returned results. That means trouble, because it assumes your index is perfect, incapable of being manipulated, and that you can somehow order the returns in a meaningful way!

So the returned results can't generally represent the documents that match semantically. We now need to account for some subjective quantities that cannot be derived directly from the corpora. The engines attempt to deal with this through a cocktail of criteria that rank the returns in such a way that the "better" results are more likely to be closer to the top of the list.

There are many ways of doing this. The current trend is to use inference about the quality of web sites where possible, because such quantities are beyond the direct control of the content creators and the webmasters.

PageRank provides a more sophisticated form of citation counting, embodied in the concept of link analysis: it uses a relative value of importance for a page, measured from the average number of citations per reference item.

PageRank is currently one of the main ways to determine who gets to the top of the listings, but soon this will all become irrelevant. When the engines stop using inverted file structures, they can just use the grouping generated by probabilistic fuzzy patterns resulting from the convergence point of a given cluster under the common Euclidean distance.

When the changeover from inverted file structures occurs, there will be two direct consequences. First, the corpora will be capable of holding vastly more representative and more detailed data than is currently possible. Second, the corpora will no longer be indexed as is currently done; they will embody semantic meaning and value, where some subjective quantities can be derived directly from the corpora without the need for cocktails or totally artificial rules. The effect is that corpora will be more accurate and incapable of manipulation, so variations of SEO that involve indirect manipulation of the index will become pointless overnight.

It is worth noting that the search providers are becoming increasingly pessimistic about website promotion in all forms. They currently penalise many things that can affect the results, such as duplicated content (which can be perfectly legitimate) and satellite sites, i.e. one webmaster interlinking seemingly separate but highly relevant websites.

They may well start penalising webmasters who promote their websites through articles they submit for third-party distribution, as they do people who post their sites' information to bulletin boards!

Being banned from the top search engines can effectively destroy your business, if not directly through loss of visibility then indirectly, in that people tend to judge you on whether you are organised enough to be listed!

The criteria are continually changing as the amoral SEO boys attempt to pervert the results. These "laws" are not always clear and there are no appeals; we are all subject to the providers upending a drum and dispensing swift and hard "judgements" that can doom us at any time!

The part that irks the most is that as the indexes converge (Google's index is used directly by 2 of the 3 top engines, and 5 others use it indirectly for their rankings), a ban by any one of these engines is enforced by them all.
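For readers who want to see the matching step in concrete form, the pipeline described earlier (item normalisation, a term-by-document matrix, and cosine matching of a query vector against every document) can be sketched in a few lines of Python. This is only a toy under stated assumptions: the three sample documents, the `crude_stem` suffix stripper and the `rank` helper are all hypothetical and stand in for a real stemmer and a real index; no engine is claimed to work this way.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Item normalisation, step one: lower-case and split into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Toy suffix stripper (an assumption for illustration, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_vector(text, vocabulary):
    """One column of the term-by-document matrix: a count per vocabulary term."""
    counts = Counter(crude_stem(t) for t in tokenise(text))
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return (score, name) pairs, best match first."""
    vocabulary = sorted(
        {crude_stem(t) for text in documents.values() for t in tokenise(text)}
    )
    q = term_vector(query, vocabulary)
    scored = [
        (cosine(q, term_vector(text, vocabulary)), name)
        for name, text in documents.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical three-document corpus.
documents = {
    "garden": "growing tomatoes and potatoes in the garden",
    "kitchen": "roasting potatoes with pepper",
    "search": "search engines index web pages",
}
results = rank("tomato potato", documents)
for score, name in results:
    print(f"{name}: {score:.3f}")
```

Note that the output is already an ordered list: the sort on the cosine score is exactly the "linear index" step the article complains about, and everything an engine adds beyond that score is the cocktail of ranking criteria.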
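The link analysis idea behind PageRank, mentioned earlier as a more sophisticated form of citation counting, can also be sketched. What follows is the commonly published textbook formulation computed by simple repeated iteration, not a description of how any engine actually implements it; the four-page link graph, the damping factor of 0.85 and the `pagerank` function name are assumptions made purely for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated iteration.

    links maps each page to the list of pages it links to (its citations).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on, split evenly over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site: "orphan" links out but nothing links to it.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
scores = pagerank(links)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.4f}")
```

The point of the sketch is that a page's score depends only on who links to it, which is why such quantities sit largely outside the direct control of the page's own author.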