Eighth, link calculation: Link calculation is always one of the things we care about most. Mainstream search engines now treat the link relationships between web pages as an important part of their calculations, looking at which links point to a page and how much weight those links pass on.
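As a rough illustration of how weight can be passed along links, here is a minimal sketch in the style of a simplified PageRank iteration. The text does not say which formula any particular search engine uses, so the function name `link_weights`, the damping factor of 0.85, and the iteration count are all assumptions made for this example only.

```python
def link_weights(links, damping=0.85, iterations=20):
    """Estimate a weight for each page from the link graph.

    links: dict mapping a page to the list of pages it links to.
    This is a simplified PageRank-style iteration, used here only as an
    illustration; it is not any real search engine's actual formula.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small base weight.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its weight evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its weight evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical three-page site: "a" and "b" both link to "c", "c" links to "a".
example_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
weights = link_weights(example_links)
```

In this toy graph, page "b" receives no inbound links, so it ends up with less weight than page "c", which is linked to by two pages.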
Seventh, the inverted index: The forward index mentioned above is not used directly for ranking; ranking uses the inverted index. Just imagine: if the forward index were used for ranking, then every time a user searched for a keyword, the engine would have to scan all documents to find the ones containing that keyword, and the workload would be unrealistically large. Search engines therefore rebuild the forward index library into an inverted index, in which each keyword maps to the files that contain it. When a user searches for a keyword, the engine simply looks up the files listed under that word, which is much faster and also easier to implement.
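The conversion described above can be sketched in a few lines. This is a minimal sketch, assuming the forward index is a plain dictionary from document ID to a keyword list; the name `build_inverted_index` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(forward_index):
    """Turn {doc_id: [keywords]} into {keyword: [doc_ids]}."""
    inverted = defaultdict(set)
    for doc_id, keywords in forward_index.items():
        for keyword in keywords:
            inverted[keyword].add(doc_id)
    # Sort the document IDs so lookups return a stable order.
    return {keyword: sorted(doc_ids) for keyword, doc_ids in inverted.items()}


# Hypothetical forward index: each document ID maps to its keywords.
forward_index = {
    1: ["search", "engine", "index"],
    2: ["index", "ranking"],
    3: ["search", "ranking"],
}
inverted_index = build_inverted_index(forward_index)

# A query is now a single dictionary lookup instead of a scan of all documents.
docs_for_index = inverted_index.get("index", [])
```

With the inverted structure, the cost of a query depends on the number of documents listed under the keyword rather than on the size of the whole library.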
Preprocessing is one part of a complex search engine. This series discusses some of its basics from nine aspects, so that everyone gains an understanding that will help with future website design and SEO. Of course, this is only introductory knowledge; if anything here is wrong, I hope everyone will point it out. Now, on to today's text.
Fifth, deduplication: Deduplication is an important step. The amount of information on the Internet is huge, and because people like to share, there is a great deal of duplicate content. If search engines did not deduplicate, they would crawl and index the same content over and over. The usual method is to compute a fingerprint from a page's keywords, typically with the MD5 algorithm: the engine selects the most representative keywords from the page and computes the fingerprint from them, in order to judge whether an article is original. The fingerprint calculation is often quite precise, so ordinary pseudo-original (lightly rewritten) content will generally be detected, and it is easy for the engine to determine that you are copying.
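The keyword-fingerprint idea can be sketched as follows. This is a minimal sketch, not any search engine's actual procedure: it assumes that "most representative" means "most frequent", and the name `page_fingerprint` and the `top_n` cutoff are hypothetical. Only `hashlib.md5` and `collections.Counter` from the Python standard library are used.

```python
import hashlib
from collections import Counter


def page_fingerprint(keywords, top_n=10):
    """Compute an MD5 fingerprint from a page's most frequent keywords."""
    counts = Counter(keywords)
    # Pick the top_n keywords by frequency; ties are broken alphabetically
    # so that the selection is deterministic.
    top = sorted(counts, key=lambda k: (-counts[k], k))[:top_n]
    # Sort the selected keywords so that word order on the page does not matter.
    canonical = " ".join(sorted(top))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Hypothetical pages, already reduced to keyword lists by earlier steps.
original = ["seo", "index", "seo", "spider", "index", "seo"]
reshuffled = ["spider", "index", "seo", "seo", "index", "seo"]
different = ["recipe", "cake", "oven", "cake"]

seen = {page_fingerprint(original)}
is_copy = page_fingerprint(reshuffled) in seen      # same keywords, new order
is_other_copy = page_fingerprint(different) in seen  # unrelated page
```

Because the fingerprint depends only on the selected keywords, simply reordering paragraphs does not change it, which matches the point above that lightly rewritten copies are easy to detect.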
Sixth, the forward index: The forward index is also simply called the index. After the spider has run text extraction, word segmentation, noise removal, and deduplication on a page, what remains are the keywords that reflect the page's subject. The search engine groups the words that represent the subject of the page into a keyword set, and records for each keyword its number of occurrences, its format, and its frequency on the page. It then stores these sets in the index library. In this huge library, each file corresponds to an ID followed by a series of keywords. The search engine keeps filling its index library in this way, laying the groundwork for ranking.
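A forward index of this kind can be sketched as a dictionary keyed by file ID. This is a minimal sketch under stated assumptions: the name `build_forward_index` and the record fields `count` and `frequency` are hypothetical, and the keyword "format" information mentioned above (for example, whether a word appears in a title) is left out for brevity.

```python
from collections import Counter


def build_forward_index(documents):
    """Map each document ID to its keywords with count and frequency.

    documents: dict mapping a document ID to the list of keywords left
    after extraction, segmentation, noise removal and deduplication.
    """
    index = {}
    for doc_id, keywords in documents.items():
        total = len(keywords)
        counts = Counter(keywords)
        index[doc_id] = {
            keyword: {"count": count, "frequency": count / total}
            for keyword, count in counts.items()
        }
    return index


# Hypothetical preprocessed pages.
documents = {
    1: ["index", "spider", "index", "ranking"],
    2: ["link", "weight"],
}
forward = build_forward_index(documents)
```

Each entry is organized by file, which is convenient for storage but means that finding all files for one keyword requires scanning every entry.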
The earlier article on this site, "Search engine preprocessing from nine aspects (1)", covered four aspects of the "index" preprocessing stage: text extraction, Chinese word segmentation, noise removal, and stop-word removal. I believe these basic articles will be of help to everyone. Today's article continues with five more aspects: deduplication, the forward index, the inverted index, link calculation, and special file processing.
The calculation of