A. The Yandex Leak: 7 Link Related Factors (of 100s)
A former Yandex employee has leaked the source code of the search engine and other services. This provides exciting insights into the inner workings of the search engine: ranking factors, weightings, and more. Yandex is the search engine market leader in Russia and fifth in the world by page views. While Yandex is not Google, the essential workings of search engines are comparable. The following findings are not necessarily directly applicable to Google, but they do provide a fascinating insight. There is an extensive list of 1,922 ranking factors shared and analyzed by many SEOs.
Thousands more ranking factors have yet to be discussed at this time.
- 275 personalization factors
- 220 “web freshness” factors
- 3,186 image search factors
- 2,314 video search factors and many more.
Some factors are marked as TG_UNUSED (149) or TG_REMOVED (115).
Note: Several SEOs understand the factor tag TG_DEPRECATED wrong, in that they assume these 242 of the first 1922 are unused or removed. tag TG_DEPRECATED. This tag just means that the factor shall not be used in new “Matrixnet formulas”, but is still calculated in all existing ranking formulas.
That leaves 1,658 factors from the first batch of 1,922 to be active ranking factors, with many more in other parts of the code base. That is substantially more than the approximately 200 that Google has mentioned so far to us. As Google has already confirmed, Yandex also uses different algorithms and weightings depending on the search query. It differentiates by the time of day, commercialism, adult content, and many more queries. An initial list of ranking factor weights is also known. To break down into some details, I picked some interesting link-related factors.
As explained, there are many link-related factors in the code, and each of them deserves days of research and analysis on the code. But the available comments, also in the “common 1923 factors”, is also very useful.
I’ve always “preached” that links need to age like good old wine. There has been a short period when links would work immediately (mid-2003 when I started in SEO) in Google. But shortly after that, a lot of delays, filters, and in general link damping factors were introduced. All the papers about combating web link spam were written about that 20 years ago already, and we see these concepts in the Yandex leak clearly confirmed also.
b. Anchor Text matters
SEO practitioners know that anchor text matters for SEO big time, and that is confirmed also in the Yandex leak. Even in the “common 1922,” there are 146 factors ONLY related to the Anchor Text, marked with TG_LINK_TEXT. There are probably at least a couple dozen more in other areas like image search and video search. So we’re talking about roughly 200 factors (actually “Features”) that Yandex has defined for the link text. These factors are then multiplied by several aspects, like geolocation, topic, and many more the “MatrixNet” system appears, which is also documented online. My age-old saying that the rules for links and their effects are different per country, topic, language, etc., is confirmed. These two aspects could take weeks to learn more about if conducted by highly experienced, professional software engineers. It is still very early, especially since many people don’t even look at the code and its comments.
For example, there’s the tag TG_DEPRECATED. The code clearly states what it means:
- “the factor shall not be used anywhere, but it is still computed and may be present in existing formulas.”
- It clearly states that these factors can still be used in some MatrixNet “formulas.” That is the typical use of “Deprecation” of code elements.
Yet many SEOs discard those factors claiming that they would be unused. There is a tag in the code called TG_REMOVED (“the factor is completely removed”).
Of course, a Pagerank calculation is the first factor in the list. That was and is still the “secret” of Google, initially published in their famous “backrub paper”, but used still today in many variants at Google and Yandex as well.
There are Static and Dynamic factors. The Dynamic factors are evaluated on a per-query basis, like the FI_LINK_RELEV Link Relevance factor. The same method applies to anchor text-related factors.
e. Pagerank Bonus for Long-Tail Queries
With the factor FI_PAGE_RANK_BONUS, a page rank bonus is applied for long-tail queries. This page rank bonus is assigned dynamically to almost all two or more word requests, except for a tiny number of queries. Giving links more weight regarding long tail queries makes sense since they occur much less on the web. It must be necessary if there is an exact match between a three- or four-word anchor text and query. We see in the initial weights that these factors are all active in Yandex.
Actually, two factors, closely related are for assigning different weights for exact matches or phrase matches between anchor text and the query. This factor assignment must also be dynamic, so the value of links changes dynamically, depending on the query. While many link factors are considered static, I think this is very interesting. This means that links are interpreted stronger or weaker, depending on the search query.
g. BM25 for Anchor Text and Body Text
While many SEOs swear by TF*IDF it turns out that Yandex actually uses a different approach of similar age for their term weights. BM25 is a classical method in information retrieval, just different from all SEO on-page tools out there I know. This BM25 method is applied to links in one factor and to body text and links combined in another factor. Right now, some onpage-tool-vendors are hectically looking into implementing BM25 in their software.
- We’re just getting started. This provides a rough overview for you of what’s in there.
- We’re just scratching the surface here with so many more valuable insights ahead.
- But we were quite right in many assumptions and interpretations from the outside of how such an extensive search engine would work, at least regarding links.
- All in all, the Yandex code leak offers a fascinating insight into the inner workings of a modern search engine.
- Although not all of the findings can be directly applied to Google, many assumptions made in recent years about the general functioning of large Internet search engines are confirmed.
- I assume the SEO industry still has a few interesting months ahead of it with new insights from this leak.