
Nearly 45GB of source code files, allegedly stolen by a former employee, have revealed the underpinnings of Russian tech giant Yandex’s many apps and services. They also revealed key ranking factors for Yandex’s search engine, the kind that are almost never disclosed publicly.
The “Yandex git sources” were posted as a torrent file on January 25 and show files apparently taken in July 2022 and dating back to February 2022. Software engineer Arseniy Shestakov claims he confirmed with current and former Yandex employees that some archives “certainly contain modern source code for enterprise services.” Yandex told security blog BleepingComputer that “Yandex was not hacked” and that the leak came from a former employee. Yandex stated that it did not “see any threat to user data or platform performance.”
The documents specifically date to February 2022, when Russia began a full-scale invasion of Ukraine. A former executive at Yandex told BleepingComputer that the leak was “political” and noted that the former employee had not tried to sell the code to Yandex competitors. Anti-spam code was not leaked either.
While it’s not clear whether there are security or structural implications of Yandex’s source code disclosure, the leak of 1,922 ranking factors in Yandex’s search algorithm is certainly making waves. SEO consultant Martin MacDonald described the leak on Twitter as “probably the most interesting thing to happen in SEO in years” (as noted by Search Engine Land). In a thread detailing some of the more notable factors, Alex Buraks suggests that “there is a lot of useful information for Google SEO as well.”
Yandex, the fourth-ranked search engine by volume, reportedly employs several former Google employees. Yandex tracks many of Google’s ranking factors, identifiable in its code, and competes heavily with Google. Google’s Russian division recently filed for bankruptcy after losing its bank accounts and payment services. Buraks notes that the first factor in Yandex’s list of ranking factors is “PAGE_RANK,” which is apparently tied to the basic algorithm created by Google’s co-founders.
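For readers unfamiliar with the algorithm that name refers to, here is a minimal sketch of textbook PageRank computed by power iteration. It is not taken from the leaked code and says nothing about how Yandex actually computes its “PAGE_RANK” factor; the function name, the 0.85 damping value, and the four-page link graph are all illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative, not Yandex's code).

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: "a" is linked from two pages, "d" from none.
toy_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
toy_ranks = pagerank(toy_links)
```

In this toy graph, the page with the most incoming links ends up with the highest score and the page nobody links to ends up with the lowest, which is the core idea behind the original algorithm.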
As described by Buraks (in two threads), Yandex’s engine favors pages that:
- Aren’t too old
- Have a lot of organic traffic (unique visitors) and less search-driven traffic
- Have fewer numbers and slashes in their URL
- Have optimized code instead of “hard pessimization,” with a “PR=0”
- Are hosted on reliable servers
- Happen to be Wikipedia pages or are linked from Wikipedia
- Are hosted on or linked from higher-level pages on a domain
- Include keywords in their URL (up to three)
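The URL-related items in that list are simple enough to measure yourself. The following is a rough sketch of how one might count them for a given page; it is not derived from the leaked code, and the function name, the choice to inspect only the URL path, and the example URLs are assumptions made for illustration. The cap of three keyword matches mirrors the “up to three” note above.

```python
import re
from urllib.parse import urlparse


def url_features(url, keywords):
    """Count digits, slashes, and keyword matches in a URL's path.

    Hypothetical helper: looking only at the path (not the domain or
    query string) is an assumption, not something the leak specifies.
    """
    path = urlparse(url).path
    digits = sum(ch.isdigit() for ch in path)
    slashes = path.count("/")
    # Split the path into lowercase alphabetic words for keyword matching.
    words = set(re.findall(r"[a-z]+", path.lower()))
    matches = sum(1 for keyword in keywords if keyword.lower() in words)
    return {
        "digits": digits,
        "slashes": slashes,
        # Matches beyond three are not counted, per the list above.
        "keyword_hits": min(3, matches),
    }


# Two made-up URLs: a deep, number-heavy path and a short, keyword-rich one.
deep = url_features("https://example.com/blog/2022/01/25/post-12345", ["blog", "post"])
flat = url_features(
    "https://example.com/yandex-search-ranking-factors",
    ["yandex", "search", "ranking", "factors"],
)
```

By the criteria Buraks describes, the second URL would look better on all three counts: no digits, a single slash, and the maximum number of counted keywords. How much weight Yandex gives each signal is not something this sketch attempts to model.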
You can search and click through all the factors on Rob Ousbey’s compiled search tool. You may notice that almost 1,000 of the ranking factors are tagged “TG_DEPRECATED” and more than 200 are listed as “TG_UNUSED”. The code dates to February 2022 and was retrieved in July 2022, so Yandex’s search has almost certainly changed since then. But the leak provides a rare insight into how search rankings are compiled on a website that serves one of the world’s largest countries.
Yandex previously saw its search engine code go out the door in 2015, when a former employee tried to sell it on the black market for $28,000 to fund his own startup. The surprisingly low figure for the core code of Yandex’s main product suggested that he was unaware of its real value. The employee was sentenced to two years’ probation, and the code was never seen publicly.