
Reddit's Early Customizations for the Google Search Engine

source link: https://blog.gslin.org/archives/2024/04/20/11750/reddit-%e7%95%b6%e5%88%9d%e5%b0%8d-google-%e6%90%9c%e5%b0%8b%e5%bc%95%e6%93%8e%e7%9a%84%e5%ae%a2%e8%a3%bd%e5%8c%96%e8%a8%ad%e8%a8%88/


I came across this discussion on Hacker News, and it happens to include comments from Jeremy Edberg, the first engineer Reddit hired: "Reddit is taking over Google (businessinsider.com)".

The comment at id=40068381 covers quite a few things. First is the practice of putting the post title into the URL:

To this day, my most public contribution to reddit is that I wrote the code to put the title of the post in the URL. That was done specifically for SEO purposes.
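As a rough illustration of what "title in the URL" means, here is a minimal sketch of a slug generator. The function names `title_slug` and `post_url`, the 50-character limit, the underscore separator, and the URL layout are all assumptions made for this example; the comment does not describe how Reddit's actual code works.

```python
import re
import unicodedata


def title_slug(title: str, max_length: int = 50) -> str:
    """Turn a post title into a URL-safe slug (hypothetical helper)."""
    # Drop accents and any characters that cannot be represented in ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_title.lower()).strip("_")
    if len(slug) > max_length:
        # Cut at the limit, then back up to the last separator so the slug
        # does not end in the middle of a word.
        slug = slug[:max_length].rsplit("_", 1)[0]
    # Titles with no ASCII letters or digits fall back to a fixed placeholder.
    return slug or "untitled"


def post_url(subreddit: str, post_id: str, title: str) -> str:
    """Build a post path that carries the title slug (assumed layout)."""
    return f"/r/{subreddit}/comments/{post_id}/{title_slug(title)}/"


print(post_url("technology", "abc123", "Reddit is taking over Google"))
```

The post ID still identifies the post on its own; the slug only adds readable keywords to the URL, which is the SEO benefit the comment refers to.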

On the Google Webmasters side (now called Google Search Console), Google also gave Reddit special treatment: the crawl rate was forced to "Custom", and Reddit's staff could no longer change it:

It was pretty much the only SEO optimization we ever did (along with a few DOM changes), because shortly after that, Google basically dedicated engineering effort specifically to crawling reddit. So much so that we lost the "crawl rate" button in our SEO admin page on Google, it was just set to "Custom".

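Later they also had to split Google's crawl traffic off at the load balancer, because the crawling pattern is very different from ordinary usage and would otherwise make the caches extremely inefficient: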

I had to stand up a fleet of app servers and caches and databases, and change the load balancers so that Google basically had their own infrastructure (although we would shunt all crawlers there). Crawler traffic was very different than regular traffic -- it looked at pages more than two days old, something humans rarely did at the time. It would blow out every cache (memory, database, disk, etc.). So we put them on their own infra to keep them from killing the rest of the site.
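The routing decision described in the quote can be sketched as a small function that picks a backend pool per request. Everything here is an assumption for illustration: the pool names, the `choose_pool` function, and the list of User-Agent tokens are hypothetical, and the comment does not say how Reddit's load balancers actually identified crawlers. A User-Agent header can also be spoofed, so a real setup would likely want stronger verification.

```python
from typing import Optional

# Hypothetical list of substrings that mark well-known crawlers.
CRAWLER_TOKENS = (
    "googlebot",
    "bingbot",
    "slurp",
    "duckduckbot",
    "baiduspider",
    "yandexbot",
)


def choose_pool(user_agent: Optional[str]) -> str:
    """Return the backend pool name for a request (assumed names)."""
    ua = (user_agent or "").lower()
    # Send every recognized crawler to its own app servers, caches, and
    # databases so its access pattern cannot evict the main site's hot data.
    if any(token in ua for token in CRAWLER_TOKENS):
        return "crawler"
    return "main"


print(choose_pool("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(choose_pool("Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"))
```

The point of the separation is cache locality: human traffic concentrates on recent pages, while crawlers walk old ones, so sharing one cache lets the crawler's reads push out the entries humans actually need.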

These are fairly interesting war stories, I'd say.


Author: Gea-Suan Lin. Posted on April 20, 2024. Categories: Computer, Murmuring, Network, Search Engine, Service, WWW. Tags: engine, google, performance, reddit, search, seo, title, url.
