36

PageFish search algorithm

 5 years ago
source link: https://www.tuicool.com/articles/hit/3YjUnyY
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

pagefish

Search engine with a static webserver

Example usage

  1. Go to https://peterburk.github.io/pagefish/index.html
  2. Type a city name.
  3. Click search.
  4. See the results.
  5. (optional) Copy the relation ID to MeetYourMappers

How the hashing algorithm works:

1. Convert every character in the query to a number. (Peter = 80, 101, 116, 101, 114)

2. Add up the total of those numbers (80 + 101 + 116 + 101 + 114 = 512).

3. Get the Unicode character for that number (512 = Ȁ).

4. Load the index file that uses that character as a filename. (GeoNamesIndex/Ȁ.txt)

5. Grep the query again within that file (clashes exist, e.g. "Dover" = 68 + 111 + 118 + 101 + 114 = 512)

Why this is important:

+ The router and filesystem help the search.

+ Characters are just numbers. It's faster to search shorter strings.

+ Server-side operations cost money. Client-side JavaScript is free.

+ The index can be stored on a static file server e.g. Github.

Known limitations:

- Fuzzy matching isn't supported (but this is not a problem for Chinese)

- Some characters aren't allowed in filenames, URLs, etc.

- The hash algorithm should be balanced so that index files are a similar size.

Q7rUjqJ.png!web

Benchmarks:

time grep -c "Paris" geonames-relations-name-id-lat-lon.txt

0.382s = 382 ms

Typical search time on local: 30 ms.

Typical search time on Github: 200 ms.

This gives a 10x speedup compared to grep.

How to build the index:

cc -lm geonamesToIndex.c -o geonamesToIndex

time cat geonames_relations.txt | ./geonamesToIndex | ./geonamesToIndex

17.185s

Having many small files in the index uses more space. But storage is cheap.

Original file geonames-relations-name-id-lat-lon.txt: 25.8 MB

Index size GeoNamesIndex: 94 MB

About the author

Peter Burkimsher is currently working for OSE, a memory card manufacturer in Kaohsiung, Taiwan.

I did several projects during my 3.5 years here, including writing bit-level drivers to control memory card testing equipment, a Win32 app to parse log files and send them to a SQL server, network monitoring, and Big Data processing.

I studied Electronic Systems Engineering at Lancaster University and graduated with first-class honours in 2011, before going off on Working Holiday visas to see the world.

Now I have 3 years continuous relevant work experience, I’m looking for a suitable job in New Zealand so I can get the Skilled Migrant Category visa.

(Canada’s Express Entry and Australia are also options).


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK