Experiment: GitHub code search with de-duplication

github-code-search

This is a demo of GitHub code search that de-duplicates results from identical files. It compares SHA1 hashes and file contents to mark files as identical, and combines their results into one.
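As a rough illustration of that grouping step, here is a minimal sketch in Python. It is not taken from search_github.py: the function name and the result dictionaries are hypothetical, and it assumes each result already carries the file's hash ("sha"), its fetched contents ("content"), and a link ("html_url").

```python
from collections import defaultdict


def group_duplicates(results):
    """Merge results whose files have the same hash and the same contents.

    `results` is a list of dicts with the (assumed) keys "sha", "content"
    and "html_url". Returns one entry per distinct file.
    """
    groups = defaultdict(list)
    for result in results:
        # Key on the hash AND the contents, so two files are only merged
        # when both agree.
        groups[(result["sha"], result["content"])].append(result)

    # The first result in each group is displayed; the rest are duplicates.
    return [
        {"primary": members[0], "duplicates": members[1:]}
        for members in groups.values()
    ]


# Made-up results: the first two are the same file copied into two repos.
results = [
    {"sha": "aaa111", "content": "import requests\n",
     "html_url": "https://example.com/repo-one/client.py"},
    {"sha": "aaa111", "content": "import requests\n",
     "html_url": "https://example.com/repo-two/client.py"},
    {"sha": "bbb222", "content": "import urllib\n",
     "html_url": "https://example.com/repo-three/client.py"},
]

for group in group_duplicates(results):
    extra = len(group["duplicates"])
    suffix = " (+{} duplicates)".format(extra) if extra else ""
    print(group["primary"]["html_url"] + suffix)
```

Keying on both values means a matching hash alone is never enough to hide a result.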

It's a proof-of-concept, rather than a hardened search tool.

Dear GitHub, please steal these ideas for the first-party code search :wink:


Installation

Clone this repository, create a new virtualenv with Python 3.6 or later, and install the dependencies.

$ git clone git@github.com:alexwlchan/github-code-search.git
$ cd github-code-search
$ virtualenv env
$ source env/bin/activate
$ pip3 install -r requirements.txt

You also need a personal access token for GitHub, which you can get from the GitHub developer settings. This token needs the public_repo scope.

Usage

Run the search_github.py script, passing your query and API token:

$ python search_github.py "lang:python requests.get" --api_token=abc123

This will load the search results, render them as an HTML file, and open the file in your web browser.
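The three steps (load, render, open) might look roughly like the sketch below. This is an assumption about the general shape, not the actual contents of search_github.py: the helper names are hypothetical and it uses only the standard library. The endpoint is GitHub's REST code search endpoint, and the "path" and "html_url" keys are fields of the items it returns.

```python
import html
import pathlib
import tempfile
import urllib.parse
import urllib.request
import webbrowser

API_URL = "https://api.github.com/search/code"


def build_search_request(query, api_token):
    """Build (but do not send) an authenticated code search request."""
    url = API_URL + "?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(
        url,
        headers={
            "Authorization": "token " + api_token,
            "Accept": "application/vnd.github+json",
        },
    )


def render_html(items):
    """Render search result items as a bare-bones HTML list of links."""
    rows = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(
            html.escape(item["html_url"], quote=True),
            html.escape(item["path"]),
        )
        for item in items
    )
    return "<!DOCTYPE html>\n<ul>\n{}\n</ul>\n".format(rows)


def open_in_browser(page):
    """Write the page to a temporary .html file and open it in the browser."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as f:
        f.write(page)
    webbrowser.open(pathlib.Path(f.name).as_uri())


# Rendering a made-up item; no network access or browser is needed for this.
print(render_html([{"html_url": "https://example.com/a", "path": "src/a.py"}]))
```

To run the whole flow, you would send the request with urllib.request.urlopen, parse the JSON body, pass its "items" list to render_html, and hand the result to open_in_browser.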

Next steps

This is a proof-of-concept I wrote in a single train journey, not a hardened application. I'm not planning to work on it any further, but I did have some ideas on what you could do next:

Should duplicate results rank higher in the search? If the same file appears in 100 repos, should that increase its search ranking? I'm not sure -- it would be interesting to experiment.
Detect nearly-duplicate files. If two files are the same except for some lines that are unrelated to the search, treat them as the same file. This requires more sophisticated diffing logic.
Highlight search terms inside the code snippets. Regular GitHub code search highlights the search terms within each code snippet. I don't do that yet, but there's (probably) enough information in the API to do it.
Make it faster! Right now it's pretty slow -- it has to fetch the contents of every unique file that appears in the search results. Parallelising the HTTP requests, or doing something fancy with GraphQL to reduce the number of requests, would make it faster.
Pagination. Right now it only uses the first page of results, even though the API is paginated. It'd be nice to expose later results in some way.
Give more visibility into the duplicate results. Any duplicate results are just hidden behind "+N duplicates". Completely hiding them is probably a mistake -- there might be circumstances in which you want to see the duplicates (although I don't think that's the common use case). It would be good if there were some way to see them if you really wanted.
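
Two of these ideas, near-duplicate detection and parallel fetching, can be sketched with the Python standard library. Treat this as one possible starting point rather than a worked-out design: the function names, the 0.8 similarity threshold, and the plain substring match on the search term are all assumptions made for illustration.

```python
import difflib
from concurrent.futures import ThreadPoolExecutor


def is_near_duplicate(a, b, term, min_ratio=0.8):
    """Return True if files `a` and `b` differ only on lines without `term`.

    `min_ratio` (an arbitrary threshold) is the minimum line-level
    similarity required before two files are compared at all.
    """
    a_lines, b_lines = a.splitlines(), b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    if matcher.ratio() < min_ratio:
        return False
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # A changed line that mentions the search term is a real difference.
        changed = a_lines[i1:i2] + b_lines[j1:j2]
        if any(term in line for line in changed):
            return False
    return True


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for each URL on a thread pool; return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`.
        return dict(zip(urls, pool.map(fetch, urls)))


# Made-up files: file_b changes a comment, file_c changes the matching line.
file_a = (
    "import requests\n\n# fetch the homepage\n"
    "resp = requests.get(URL)\nresp.raise_for_status()\nprint(resp.text)\n"
)
file_b = file_a.replace("# fetch the homepage", "# download the homepage")
file_c = file_a.replace("requests.get(URL)", "requests.get(URL, timeout=5)")

print(is_near_duplicate(file_a, file_b, "requests.get"))
print(is_near_duplicate(file_a, file_c, "requests.get"))
```

Passing the fetch function in as an argument keeps fetch_all independent of any particular HTTP library, and makes it easy to try out with a stand-in function before pointing it at the real API.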

License

MIT.
