62

Experiment: GitHub code search with de-duplication →

 4 years ago
source link: https://www.tuicool.com/articles/iA7rm2f
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

github-code-search

This is a demo of GitHub code search that de-duplicates results from identical files. It compares SHA1 hashes and file contents to mark files as identical, and combine their results into one.

It's a proof-of-concept, rather than a hardened search tool.

Dear GitHub, please steal these ideas for the first-party code search :wink:

vY3uueE.png!web

Installation

Clone this repository, create a new virtualenv with Python 3.6 or later, and install the dependencies.

$ git clone [email protected]:alexwlchan/github-code-search.git
$ cd github-code-search
$ virtualenv env
$ source env/bin/activate
$ pip3 install -r requirements.txt

You also need a personal access token for GitHub, which you can get from the GitHub developer settings . This token needs the public_repo scope.

Usage

Run the search_github.py script, passing your query and API token:

$ python search_github.py "lang:python requests.get" --api_token=abc123

This will load the search results, render them as an HTML file, and open the file in your web browser.

Next steps

This is a proof-of-concept I wrote in a single train journey, not a hardened application. I'm not planning to work on it any further, but I did have some ideas on what you could do next:

  • Should duplicate results weight higher in the search?If the same file appears in 100 repos, should that increase the search ranking? I'm not sure -- it would be interesting to experiment.

  • Detect nearly-duplicate files.If two files are the same, except for some lines that are unrelated to the search, treat those two files as the same. This requires more sophisticated diffing logic.

  • Highlight search terms inside the code snippets.If you use regular GitHub code search, it highlights the search terms within the code snippet. I don't do that yet, but there's (probably) enough information in the API to do that.

  • Make it faster!Right now it's pretty slow -- it has to fetch the contents of every unique file that appears in the search results. Parallelising the HTTP requests or doing something fancy with GraphQL to reduce the number of requests would make it faster.

  • Pagination.Right now it only uses the first page of results, even though the API is paginated. It'd be nice to expose later results in some way.

  • Give more visibility into the duplicate results.Any duplicate results are just hidden behind "+N duplicates". Completely hiding them is probably a mistake – there might be circumstances in which you want to see the duplicates (although I don't think that's the common use case). It would be good if there was some way to see them if you really wanted.

License

MIT.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK