Launch HN: Bloop (YC S21) – Code Search with GPT-4

source link: https://news.ycombinator.com/item?id=35236275

151 points by louiskw | 85 comments

Hey HN, we’re the cofounders of bloop (https://bloop.ai/), a code search engine which combines semantic search with GPT-4 to answer questions. We let you search your private codebases either the traditional way (regex or literal) or via semantic search, ask questions in natural language thanks to GPT-4, and jump between refs/defs with precise code navigation. Here’s a quick demo: https://www.loom.com/share/8e9d59b88dd2409482ec02cdda5b9185

Traditional code search tools match the terms in your query against the codebase, but often you don’t know the right terms to start with, e.g. ‘Which library do we use for model inference?’ (These types of questions are particularly common when you’re learning a new codebase.) bloop uses a combination of neural semantic code search (comparing the meaning - encoded in vector representations - of queries and code snippets) and chained LLM calls to retrieve and reason about abstract queries.

Ideally, an LLM could answer questions about your code directly, but fine-tuning the largest LLMs on private data carries significant overhead (and expense). And although prompt sizes are increasing, they’re still a long way from fitting a whole organisation’s codebase.

We get around these limitations with a two-step process. First, we use GPT-4 to generate a keyword query which is passed to a semantic search engine. This embeds the query and compares it to chunks of code in vector space (we use Qdrant as our vector DB). We’ve found that using a semantic search engine for retrieval improves recall, allowing the LLM to retrieve code that doesn’t have any textual overlap with the query but is still relevant. Second, the retrieved code snippets are ranked and inserted into a final LLM prompt. We pass this to GPT-4 and its phenomenal understanding of code does the rest.
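
In rough illustrative Python, the retrieval-and-answer flow looks something like this. This is a sketch, not our actual implementation (bloop is written in Rust), and the collection name and prompt wording are placeholders:

    # Sketch: retrieve code chunks for a keyword query, then let GPT-4
    # answer over them. The keyword query itself is produced by GPT-4
    # from the conversation (sketched after the worked example below).
    from openai import OpenAI
    from qdrant_client import QdrantClient
    from sentence_transformers import SentenceTransformer

    llm = OpenAI()
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    db = QdrantClient(path="./index")  # local vector store

    def answer(keyword_query: str, question: str) -> str:
        # Step 1: embed the keyword query and find nearby code chunks.
        hits = db.search(
            collection_name="code_chunks",  # placeholder name
            query_vector=embedder.encode(keyword_query).tolist(),
            limit=5,
        )
        snippets = "\n\n".join(h.payload["text"] for h in hits)
        # Step 2: insert the ranked snippets into a final GPT-4 prompt.
        resp = llm.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "Answer the question using only these code snippets:\n" + snippets},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content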

Let’s work through an example. You start off by asking ‘Where is the query parsing logic?’ and then want to find out ‘Which library does it use?’. We use GPT-4 to generate the standalone keyword query: ‘query parser library’, which we then pass to a semantic search engine that returns a snippet demonstrating the parser in action: ‘let pair = PestParser::parse(Rule::query, query);’. We insert this snippet into a prompt to GPT-4, which is able to work out that pest is the library doing the legwork here, generating the answer ‘The query parser uses the pest library’.
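
The condensing step that turns a follow-up into a standalone keyword query looks roughly like this (again a sketch; the prompt wording is a placeholder):

    # Sketch: rewrite a follow-up question, given the chat history, into
    # a standalone keyword query like 'query parser library'.
    from openai import OpenAI

    llm = OpenAI()

    def standalone_query(history: list[dict], followup: str) -> str:
        resp = llm.chat.completions.create(
            model="gpt-4",
            messages=history + [{
                "role": "user",
                "content": "Rewrite this follow-up as a short standalone "
                           "keyword query for searching the codebase: " + followup,
            }],
        )
        return resp.choices[0].message.content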

You can also filter your search by repo or language, e.g. ‘What’s the readiness delay repo:myApp lang:yaml’. GPT-4 will generate an answer constrained to that repo and language.
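
Pulling those qualifiers out of the query is straightforward; something like this hypothetical helper (not our exact syntax handling):

    # Hypothetical parsing of repo:/lang: qualifiers out of a query.
    import re

    def split_filters(query: str):
        filters = dict(re.findall(r"\b(repo|lang):(\S+)", query))
        text = re.sub(r"\b(?:repo|lang):\S+", "", query).strip()
        return text, filters

    # split_filters("What's the readiness delay repo:myApp lang:yaml")
    # -> ("What's the readiness delay", {"repo": "myApp", "lang": "yaml"})

The extracted filters can then constrain both the vector search (e.g. as payload filters) and the regex index.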

We also know that LLMs are not always (at least not yet) the best tool for the job. Sometimes you know exactly what you’re looking for. For this, we’ve built a fast regex search engine on top of a trigram index, using Tantivy, so bloop is fast at traditional search too. For code navigation, we’ve built a precise go-to-ref/def engine based on scope resolution, using Tree-sitter.
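
The trigram trick is worth a quick sketch: index every three-character substring, intersect posting lists to shortlist candidate documents, and only run the full regex on those (illustrative Python; Tantivy does the real work in bloop):

    # Minimal sketch of trigram-based candidate filtering for regex search.
    import re

    def trigrams(s: str) -> set[str]:
        return {s[i:i + 3] for i in range(len(s) - 2)}

    # Index time: map each trigram to the documents containing it.
    docs = ["let pair = PestParser::parse(Rule::query, query);", "fn main() {}"]
    index: dict[str, set[int]] = {}
    for doc_id, text in enumerate(docs):
        for t in trigrams(text):
            index.setdefault(t, set()).add(doc_id)

    # Query time: intersect posting lists for a literal fragment of the
    # pattern, then run the full regex only on the surviving candidates.
    def search(literal: str, pattern: str) -> list[int]:
        assert len(literal) >= 3  # need at least one trigram
        candidates = set.intersection(*(index.get(t, set()) for t in trigrams(literal)))
        return [d for d in sorted(candidates) if re.search(pattern, docs[d])]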

bloop is fully open-source. Semantic search, LLM prompts, regex search and code navigation are all contained in one repo: https://github.com/bloopAI/bloop.

Our software is standalone and doesn’t run in your IDE. We were originally IDE-based but moved away from this due to constraints on how we could display code to the user.

bloop runs as a free desktop app on Mac, Windows and Linux: https://github.com/bloopAI/bloop/releases. On desktop, your code is indexed with a MiniLM embedding model and stored locally, meaning at index time your codebase stays private. Indexing is fast, except on the very largest repos (GPU indexing coming soon). ‘Private’ here means that no code is shared with us or OpenAI at index time, and when a search is made only relevant code snippets are shared to generate the response. (This is more or less the same data usage as Copilot).
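
Index time, concretely, looks something like this sketch (MiniLM is the model we use, per above; the chunking and storage details here are simplified placeholders):

    # Sketch of local index-time embedding: chunk files, embed with MiniLM,
    # store vectors on disk. Nothing is sent to us or OpenAI at this stage.
    from pathlib import Path
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim
    db = QdrantClient(path="./index")  # stored locally
    db.recreate_collection(
        "code_chunks",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

    points, pid = [], 0
    for f in Path("my_repo").rglob("*.rs"):  # placeholder repo path
        text = f.read_text(errors="ignore")
        for i in range(0, len(text), 1000):  # naive fixed-size chunking
            chunk = text[i:i + 1000]
            points.append(PointStruct(
                id=pid,
                vector=embedder.encode(chunk).tolist(),
                payload={"path": str(f), "text": chunk},
            ))
            pid += 1
    db.upsert("code_chunks", points=points)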

We also have a paid cloud offering for teams ($12 per user per month). Members of the same organisation can search a shared index hosted by us.

We’d love to hear your thoughts on the product, where you think we should take it next, and on code search in general. We look forward to your comments!

