
Ask HN: Any indications Copilot scans your local files?

source link: https://news.ycombinator.com/item?id=29150011
145 points by polycaster 4 hours ago | 54 comments

A client of mine is just about to launch a startup. There's nothing public on the web yet. Today, while hacking a very minimal prototype of an HTML page in VSCode, I got a strange suggestion from GH Copilot. When I entered the client's name, Copilot prompted me with a ready-made <article> block containing a marketing claim about their company. So far, so good.

The strange thing is that the claim is not public: it is not contained in the codebase I'm currently working on, nor in any codebase I know of, and to my knowledge I'm the only related dev using Copilot. It's also not listed on Google, so I assume it hasn't leaked somewhere else that could be indexed by GH (which is an assumption of course, but appears likely). But it can be found in a completely separate local folder of project assets that is not published on GH.

The marketing claim is about the length of a tweet and is not exactly generic. It requires an understanding of my client's business (which, again, cannot be derived from my codebase). So it's not GPT-3 output that matches coincidentally.

The GH „About GitHub Copilot Telemetry” page [1] does not indicate that your local file system is scanned, though.

Can anyone explain this, or has anyone observed a similar phenomenon?

[1] https://docs.github.com/en/github/copilot/about-github-copilot-telemetry

Yes, other files open in your IDE may also be scanned.

From the terms of service [1] (which I'm sure everyone reads):

> when you edit files with the GitHub Copilot extension/plugin enabled, file content snippets [...] will be shared with GitHub, Microsoft, and OpenAI, and used for diagnostic purposes to improve suggestions and related products. GitHub Copilot relies on file content for context, both in the file you are editing and potentially other files open in the same IDE instance.

[1] https://docs.github.com/en/github/copilot/github-copilot-tel...

Thanks for the pointer but I don't think this is what happened in my case. I never opened the marketing assets in VSCode. They reside in completely separate folders.
Are you using Windows? IIRC Microsoft collects a hefty amount of data from your filesystem.
The telemetry data does not (and should not) contain the file contents.
According to Microsoft's documentation [1], it's unintentional. Presumably it's picked up from whatever happens to be in memory, rather than being collected deliberately.

>which may unintentionally contain user content, such as parts of a file you were using when the problem occurred

[1] https://docs.microsoft.com/en-us/windows/privacy/configure-w...

Another reason why turning off automatic bug reporting is a good idea.
If VSCode scanned your files to feed Copilot, it's not just Microsoft who could access that, but anyone using Copilot. Every single file on every single machine that ever ran VSCode would potentially be within anyone's grasp.

Allowing this to happen would've been incredibly stupid. But it's worth investigating further.

> Every single file on every single machine that ever ran VSCode would potentially be within anyone's grasp.

Even if they were scanning all the files, Copilot's terms [1] explicitly say they do not use the data to provide suggestions for other users.

> GitHub Copilot does not use these URLs, file paths, or snippets collected in your telemetry as suggestions for other users of GitHub Copilot. This information is treated as confidential information and accessed on a need-to-know basis.

[1] https://docs.github.com/en/github/copilot/github-copilot-tel...

Last time I checked, that wasn't implemented yet, though. Maybe they are starting to deploy a version that uses the other open files for context, but mine is obviously not doing that yet.
I'll second the plausible suggestion that perhaps the name/marketing copy combo isn't as unpredictable as you might think. Corporate speak and company names are pretty formulaic. Try running the name and <article> through GPT-3 and see what happens (or GPT-2 here: https://bellard.org/textsynth/)

E.g., I just prompted GPT-2 with a made-up company name that doesn't have any Google search results and got a completion like this:

<p><em>Fully featured webapp for social & mobile networks in the cloud</em></p>
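If you want to run the same probe against GPT-3 itself, here is a minimal sketch, assuming the openai Python package and an API key; the company name "AcmeRocketGrams" and the prompt shape are made up for illustration:

    import openai

    openai.api_key = "sk-..."  # your own API key

    # Prompt with nothing but an invented company name and the start of an
    # <article> block, then see what marketing copy the model completes.
    prompt = '<article>\n  <h1>AcmeRocketGrams</h1>\n  <p>'
    resp = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=80,
        temperature=0.7,
    )
    print(prompt + resp["choices"][0]["text"])

If plausible corporate copy comes back for a name with zero search results, that supports the "it's just the language model being convincing" explanation.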

Assuming you're on Windows, you can see all of a process's IO using a tool like Process Monitor: https://docs.microsoft.com/en-us/sysinternals/downloads/proc...

FYI: It's a firehose, but you should be able to filter it down to copilot and a path prefix. Then you'll know if it's being scanned.

Have you considered that your marketing blurb is actually not that novel after all? GPT-3 is damn convincing these days :)
This is a joke, but it may be the correct answer: if the company name is self-demonstrating, it's possible Codex could recognize that.
I don't think so. The phrase is 272 characters long and contains very specific terms which cannot be derived from the company's name. Also, it's a 1:1 match with the unpublished marketing material.
The entire point of GPT-3 is to create novel content that is similar to existing content. So the fact that Copilot suggested it doesn't mean it isn't novel as a whole.
Do the following experiment:

1) Put a new blurb in the same folder where you had the original marketing blurb. Make sure it is unique but still looks like English. Example: now your startup allows you to fly to the moon in rockets powered by angry tweets. (A minimal sketch of this step follows below.)

2) Try to force a reload of copilot. Maybe reinstall it?

3) Recreate the conditions that suggested the first blurb, but trying to suggest the second blurb.

4) Share the results, I am curious.

If you get that suggestion, you have proof and reproducible steps for others to get proof. If you don't get that suggestion, we can't be sure, but the odds of it being just a coincidence increase.
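Step 1 could look something like this; a minimal sketch in Python, where the folder path and the decoy wording are made-up placeholders:

    import pathlib
    import uuid

    # Hypothetical path to the local assets folder that also holds the real blurb.
    assets = pathlib.Path("~/clients/acme/marketing-assets").expanduser()
    assets.mkdir(parents=True, exist_ok=True)

    # A unique, grep-able "canary" sentence that still reads like English.
    canary = (
        "Acme now lets you fly to the moon in rockets powered by angry tweets "
        f"(canary {uuid.uuid4().hex[:8]})."
    )
    (assets / "canary-blurb.txt").write_text(canary)
    print(canary)  # keep a copy so you recognise it if Copilot ever suggests it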

Good luck!

I’m assuming you meant “blurb” instead of “blur”?
Thanks, corrected. While we are at it: is it "make the following experiment" or "do the following experiment"?
Yes, "do the following experiment" sounds more natural. IMO, "Conduct" conveys precision and is appropriate for more formal writing.
I noticed a very peculiar thing with copilot.

I was writing a Twitter API wrapper and had hardcoded an access token for a test user. When I was testing a function that requires a user id, Copilot suggested one. I searched for the id and it belonged to my test user. I searched my entire codebase but couldn't find any place I had used that particular id. The only place it could have extracted it from would have been the access token, which has the user id as part of the string (which I hadn't noticed before this). Either this is a common code pattern or I don't know how to process this.

This is something I've seen with copilot with market data.

I was creating a unit test in a Go codebase and I had dumped the JSON that I was going to be decoding at the top of the file, and when I started writing the assertions, Copilot was very quick to use the data from the JSON, with quite high accuracy, based solely on me typing which ticker I was going to assert against.

I have seen that as well and it is impressive. But this is more than that. The code contained a string variable named accessToken with a value like "218276172612672-jash127hg27128h" (random data here, not the actual id), where 218276172612672 was the user id. When testing a function that required a user id, it not only suggested 218276172612672 but also did it with the full context: participant.follow({twitterUserId: "218276172612672"})
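For what it's worth, Twitter's user access tokens embed the numeric user id before the first hyphen, so the id is recoverable from the token string alone. A minimal sketch, reusing the fake value from the comment above:

    # Twitter user access tokens have the form "<user_id>-<random secret>",
    # so the numeric id can be read straight off the token.
    access_token = "218276172612672-jash127hg27128h"  # fake value from the comment above
    twitter_user_id = access_token.split("-", 1)[0]
    print(twitter_user_id)  # -> 218276172612672

If the model has seen that pattern often enough, it could plausibly derive the id from the token alone.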
Seems reasonable. What's your question? GPT-3 is just that good.
If it's always in the same location inside the API key, it seems very reasonable that it would pick up that pattern, since most projects that include a hardcoded key would include it in the same way. API key variables are probably named almost the same everywhere.
What was the user id? testuser1? Or something more convoluted?
The Twitter user id for a user I created (so nothing popular either). A 64-bit unsigned integer as a string here.
Well, this could be a huge security issue. It could potentially lead to a new form of Copilot-surfing for company secrets, since Copilot is already leaking secret API keys and copyrighted code.

The dangers of just regurgitating what has been read are unreal, since with good enough targeting you can read data someone else wrote and expected to stay anonymized. It's like a huge global RAM of code; you just need to figure out how to get it to point at the right addresses.

Hacking has never been easier. Just type "username: thecupisblue, password:" and wait for autocomplete. :D
How exactly does Copilot not open Microsoft up to significant legal liability, when it has been demonstrated that copilot will regurgitate entire blocks of scanned code?
It's unclear whether MS would be directly liable for copyright infringement; the Aereo case[0] perhaps expanded a device-maker's direct liability for user actions. MS would likely escape secondary copyright liability because the verbatim suggestions are really rare, and the standard under the Sony VCR case[1] is whether the device is "capable of substantial non-infringing uses."

EDIT: The user would perhaps be the direct infringer as the person who made a copy of the code, even unwittingly.

Source: current law student taking copyright and writing a paper about Copilot.

[0] https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....

[1] https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....

Fascinating, thank you! I imagine this would make it even more likely that large orgs would ban the use of VS Code.
Ya, I'd be curious to hear how orgs are thinking about it. It could be that the risk of a copyright judgment against the org is outweighed by the efficiency gains from Copilot. The risk involved has an element of discoverability – how the heck is the OSS developer whose small snippet of code gets sucked into a closed-source project going to learn of the infringement?

Other stuff within copyright doctrine makes infringement of a small Copilot snippet less likely too, and the recent Google v. Oracle decision reinforced the fact that code is an awkward fit within copyright and may sometimes get less protection than things like fictional books. Fair use comes in here, too; remember that something as "egregious" as Google copying 37 Java packages' worth of package / class / method declarations into Android was held a fair use. Declaring code is sort of special because it's more like an interface, but the Court really reinforced the notion that it likes cool new technology that opens new markets, makes new products, etc., and a fair use finding for small snippets of code is in line with adapting copyright doctrine in the face of tech changes like AI / ML.

I think it also reads your clipboard. Yesterday, I had copied something from stackoverflow, and was about to paste it, and it gave me a suggestion before I could even drop it in.
Is the name of the company unique and very random, or something more along the lines of "WeatherForecastsForFishermen Inc"?

Are you sure the AI couldn't get some context from the current file? No title tag in the head, no description or keywords? The filename/path is also used; could that be it?

Oof, that doesn't sound great.

Just out of curiosity, did you seek permission to use Copilot from your client? I wonder how widely accepted it is in roles which handle sensitive data.

Surely this could be verified in a VM?

I have Copilot enabled in a single workspace and tried some unique keywords from other projects (where Copilot is not enabled) and it could not generate anything similar.

Is it possible Copilot is using a network pretrained on a large text dataset that did contain the marketing blurb, then retrained for code prediction from GitHub source code? That might explain why it has memorized non-GitHub content (it's a bit of a reach though).
Have you tried using github search to see if it's popping up there?
Not until now, thanks! There are no matches. The claim contains my client's company name. And not even the company name yields any search results.
Copilot is quite good at substituting variables and names. The question is whether you can find the claim for another company.
I don't want to cast any aspersions but could this indicate that the marketing copy was "reused" from a public source?

Or perhaps the writer used the same phrases for two clients?

Unlikely. The claim (which I'm sorry I cannot post here for obvious reasons) is rather specific to my client's business which again is very specific in itself.
It's crazy to me that amongst the many things I need to create legal blankets over when hiring developers, I now need to worry about what IDE they are using because Copilot ostensibly has access to their (our) source code, files, and other proprietary information.
Have you told your client that you are making use of tools that upload the intellectual property they’ve shared with you / they’ve paid you to create for them, onto third-party servers?

If I was paying for work and the contractor was uploading the end result onto some sort of shared AI training set, we would not be working together for very long.

You may have brought up an excellent point that needs to be inserted into new legal contracts — either an opt in or out regarding the use of tools of various kinds that upload data into the cloud to “help” you. Maybe other companies would be okay with stuff like Copilot if it allows them to pay less money for developers who can’t write proper code without it, or something. I don’t know. I know that I want nothing to do with these sorts of systems, and I don’t want any of my code anywhere near it. I’ll definitely try to make sure nobody with access to my private repos has any of that nonsense enabled.

Maybe the legal version of Copilot can write an appropriate contract clause for us?

> Have you told your client that you are making use of tools that upload the intellectual property they’ve shared with you / they’ve paid you to create for them, onto third-party servers?

Related: Have any of the big tech companies banned their employees from using Copilot yet?

> uploading the end result onto some sort of shared AI training set

Does Copilot use the data it collects for training its AI? From their terms [1]:

> GitHub Copilot does not use these URLs, file paths, or snippets collected in your telemetry as suggestions for other users of GitHub Copilot. This information is treated as confidential information and accessed on a need-to-know basis.

[1] https://docs.github.com/en/github/copilot/github-copilot-tel...

> for other users

So someone just needs access to your account to get snippets from your private code.

> you are making use of tools that upload the intellectual property [...] onto third-party servers?

That's the whole point: if they did, it was NOT done willingly; they have the intellectual property stored in a completely separate folder, which should not be uploaded to GitHub.

The wider point is that you cannot trust these tools. You cannot trust these companies. It has been demonstrated numerous times.