

Ask HN: Any indications Copilot scans your local files?
source link: https://news.ycombinator.com/item?id=29150011

Strange thing is the claim is not public and is not contained in the codebase I'm currently working on, nor in any codebase I know of, and to my knowledge I'm the only related dev using Copilot. It's also not listed on Google, so I assume it hasn't leaked somewhere else that could be indexed by GH (which is an assumption of course, but appears likely). But it can be found in a completely separate local folder with project assets that is not published on GH.
The marketing claim is about the length of a tweet and is not exactly generic. It requires understanding of my client's business (which, again, cannot be derived from my codebase). So it's not GPT-3 output that matches coincidentally.
The GH "About GitHub Copilot Telemetry" page [1] does not indicate that your local file system is scanned, though.
Can anyone explain this, or has anyone observed a similar phenomenon?
[1] https://docs.github.com/en/github/copilot/about-github-copilot-telemetry
From the terms of service [1] (which I'm sure everyone reads):
> when you edit files with the GitHub Copilot extension/plugin enabled, file content snippets [...] will be shared with GitHub, Microsoft, and OpenAI, and used for diagnostic purposes to improve suggestions and related products. GitHub Copilot relies on file content for context, both in the file you are editing and potentially other files open in the same IDE instance.
[1] https://docs.github.com/en/github/copilot/github-copilot-tel...




> which may unintentionally contain user content, such as parts of a file you were using when the problem occurred
[1] https://docs.microsoft.com/en-us/windows/privacy/configure-w...


Allowing this to happen would've been incredibly stupid. But it's worth investigating further.

Even if they were scanning all the files, Copilot's terms [1] explicitly say they do not use the data to provide suggestions for other users.
> GitHub Copilot does not use these URLs, file paths, or snippets collected in your telemetry as suggestions for other users of GitHub Copilot. This information is treated as confidential information and accessed on a need-to-know basis.
[1] https://docs.github.com/en/github/copilot/github-copilot-tel...

E.g., I just prompted GPT-2 with a made-up company name that doesn't have any Google search results and got a completion like this:
<p><em>Fully featured webapp for social & mobile networks in the cloud</em></p>
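For anyone who wants to reproduce that kind of test, here's a minimal sketch using the public gpt2 checkpoint via Hugging Face transformers (the company name in the prompt is made up for illustration, not the one from this thread):

```python
# Minimal sketch: see what a small public model completes for a made-up company name.
# Assumes `pip install transformers torch`; "Quuxify Labs" is a hypothetical name.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Quuxify Labs is a startup that"
for out in generator(prompt, max_new_tokens=40, num_return_sequences=3, do_sample=True):
    print(out["generated_text"])
    print("---")
```

Plausible-sounding marketing copy for an unknown name is well within what sampling like this produces, which is why the OP's point about client-specific details is the interesting part.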
FYI: It's a firehose, but you should be able to filter it down to copilot and a path prefix. Then you'll know if it's being scanned.
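If you end up capturing that firehose to a plain-text log (with whatever monitoring tool you prefer), filtering it down is trivial; the log name, the "copilot" match string, and the path prefix below are placeholders to adapt to what your tool actually emits:

```python
# Rough sketch: reduce a captured event log to lines that mention the Copilot
# process and a path prefix outside your workspace. "events.log" and the
# prefix are hypothetical; substitute whatever your monitoring tool writes.
PREFIX = "/Users/me/clients/acme/assets"

with open("events.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        if "copilot" in line.lower() and PREFIX in line:
            print(line.rstrip())
```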



1) Put a new blurb in the same folder where you had that marketing blurb. Make sure it is unique but still looks like English. Example: now your startup allows you to fly to the moon in rockets powered by angry tweets. (A rough sketch of this step is below.)
2) Try to force a reload of Copilot. Maybe reinstall it?
3) Recreate the conditions that produced the first blurb, but this time try to get it to suggest the second one.
4) Share the results, I am curious.
If you get that suggestion, you have proof and reproducible steps for others to get proof. If you don't get that suggestion, we can't be sure, but the odds of it being just a coincidence increase.
Good luck!
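Here's the rough sketch of step 1, assuming a local-only assets folder (the path and wording are made up); it embeds a random id so a later verbatim suggestion could only have come from this file:

```python
# Sketch for step 1: write a unique but plausible-looking canary blurb into the
# (unpublished) local assets folder. Path and phrasing are hypothetical.
import uuid
from pathlib import Path

ASSETS_DIR = Path("~/clients/acme/assets").expanduser()  # local-only folder, not on GitHub

canary_id = uuid.uuid4().hex[:8]
blurb = (
    f"Moonward {canary_id}: fly your startup to the moon "
    "in rockets powered entirely by angry tweets."
)

ASSETS_DIR.mkdir(parents=True, exist_ok=True)
(ASSETS_DIR / "marketing_canary.txt").write_text(blurb + "\n", encoding="utf-8")
print("Canary written:", blurb)
```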



I was writing a Twitter API wrapper and had a hardcoded access token for a test user. When I was testing a function that requires a user id, Copilot suggested one. I searched for the id and it belonged to my test user. I searched my entire codebase but couldn't find any place where I had used that particular id. The only place it could have been extracted from would have been the access token, which has the user id as part of the string (which I hadn't noticed before this). Either this is a common code pattern or I don't know how to process this.
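For context, Twitter's OAuth 1.0a user access tokens are, as far as I know, of the form "<numeric user id>-<random part>", so the id sits in plain sight at the start of the token. A quick sketch with a fabricated token:

```python
# Twitter OAuth 1.0a access tokens look like "<user_id>-<random part>",
# so the numeric user id can be read straight off the token.
# The token below is fabricated for illustration.
access_token = "1234567890-AbCdEfGhIjKlMnOpQrStUvWxYz0123456789abcd"

user_id = access_token.split("-", 1)[0]
print(user_id)  # -> "1234567890"
```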

I was creating a unit test in a Go codebase. I had dumped the JSON that I was going to be decoding at the top of the file, and when I started writing the assertions, Copilot was very quick to use the data from the JSON, with quite high accuracy, based solely on me typing which ticker I was going to assert against.





The dangers of just regurgitating what has been read are unreal, since with good enough targeting you can read the data someone else wrote and expected to be anonymized. It's like a huge global RAM of code; you just need to figure out how to get it to point at the right addresses.


EDIT: The user would perhaps be the direct infringer as the person who made a copy of the code, even unwittingly.
Source: current law student taking copyright and writing a paper about Copilot.
[0] https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....
[1] https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....


Other stuff within copyright makes infringement of a small Copilot snippet less likely too, and the recent Google v. Oracle decision reinforced the fact that code is an awkward fit within copyright and may sometimes get less protection than things like fictional books. Fair use also comes in here; remember that something as "egregious" as Google copying 37 Java packages' worth of package / class / method declarations into Android was held a fair use. Declaring code is sort of special because it's more like an interface, but the Court really reinforced the notion that it likes cool new technology that opens new markets, makes new products, etc., and a fair use finding for small snippets of code is in line with adapting copyright doctrine in the face of tech changes like AI / ML.
Are you sure the AI couldn't get some context from the current file? No title tag in the head, no description nor keywords? The filename/path is also used; could it be that?
Just out of curiosity, did you seek permission to use Copilot from your client? I wonder how widely accepted it is in roles which handle sensitive data.
I have Copilot enabled in a single workspace and tried some unique keywords from other projects (where Copilot is not enabled) and it could not generate anything similar.


Or perhaps the writer used the same phrases for two clients?

If I was paying for work and the contractor was uploading the end result onto some sort of shared AI training set, we would not be working together for very long.
You may have brought up an excellent point that needs to be inserted into new legal contracts: either an opt-in or an opt-out regarding the use of tools of various kinds that upload data into the cloud to "help" you. Maybe other companies would be okay with stuff like Copilot if it allows them to pay less money for developers who can't write proper code without it, or something. I don't know. I know that I want nothing to do with these sorts of systems, and I don't want any of my code anywhere near them. I'll definitely try to make sure nobody with access to my private repos has any of that nonsense enabled.
Maybe the legal version of Copilot can write an appropriate contract clause for us?

Related: Have any of the big tech companies banned their employees from using Copilot yet?

Does Copilot use the data it collects to train its AI? From the terms [1]:
> GitHub Copilot does not use these URLs, file paths, or snippets collected in your telemetry as suggestions for other users of GitHub Copilot. This information is treated as confidential information and accessed on a need-to-know basis.
[1] https://docs.github.com/en/github/copilot/github-copilot-tel...

So someone just needs access to your account to get snippets from your private code.

That's the whole point: if they did, it was NOT willingly. They have the intellectual property stored in a completely separate folder, which should not be uploaded to GitHub.

