Cryso Agori
Lol, twitter artists tryna say gotcha with this paper when
4.2.1 Extraction Methodology

Our extraction approach adapts the methodology from prior work [11] to images and consists of two steps:
1. Generate many examples using the diffusion model in the standard sampling manner and with the known prompts from the prior section.
2. Perform membership inference to separate the model's novel generations from those that reproduce memorized training examples.
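The two-step attack can be sketched as follows. This is an illustrative simplification, not the paper's actual pipeline: `generate` is a stand-in for black-box access to the model, images are flat lists of pixel values in [0, 1], and the clustering heuristic (tightly clustered generations suggest a memorized image) stands in for the full membership-inference procedure.

```python
from itertools import combinations

def l2(a, b):
    # Normalized l2 distance between two flat pixel vectors in [0, 1].
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

def mean_pairwise_l2(images):
    # Mean distance over all pairs in a set ("clique") of generations.
    pairs = list(combinations(images, 2))
    return sum(l2(a, b) for a, b in pairs) / len(pairs)

def extraction_attack(prompts, generate, n_samples=500, threshold=0.15):
    # Step 1: query the model many times per prompt (black-box sampling).
    # Step 2: flag prompts whose generations cluster tightly, since
    # near-identical samples suggest a memorized training example.
    suspects = []
    for prompt in prompts:
        images = [generate(prompt) for _ in range(n_samples)]
        score = mean_pairwise_l2(images)
        if score < threshold:
            suspects.append((prompt, score))
    return suspects
```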
Generating many images. The first step is trivial but computationally expensive: we query the Gen function in a black-box manner using the selected prompts as input. To reduce the computational overhead of our experiments, we use the timestep-resampled generation implementation available in the Stable Diffusion codebase [58]. This process generates images more aggressively by removing larger amounts of noise at each time step, trading slightly lower visual fidelity for a significant (∼10×) speedup. We generate 500 candidate images for each text prompt to increase the likelihood of finding memorization.
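The core idea behind timestep resampling is to keep only an evenly spaced subset of the full denoising schedule, so each retained step removes a larger chunk of noise. A minimal sketch of that idea (a hypothetical helper, not the actual Stable Diffusion implementation):

```python
def resample_timesteps(schedule, n_steps):
    """Evenly subsample a full denoising schedule down to n_steps.

    A 1000-step schedule cut to 100 steps yields roughly a 10x speedup,
    at the cost of some visual fidelity, because each retained step must
    remove ~10 steps' worth of noise.
    """
    stride = len(schedule) / n_steps
    return [schedule[int(i * stride)] for i in range(n_steps)]
```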
In order to evaluate the effectiveness of our attack, we select the 350,000 most-duplicated examples from the training dataset and generate 500 candidate images for each of these prompts (totaling 175 million generated images). We first sort these generated images by the mean distance between images in the clique to identify generations that we predict are likely to be memorized training data. We then annotate each of these generated images as either "extracted" or "not extracted" by comparing it to the training images under Definition 1. We find 94 images are (ℓ2, 0.15)-extracted. To ensure that these images not only match some arbitrary definition, we also manually annotate the top-1000 generated images as either memorized or not memorized by visual analysis, and find that a further 13 (for a total of 109 images) are near-copies of training examples even if they do not fit our ℓ2 definition. Figure 3 shows a subset of the extracted images that are reproduced with near pixel-perfect accuracy; all have an ℓ2 difference under 0.05. (As a point of reference, re-encoding a PNG as a JPEG with quality level 50 results in an ℓ2 difference of 0.02 on average.)

Given our ordered set of annotated images, we can also compute a curve relating the number of extracted images to the attack's false positive rate. Our attack is exceptionally precise: out of 175 million generated images, we can identify 50 memorized images with 0 false positives, and all our memorized images can be extracted with a precision above 50%. Figure 4 contains the precision-recall curve for both memorization definitions.

Measuring (k, ℓ, δ)-eidetic memorization. In Definition 2 we introduced an adaptation of Eidetic memorization [11] tailored to the domain of generative image models. As mentioned earlier, we compute similarity between pairs of images with a direct ℓ2 pixel-space distance.
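A Definition-1-style extraction check can be sketched as a nearest-neighbor test: a candidate counts as extracted if it lies within normalized ℓ2 distance δ of at least one training image. The function name and flat-vector image representation below are illustrative assumptions, not the paper's code.

```python
def l2(a, b):
    # Normalized l2 distance between two flat pixel vectors in [0, 1].
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

def is_extracted(candidate, training_images, delta=0.15):
    # (l2, delta)-extraction check: the candidate is "extracted" if it
    # matches some training image to within distance delta.
    return any(l2(candidate, t) <= delta for t in training_images)
```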
This analysis is computationally expensive, as it requires comparing each of our memorized images against each of the 160 million training examples. We set δ = 0.1: this threshold is tight enough to identify almost all small image corruptions (e.g., JPEG compression, small brightness/contrast adjustments) while producing very few false positives. Figure 5 shows the results of this analysis. While we identify little Eidetic memorization for k < 100, this is expected given that we chose prompts of highly-duplicated images. Note that even at this level of duplication, the duplicated examples still make up just one in a million training examples. These results show that duplication is a major factor behind training data extraction.
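Estimating k for this analysis amounts to counting how many training examples fall within δ of the extracted image, so that JPEG-level corruptions of the same picture are merged while distinct pictures are not. A minimal sketch under the same assumptions as before (flat pixel vectors, hypothetical helper names):

```python
def l2(a, b):
    # Normalized l2 distance between two flat pixel vectors in [0, 1].
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

def duplication_count(image, training_images, delta=0.1):
    # k for the (k, l2, delta)-eidetic analysis: the number of training
    # examples within delta of the extracted image. Small corruptions
    # (~0.02 for JPEG re-encoding) pass; distinct images do not.
    return sum(1 for t in training_images if l2(image, t) <= delta)
```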