Writing on an early prototype of Simulacrum Orbita

Nikos Antonio Kourous Vázquez
[Image: Simulacrum early loop]

The capabilities of machine learning models are fundamentally bound by the scope of their training datasets. Early in the development of machine learning, when training models was expensive and datasets were narrow, these limitations were stark. During this phase, it was easy to scrutinize machine learning and dismiss it as a mere tool, void of intelligence. As the capacity to train models increases and datasets grow, however, their functionalities multiply and our perspective shifts: a newfound sense of creativity and problem solving begins to transform machine learning models from mere tools into intelligent agents. This new perception of artificial intelligence risks over-mystifying machine learning, turning it into something magical, unexplainable and, therefore, unscrutinizable. AI is on track to reshape industries and society, so it needs to be scrutinized and deconstructed. To achieve this, the informational gap between sophisticated machine learning technology and the general public needs to be closed. The inner workings of artificial intelligence must be revealed, explained, and demystified through education and critical analysis. By doing so, these technologies can once again be seen as program-based tools, capable of being dismantled and reworked, rather than as omnipotent beings beyond human intervention.

The aim of this project is to deconstruct and exhibit the inner workings of machine learning, allowing one to observe and scrutinize its built-in limitations. My project consists of an endless feedback loop between two machine learning models: one capable of describing an image with text, and another capable of generating images from text. It begins with an initial generated image, say, of "a person eating ice cream." The image captioning model then attempts to describe this image. Shown a generated image of a girl eating ice cream, it might produce the caption "a girl is eating a muffin." This caption is then used as the next prompt and fed to the image generation model, which generates a new image that is slightly off-target and is therefore captioned differently again, as "a young boy is eating a hamburger." Since the image generation model is slightly outdated, it is incapable of accurately representing the text prompts, which throws off the captions that become the next prompts. Equally misleading, the captioning model will always generate a description, no matter how vague or well-realized the generated image is. Like a game of telephone, the exchange between these two models quickly becomes convoluted and strays from the initial prompt. As this process repeats itself, patterns begin to emerge, unveiling how the image and caption models work, what they're trained on, and the biases they hold.
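The feedback loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function name run_feedback_loop is hypothetical, and the two models are replaced by toy stand-ins (toy_generate and toy_caption) that hard-code the drift from the ice cream example so the sketch runs without any machine learning library.

```python
from typing import Callable, List, Tuple


def run_feedback_loop(
    initial_prompt: str,
    generate_image: Callable[[str], str],
    caption_image: Callable[[str], str],
    steps: int,
) -> List[Tuple[str, str]]:
    """Alternate between image generation and captioning.

    Returns one (prompt, caption) pair per iteration. Each caption
    becomes the prompt of the next iteration.
    """
    history = []
    prompt = initial_prompt
    for _ in range(steps):
        image = generate_image(prompt)    # text -> image
        caption = caption_image(image)    # image -> text
        history.append((prompt, caption))
        prompt = caption                  # feed the caption back in
    return history


# Toy stand-ins (assumptions for illustration only). A real setup would
# call an image generation model and an image captioning model here.
TOY_DRIFT = {
    "a person eating ice cream": "a girl is eating a muffin",
    "a girl is eating a muffin": "a young boy is eating a hamburger",
}


def toy_generate(prompt: str) -> str:
    # The "image" is just a tagged string that remembers its prompt.
    return "<image of: " + prompt + ">"


def toy_caption(image: str) -> str:
    # Recover the prompt, then apply the hard-coded miscaptioning.
    prompt = image[len("<image of: "):-1]
    return TOY_DRIFT.get(prompt, prompt)


history = run_feedback_loop(
    "a person eating ice cream", toy_generate, toy_caption, steps=3
)
for prompt, caption in history:
    print(prompt, "->", caption)
```

Passing the two models in as plain functions keeps the loop itself independent of whichever captioning and generation models are used, which is the part of the setup the text actually describes.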

The caption "two men in suits and ties standing next to each other" subtitles a vague image of two white men in black and blue suits. The next caption reads "two men in suits pose for a picture," again under a red-carpet-esque image of two white men in suits. This particular series began with the caption "a person is cooking in a kitchen" and traveled through images of vegetables on cutting boards, people sitting and eating at tables, and women in red dresses before arriving at men in suits. Once at men in suits, however, the program enters an endless cycle: the image generation model produces images of men in suits so accurately that the captioning model always describes them as men in suits, leaving no room for variation. This pattern occurs over and over again, no matter the initial prompt or subsequent journey. Some runs fall into the white-men-in-suits loop after 200 images, while others take as few as 4 iterations to succumb to it. As the images of men in suits accumulate, it is easy to notice that, although the text prompts don't suggest any specific age or ethnicity, the men depicted are almost always middle-aged to older white men. More generally, most of the people generated throughout the series are white. In addition to the white men in suits, another frequent feedback loop seen across multiple series consists of pictures of women in red dresses.
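One way to measure how quickly a run succumbs to such a loop is to look for the first stretch of consecutive captions that all contain the same phrase. The sketch below is an assumption about how this could be done, not the method used in the project: the function loop_entry_index and its min_run threshold are hypothetical, and the caption list is assembled from captions quoted in this text rather than taken from a recorded run.

```python
from typing import Optional, Sequence


def loop_entry_index(
    captions: Sequence[str], phrase: str, min_run: int = 3
) -> Optional[int]:
    """Return the index where a loop on `phrase` begins.

    A loop is taken to be at least `min_run` consecutive captions that
    contain `phrase`. Returns None if no such stretch exists.
    """
    phrase = phrase.lower()
    run_start = None
    for i, caption in enumerate(captions):
        if phrase in caption.lower():
            if run_start is None:
                run_start = i             # a new stretch begins here
            if i - run_start + 1 >= min_run:
                return run_start          # stretch is long enough
        else:
            run_start = None              # stretch broken, start over
    return None


# Illustrative sequence built from captions quoted in the text.
captions = [
    "a person is cooking in a kitchen",
    "a woman is sitting at a table with a plate of food",
    "two men in suits and ties standing next to each other",
    "two men in suits pose for a picture",
    "two men in suits and ties standing next to each other",
]

entry = loop_entry_index(captions, "men in suits", min_run=3)
print(entry)
```

Matching on a phrase rather than on identical captions reflects the observation that captions inside the loop vary slightly in wording while describing the same subject.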

The way the two machine learning models are set up, constantly cycling between each other, reveals the built-in flaws and biases of both. For the image generation model, the accuracy of the generated images depends on the quantity and quality of its dataset (Longo 2023). The model consistently produces accurate images when the subject is common in its dataset, while sparsely covered regions of the dataset produce vague and unrealistic renditions. At the same time, the range of possible prompts is limited to the descriptions known to the caption generation model from its own dataset. This dynamic means that the line of miscommunication between the two models continues until both simultaneously reach saturated peaks within their datasets. The nature of the images and captions in these loop states, and the frequency with which the models enter them, therefore reflect what both models are inherently biased towards and the data that makes up their training sets. With this in mind, the patterns and loops observed throughout runs, specifically those involving humans, reveal heavy biases regarding age, ethnicity, and gender. When prompts are vague and refer to the subject as "a person," the subsequent image generation and captioning assign their own age, ethnicity, and gender, likely suggested by the specific action or description. "A person is cooking in their kitchen" becomes "a woman is sitting at a table with a plate of food" and "a woman in a red jacket standing in snow" becomes "a man dressed in a red and white suit and red hat." The two models regularly assign attributes to people that reaffirm demographic stereotypes and gender norms.

The capability and effectiveness of machine learning models are fundamentally grounded in the complexity and size of their training datasets. As machine learning is increasingly implemented in systems and tools, the demand for ever larger datasets grows and accelerates (Sharma 2024). To meet this demand, researchers and companies have to look into the past to collect already established sources of information, since creating new information would be far too time-consuming and would not provide enough data points (Leffer 2023). This poses a serious question: if building larger, more capable artificial intelligence relies on pre-established knowledge from the past, how can we push forward new, modern perspectives and beliefs? How do we prevent artificial intelligence from repeating and reinforcing norms from the past that have already been broken? Human-made media is more capable of breaking away from conventions, as we are able to adapt, stay up to date, recognize forms of marginalization, find ways to break them, and establish new, more inclusive media. As artificial intelligence continues to proliferate, specifically in media through image, video, and text generation, it becomes imperative to develop methods and frameworks that allow AI to transcend the outdated biases, stereotypes, and norms ingrained in its training data. Modern society is built on the struggles of marginalized people who have fought for equal rights, inclusion, and representation. Artificial intelligence, with its ability to revolutionize and transform modern systems, should not backtrack on this progress and should instead continue the fight for a more equal and free society.

Written while developing prototype versions of Simulacrum Orbita (2025)

References

Leffer, L. (2023) 'Your Personal Information Is Probably Being Used to Train Generative AI Models', Scientific American, October 19. Available at: https://www.scientificamerican.com/article/your-personal-information-is-probably-being-used-to-train-generative-ai-models/
Longo, C. (2023) 'A Data Scientist's Guide to using Image Generation Models', Medium, December 25. Available at: https://statistician-in-stilettos.medium.com/a-data-scientists-guide-to-using-image-generation-models-58655f97b6fc
Sharma, R. (2024) Introduction to Deep Learning: Lecture 20, Large Language Models. 11-785, Spring 2024