Published on Oct 23, 2023

by Osledy Bazo

Use AI to create a story from an image.

Every image has a tale to tell, but what if we could unveil those hidden narratives with the power of artificial intelligence? We’re here to explore the intersection of computer vision and natural language processing. This tutorial delves into the practical application of Hugging Face AI models to turn images into captivating stories.

As we navigate this tutorial, you’ll discover how AI can discern the subtle nuances within an image, extract meaningful information, and weave it into eloquent narratives.

Understanding Hugging Face Models

Hugging Face, a prominent name in the realm of artificial intelligence, is your gateway to a wide array of pre-trained language models and transformer architectures. Before we delve into the mechanics of creating stories from images, it’s crucial to understand why Hugging Face models are pivotal for this task.

  • Hugging Face stands out for its versatility. The platform offers access to various pre-trained models that can tackle diverse natural language processing tasks. These models are widely recognized and adopted across a multitude of applications.

  • Hugging Face adopts an open-source and community-driven approach, allowing AI enthusiasts, developers, and researchers to contribute to model development and share their innovations. This collaborative spirit ensures that the models continuously evolve and improve.

  • Hugging Face models are designed for ease of use. Their straightforward, intuitive APIs cater to experienced developers and newcomers alike; whether you’re a Python expert or just getting started, you can be productive quickly, as the short example after this list shows.

  • Hugging Face models have set benchmarks in various natural language understanding and generation tasks. They are renowned for their outstanding performance, making them a preferred choice for achieving high-quality results.
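
As a quick taste of that simplicity, a working sentiment classifier is three lines with the pipeline API (a minimal sketch; the pipeline downloads a small default model on first use):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model on first run
print(classifier("Hugging Face models are easy to use."))
# [{'label': 'POSITIVE', 'score': 0.99...}]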

Components for Image-to-Story Conversion

In the journey to create stories from images using Hugging Face AI models, we’ll be breaking down the process into three crucial components. These components form the backbone of our image-to-story conversion pipeline.

  • Our first step is to teach the machine to understand the scenario presented by an image. This process, known as image captioning, involves translating visual content into text descriptions. It’s the foundation upon which we build our storytelling magic.

  • To achieve this, we’ll leverage computer vision models that can analyze the image’s content and generate descriptive text. These models have been trained on vast datasets, allowing them to recognize objects, scenes, and context within images. The result is a textual representation of the image, which serves as the starting point for our storytelling journey.

  • With the image content converted into text, we move on to the storytelling phase. Here, we introduce the star of the show: Large Language Models (LLMs). These models are the creative minds behind the stories we’ll be generating. LLMs are massive neural networks that have been trained on extensive text corpora, making them masters of language. They can take a textual prompt, like the description of an image, and continue the narrative with engaging and coherent text. In our case, they will transform the image caption into a short story. The beauty of LLMs lies in their ability to generate contextually relevant and imaginative text. They provide a canvas upon which we can paint captivating narratives, all powered by the magic of artificial intelligence.

  • Our storytelling journey doesn’t stop at written words. We want our stories to come to life, and that’s where text-to-speech models enter the scene. These models can convert our written stories into spoken words, generating audio that adds an extra layer of immersion to the experience. Text-to-speech models have made remarkable strides recently, offering human-like voice synthesis. They take the text output from our language model and turn it into an audio file that can be played and enjoyed. With this component, we bridge the gap between written and spoken storytelling, bringing our narratives to the ears of our audience.

Practical Example

For the first part, image to text, we are going to use the Salesforce/blip-image-captioning-large model.

The Salesforce/blip-image-captioning-large model is a powerful tool for image captioning. It’s been pre-trained on extensive datasets, making it adept at recognizing objects, scenes, and context within images. This model serves as our bridge between the visual and textual worlds.

You’ll need a Hugging Face account and a token for accessing the API.

First create a .env file and put your Hugging Face token there:

HUGGINGFACEHUB_API_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
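
To confirm the token is picked up, you can run a quick sanity check with python-dotenv (assuming the .env file sits next to your script):

import os
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())  # locate and read the .env file
print(os.getenv("HUGGINGFACEHUB_API_TOKEN") is not None)  # should print True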

Create an app.py file where we will be working:

from dotenv import find_dotenv, load_dotenv
from transformers import pipeline

load_dotenv(find_dotenv())

# image to text
def image_2_text(url):
    image_to_text = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")
    text = image_to_text(url)
    print(text)
    return text[0]["generated_text"]  # unwrap the caption string

image_2_text('photo.jpg')

When you execute this, you should see a short description of the photo:

[{'generated_text': 'a castle with a tower'}]
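
The pipeline returns a list with one dictionary per input image, which is why the function above unwraps the first result and returns just the caption string:

result = [{'generated_text': 'a castle with a tower'}]
caption = result[0]['generated_text']  # 'a castle with a tower'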

For the second part, we’ll use the HuggingFaceH4/zephyr-7b-alpha model to generate a short story based on the caption.

Add to the app.py file:

import torch

def generate_story(scenario):
    # device_map="auto" places the model on a GPU when one is available
    pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha", torch_dtype=torch.float16, device_map="auto")

    messages = [
        {
            "role": "system",
            "content": "You are a story teller; You can generate a short story based on a single narrative, the story should be no more than 30 words long",
        },
        {"role": "user", "content": str(scenario)},
    ]
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    print(outputs[0]["generated_text"])
    return outputs[0]["generated_text"]

Running this on our caption yields a short story such as:

In a far-off land, stood a castle grand,
With a tower so high, it touched the sky's command.
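
One detail worth noting before the next step: outputs[0]['generated_text'] contains the full prompt as well as the model’s reply. Zephyr’s chat template wraps each message in role markers, so the printed string looks roughly like this (the exact formatting comes from the model’s tokenizer configuration):

<|system|>
You are a story teller; ...</s>
<|user|>
a castle with a tower</s>
<|assistant|>
In a far-off land, stood a castle grand, ...

That <|assistant|> marker is why the text-to-speech step below splits the output on it, keeping only the story.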

For the last part, text to speech, we are using the espnet/kan-bayashi_ljspeech_vits model through the Hugging Face Inference API:

import os
import requests

HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

def substring_after(s, delim):
    # return everything after the first occurrence of delim
    return s.partition(delim)[2]

def text_to_speech(message):
    API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"
    headers = {"Authorization": f"Bearer {HUGGINGFACEHUB_API_TOKEN}"}
    payloads = {
        # keep only the model's reply, not the chat prompt
        "inputs": substring_after(message, "<|assistant|>")
    }
    response = requests.post(API_URL, headers=headers, json=payloads)

    with open('audio.flac', 'wb') as file:
        file.write(response.content)
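
With all three functions in place, the whole flow from image to audio is three calls (assuming photo.jpg exists in your working directory):

scenario = image_2_text('photo.jpg')  # 1. caption the image
story = generate_story(scenario)      # 2. expand the caption into a story
text_to_speech(story)                 # 3. write the narration to audio.flac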

The script saves the spoken story to audio.flac, ready to play back.

To execute the code via Streamlit, here is the complete code, which is also available in our repository: https://github.com/uokesita/ImageToStory

from dotenv import find_dotenv, load_dotenv
from transformers import pipeline
import torch
import requests
import os
import streamlit as st

load_dotenv(find_dotenv())
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
# image to text

def image_2_text(url):
    image_to_text = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")
    text = image_to_text(url)

    # print(text)
    return text[0]["generated_text"]  # unwrap the caption string

# story generation
def generate_story(scenario):
    pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha", torch_dtype=torch.float16, device_map="auto")

    messages = [
        {
            "role": "system",
            "content": "You are a story teller; You can generate a short story based on a single narrative, the story should be no more than 30 words long",
        },
        {"role": "user", "content": str(scenario)},
    ]
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    print(substring_after(outputs[0]["generated_text"], "<|assistant|>"))

    return substring_after(outputs[0]["generated_text"], "<|assistant|>")

def substring_after(s, delim):
    return s.partition(delim)[2]

def text_to_speech(message):
    API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"
    headers = {"Authorization": f"Bearer {HUGGINGFACEHUB_API_TOKEN}"}
    payloads = {
        "inputs": message
    }
    response = requests.post(API_URL, headers=headers, json=payloads)

    with open('audio.flac', 'wb') as file:
        file.write(response.content)


def main():
    st.set_page_config(page_title="Use AI to create a story from an image")
    st.header("Use AI to create a story from an image")

    uploaded_file = st.file_uploader("Choose an image", type="jpg")

    if uploaded_file is not None:
        print(uploaded_file)
        bytes_data = uploaded_file.getvalue()
        with open(uploaded_file.name, "wb") as file:
            file.write(bytes_data)
        st.image(uploaded_file, caption="Uploaded Image", use_column_width=True)
        scenario = image_2_text(uploaded_file.name)
        story = generate_story(scenario)
        text_to_speech(story)

        with st.expander("scenario"):
            st.write(scenario)
        with st.expander("story"):
            st.write(story)

        st.audio("audio.flac")

if __name__ == '__main__':
    main()

You can run it with:

streamlit run app.py
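
The script assumes its dependencies are already installed. The package names below are inferred from the imports (accelerate is required by device_map="auto"):

pip install streamlit transformers torch accelerate python-dotenv requests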

Examples and Use Cases

Let’s explore some real-world examples and use cases of image-to-story conversion using the Hugging Face models introduced above.

  • Imagine you’re a social media manager for a travel company. You can use image-to-story conversion to automatically generate captivating captions and stories for your travel photos. This saves time and increases engagement with your audience.

  • Image-to-text conversion has profound implications for accessibility. By providing textual descriptions of images, we make digital content more inclusive for visually impaired individuals who rely on screen readers to access information.

  • Content creators and marketers can leverage image-to-story conversion to breathe life into their visual content. Whether it’s generating product descriptions from images or crafting engaging narratives for advertising campaigns, the possibilities are vast.

  • Educators can use this technology to create educational materials that bridge the gap between visual and textual content. Images in textbooks, for example, can be accompanied by automatically generated explanations.

  • Journalists can use image captioning to briefly summarize the content of news images, making it easier to search and categorize visual content in news archives.

  • In the realm of art, this technology can be used to generate artistic descriptions for paintings and visual artworks, adding a layer of interpretation to the viewer’s experience.

Conclusion

As we wrap up this journey into the fascinating world of image-to-story conversion, let’s take a moment to recap what we’ve learned and the possibilities that lie ahead.

  • With the help of Hugging Face models like Salesforce/blip-image-captioning-large and HuggingFaceH4/zephyr-7b-alpha, we’ve unlocked the power to transform images into captivating stories. This technology empowers us to unleash our creativity and bring stories to life in ways we couldn’t have imagined before.

  • Image-to-text conversion has a significant impact on accessibility, content creation, marketing, education, journalism, and more. It’s a versatile tool that opens up new possibilities in various fields, making it easier to bridge the gap between visuals and text.

  • The world of AI and natural language processing is continuously evolving. As you embark on your own image-to-story adventures, remember that innovation knows no bounds. Explore, experiment, and push the boundaries of what’s possible with AI.
