Turn any Wikipedia article into a video, automatically

TLDR: I built a tool to automatically generate videos from Wikipedia pages. Check the video below or the YouTube channel if you want to see the end result or keep reading for an overview of how it works.

Automatically generated video about Florence

I’ve always been interested in automatic content creation, be it text or pictures (e.g., fractal art and similar). Just recently, though, I found myself brainstorming every so often about what kind of videos could be automatically generated with decent enough quality (for a very generous definition of decent). After several ideas that were mostly too complex to implement in a reasonable amount of time, I realized that there could be a way to create engaging videos automatically, based on the content of a Wikipedia page.

Wikipedia is perfect because every major topic is usually covered in great detail, and there are hundreds of such topics so that our approach can be tested in a variety of settings.

The basic idea is to create a video that will illustrate the content of the Wikipedia page: the audio track will be made of its text while the video part will be a slideshow of pictures of the topic at hand, extracted from Wikipedia and other sources. As I started experimenting and building the system, I came across several issues and possible improvements to this basic concept, so keep reading if I’ve piqued your interest!

1. Design of the system

Our only requirement is that the system work with any Wikipedia page, with no more user input than the title of the page itself. It will output an mp4 file containing the final video, with no human intervention. The system will work in the following way:

  1. Get a Wikipedia page in input and retrieve its content;
  2. Parse it to define the structure of the video: we want to divide the video into multiple sections following the structure of the page;
  3. Retrieve any other additional information from e.g., Wikidata and pictures/videos from Pixabay;
  4. Build a storyboard that will show us what the final video will look like before rendering it;
  5. Render the audio with a text-to-speech engine (I used Amazon Polly);
  6. Render the video with ffmpeg.

Here are some pieces of tech that I used to achieve this:

  • Wikipedia APIs: used to retrieve the content of the Wikipedia page;
  • Wikidata APIs: used to extract additional information about the current page and display it in the video;
  • Pixabay APIs: free, open-domain image/video search APIs, to retrieve related content to show;
  • Amazon Polly: text-to-speech engine to render the audio;
  • ffmpeg: used for rendering the video itself.

I built everything with Java, so a bunch of other libraries have been used (e.g., Jsoup for HTML parsing) but I will not go into language-specific details and rather show the system at a high level.

1.1 Retrieve Wikipedia page content

In itself, retrieving the content of a page is very easy with Wikipedia APIs. The only thing we need to take into account is that we will need it for two purposes:

  1. Read the text out loud using TTS;
  2. Extract the structure of the page (read next section to know why)

We can get the content in three different formats: plaintext, Wiki markup language or HTML. The plaintext version is definitely the one we will use for the TTS engine, as we do not want it to read any markup. As for extracting the structure, parsing Wiki markup language is hell, so we’ll go with HTML. Thus, for each page, we will make two requests: one to get its plaintext and one for the HTML version.
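A sketch of the two requests, assuming the standard MediaWiki API modules (`prop=extracts` from the TextExtracts extension for plaintext, `action=parse` for HTML); the method names are mine, not from the actual codebase:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WikiApi {
    static final String API = "https://en.wikipedia.org/w/api.php";

    // Plaintext extract, fed to the TTS engine (TextExtracts module).
    static String plaintextUrl(String title) {
        return API + "?action=query&prop=extracts&explaintext=1&format=json&titles="
                + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    // Parsed HTML, used to recover the section structure.
    static String htmlUrl(String title) {
        return API + "?action=parse&prop=text&format=json&page="
                + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }
}
```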

1.2 Parse Wikipedia page content and define video structure

Every Wikipedia article is divided into sections and sub-sections. As the name implies, a section can contain other sub-sections, as happens for “History” in the following screenshot (“Earliest history”, “Legend of the founding of Rome” and “Monarchy and republic” are all sub-sections of “History”).

Wikipedia page for Rome with highlighted sections

Our video will be divided into sections that match those displayed in the page. In particular, we will only take sub-sections into account. While the text for each sub-section is being spoken, we will display its name in the video.

For this reason, we want to separately extract each section and its text, rather than having just one huge blob of content. Besides this, we also want to extract images and link them to a specific sub-section, in order to show pictures that are more relevant to what is being talked about. Last but not least, some Wikipedia pictures have a caption that we will want to display as well, instead of just the bare picture.

All of this stuff can be done by parsing the HTML version of the page. At the end of this process we have a list of sections with text and pictures (possibly with captions).
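The real implementation uses Jsoup, as mentioned earlier; as an illustration of the idea, here is a deliberately simplified, stdlib-only sketch that cuts the HTML at `<h2>`/`<h3>` headings (real Wikipedia HTML is messier than this, so treat it as a toy version):

```java
import java.util.*;
import java.util.regex.*;

public class SectionParser {
    record Section(int level, String title, String html) {}

    // Split a page's HTML into sections at <h2>/<h3> headings.
    static List<Section> split(String html) {
        List<Section> sections = new ArrayList<>();
        Matcher m = Pattern.compile("<h([23])[^>]*>(.*?)</h\\1>", Pattern.DOTALL).matcher(html);
        int lastEnd = -1;
        int lastLevel = 0;
        String lastTitle = null;
        while (m.find()) {
            if (lastTitle != null)
                sections.add(new Section(lastLevel, lastTitle, html.substring(lastEnd, m.start())));
            lastLevel = Integer.parseInt(m.group(1));
            lastTitle = m.group(2).replaceAll("<[^>]+>", "").trim(); // strip inner tags
            lastEnd = m.end();
        }
        if (lastTitle != null)
            sections.add(new Section(lastLevel, lastTitle, html.substring(lastEnd)));
        return sections;
    }
}
```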

Note: Wikipedia pages can be very long, resulting in videos that are more than 1 hour long. To avoid this, I also applied a very simple text summarization algorithm that removes sentences until the result reaches the desired length.
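The article doesn’t detail which sentences get dropped; the simplest possible variant of such a length-based summarizer just keeps sentences from the start until a character budget is hit:

```java
public class Summarizer {
    // Keep sentences from the beginning until the text fits maxChars.
    static String shorten(String text, int maxChars) {
        if (text.length() <= maxChars) return text;
        // Very rough sentence splitting; fine for a sketch.
        String[] sentences = text.split("(?<=[.!?])\\s+");
        StringBuilder out = new StringBuilder();
        for (String s : sentences) {
            if (out.length() + s.length() + 1 > maxChars) break;
            if (out.length() > 0) out.append(' ');
            out.append(s);
        }
        return out.toString();
    }
}
```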

1.3 Retrieving additional content from Wikidata and Pixabay

Wikidata allows us to extract structured information for Wikipedia pages. If you are not familiar with Wikidata, just check its Rome entry to see what kind of information is available. It’s a lot of stuff!

We can use this to enrich our video with additional information. As an example, we are going to retrieve the flag of an entity and show it in the video as an overlay. Clearly, this will not show anything for entities that do not have a flag image property, but we can easily implement more properties using the same process.
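As a sketch of how the flag lookup could work (the flag image is Wikidata property P41; the JSON parsing is omitted here, only the request URLs are shown, and the helper names are my own):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WikidataFlag {
    // Ask Wikidata for the P41 (flag image) claims of an entity.
    static String claimsUrl(String entityId) {
        return "https://www.wikidata.org/w/api.php?action=wbgetclaims&format=json"
                + "&property=P41&entity=" + entityId;
    }

    // P41 values are Commons file names, e.g. "Flag of Florence.svg";
    // Special:FilePath turns one into a downloadable URL.
    static String commonsFileUrl(String fileName) {
        return "https://commons.wikimedia.org/wiki/Special:FilePath/"
                + URLEncoder.encode(fileName, StandardCharsets.UTF_8).replace("+", "%20");
    }
}
```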

Unfortunately, pictures from Wikipedia are often not enough for a video: they are frequently not of the best quality, and several pages have just a few of them, so our video would basically consist of 3 or 4 pictures lasting 2 minutes each. Pretty boring, huh? Fortunately, Pixabay has a fantastic API that allows us to retrieve free, open-domain pictures and videos based on a keyword search. With this, we can considerably increase the number of pictures at our disposal and include videos as well!

We will query the Pixabay API and retrieve pictures for each section that we extracted in the previous step. To try and get more relevant pictures, we will include the section title in the search query. For example, we will search for “Rome History” instead of just “Rome” when retrieving pictures for the “History” section. If there are 0 results, we will fall back to the results for “Rome”. The same process is applied to videos.
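The query-with-fallback logic boils down to something like this (the exact Pixabay parameters the project uses are an assumption on my part; only `key` and `q` are strictly required by their API):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PixabaySearch {
    static String searchUrl(String apiKey, String query) {
        return "https://pixabay.com/api/?key=" + apiKey + "&safesearch=true&q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    // Try "<page title> <section title>" first; if that search returned
    // nothing, fall back to the page title alone.
    static String pickQuery(String pageTitle, String sectionTitle, int hitsForCombined) {
        return hitsForCombined > 0 ? pageTitle + " " + sectionTitle : pageTitle;
    }
}
```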

1.4 Build a storyboard

We have a bunch of pictures, videos, and Wikipedia page sections. At this point, we can visualize what the video will look like by building a simple HTML page that will give us a preview.

An example storyboard. Can you guess the city from the pictures? Images with a blue border actually represent videos.

1.5 Text-to-speech with Amazon Polly

This has really been one of the simplest steps in the process. For each section, we send its text to Amazon Polly and save the output to an mp3 file that will be used as the audio track. I tried several types of voices and decided that the neural voice with a “conversational” style was the one that sounded best.

To enable this setting, you need to send SSML instead of plaintext in your request. It will look like this:

<speak>
    <amazon:domain name="conversational">Hello, this is a test!</amazon:domain>
</speak>

Just take care of escaping the text correctly if you don’t want to receive an InvalidSsml exception.
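The escaping itself is the standard XML one; a minimal helper that wraps escaped text in the SSML shown above looks like this:

```java
public class Ssml {
    // Escape the XML special characters that would otherwise break the SSML.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wrap escaped text in the conversational-style SSML envelope.
    static String conversational(String text) {
        return "<speak><amazon:domain name=\"conversational\">"
                + escape(text) + "</amazon:domain></speak>";
    }
}
```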

1.6 Rendering the video with ffmpeg

Now comes the trickiest part. Having no prior experience with ffmpeg I had to rely heavily on a series of very valuable StackOverflow questions.

What the system does is basically generate a very long ffmpeg command that will create the video from the input images, videos and audio tracks. This includes combining everything: the flag overlay that I mentioned earlier, the actual pictures/videos, and the text overlays, like the name of the current section.

I’ve also managed to create simple effects like fade in/out between images and slight zoom motion for images.
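I haven’t reconstructed the author’s actual command; as a hedged sketch, assembling a single-image segment with a slow zoom (ffmpeg’s `zoompan` filter) and fade in/out could look like this, with file names, resolution and filter parameters all being placeholders:

```java
import java.util.List;

public class FfmpegCommand {
    // Build an ffmpeg invocation that loops one image for the duration of its
    // audio, applying zoompan plus fade-in/fade-out (25 fps assumed).
    static List<String> forImage(String image, String audio, int seconds) {
        String filters = "zoompan=z='min(zoom+0.0015,1.2)':d=" + (seconds * 25)
                + ":s=1280x720,fade=t=in:st=0:d=1,fade=t=out:st=" + (seconds - 1) + ":d=1";
        return List.of("ffmpeg", "-loop", "1", "-t", String.valueOf(seconds),
                "-i", image, "-i", audio,
                "-filter_complex", "[0:v]" + filters + "[v]",
                "-map", "[v]", "-map", "1:a", "-shortest", "segment.mp4");
    }
}
```

The full video is then the concatenation of many such segments, plus the overlays.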

2. Making the videos YouTube-ready

Now that we have our videos ready, we want somebody to see them. But, after all we went through to generate the video automatically, we really do not want to upload it manually to YouTube, having to fill in all the metadata and choose a thumbnail image.

For this reason, the output of the program also includes a JSON file that contains information about the video. Most of this metadata is needed for YouTube APIs, such as a video title, description, tags, etc… The system will take care of automatically generating it, for example by choosing the first few sentences of the video as its description.

YouTube videos can also include coordinates of the geographical location they are filmed in, so we automatically extract and include those, if available, from Wikidata.
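The article doesn’t show the exact schema of this JSON file; a metadata file covering the fields mentioned above might look roughly like this (all field names and values are illustrative):

```json
{
  "title": "Rome",
  "subtitle": "History",
  "description": "Rome is the capital city of Italy...",
  "tags": ["Rome", "History", "Italy"],
  "location": { "latitude": 41.9, "longitude": 12.5 }
}
```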

The only thing that’s left is to generate a thumbnail image. This and the upload to YouTube parts have actually been implemented by a friend of mine who joined me in this project.

His code will read the title (e.g. Rome) and subtitle (e.g. History, the topic of the video) fields in the JSON file and create a nice-looking thumbnail image by overlaying them onto a picture taken from the video itself. This part is written in Go and uses the disintegration/imaging and fogleman/gg libraries. It will also take care of uploading the video to YouTube with all the metadata contained in the JSON file.

3. Conclusions

Of course, there are a lot of quirks that I didn’t mention in the article to keep it short and to the point (e.g., choosing the duration of the images according to the audio duration, avoiding zooming on images that have a caption so the caption doesn’t go out of the viewport, etc…). I didn’t want to go into probably boring implementation details, but if you have any questions feel free to leave a comment below!

Discovering Wikipedia edits made by institutions, companies and government agencies

TLDR: I built a tool to search Wikipedia edits made by organizations, companies, and government agencies. The tool is temporarily down, but you can read this article to get an overview of how it works. You might find some broken links in the article for this reason.
You can find some hilarious stuff (link)

A couple of months ago, I had the idea of analyzing Wikipedia edits to discover which public institutions, companies or government agencies were contributing to Wikipedia, and what they were editing.

After a quick Google search I realized that it had been done before, but the service, called WikiScanner, had been discontinued in 2007. After WikiScanner, the idea surfaced again several years later: in 2014 the @congressedits Twitter account was created, which automatically tweeted any Wikipedia edit made by IP addresses belonging to the U.S. Congress. The account was eventually suspended by Twitter (read why here). The code for this bot was released under a CC0 license on GitHub, and several other bots were created, looking for edits from different organizations.

At the time of this writing, some of these systems are still active on Twitter (e.g., @parliamentedits), but my interest was in building something that allowed users to search and navigate edits efficiently (rather than just having a stream of tweets), and that was not limited to monitoring a single organization. I decided to dedicate some of my free time to this side project and build Wikiwho.

Methodology

I decided to use the same approach that WikiScanner used, which is to identify organizations based on their IP address. This is possible because, when somebody edits Wikipedia without being logged in, their IP address is logged instead of a username. It follows that this system is not able to identify edits made by logged-in users, which are simply discarded during a pre-processing step.

Each log entry contains metadata about the edit, e.g., time, IP address, edited page, but does not provide any information on what was actually edited. This already allows us to perform some analysis, e.g., which organizations are the most active, which pages have been edited the most and by whom, which periods of time saw the most activity, etc…

However, this doesn’t give us any insight as to what was actually changed in the content of the page. To achieve this we need to use Wikipedia’s APIs in order to retrieve, for each edit, the diff with respect to the previous version of the same page. This way we can see what text was actually added or removed.

At this point, we have a huge amount of diffs with full information. To put this all together and allow easy navigation, I built a web interface on top of this data. The stack I’m using is Java, with MongoDB for storing most of the data and Lucene to enable full-text search. The following is an overview of my implementation at a very high level.

1. Determining IP ranges of interest

Since we are going to identify organizations based on the IP address, we need a mapping between IP address ranges and organizations. To avoid compiling this list myself, I used the IP ranges available at [1] and [2] (I may have used some other source as well, but I am honestly not sure at the moment, since this was a few months ago). It goes without saying that I cannot vouch for the accuracy of these lists. I made a small number of minor edits where I found obvious mistakes, but nothing more than that.
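Matching an edit’s IP against those ranges is a plain IPv4 CIDR membership test; a minimal version (the example range in the comment is the one widely reported for the U.S. House of Representatives):

```java
public class IpRanges {
    // Pack a dotted-quad IPv4 address into a long.
    static long toLong(String ip) {
        long v = 0;
        for (String part : ip.split("\\.")) v = (v << 8) | Integer.parseInt(part);
        return v;
    }

    // e.g. inRange("143.231.5.7", "143.231.0.0/16") — a reported U.S. House range.
    static boolean inRange(String ip, String cidr) {
        String[] parts = cidr.split("/");
        int bits = Integer.parseInt(parts[1]);
        long mask = bits == 0 ? 0 : (-1L << (32 - bits)) & 0xFFFFFFFFL;
        return (toLong(ip) & mask) == (toLong(parts[0]) & mask);
    }
}
```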

However, the final interface provides a way to quickly check if the information about the IP address is wrong:

The IP button will open ipinfo.io, where you can see information about the organization the IP address belongs to

2. Parsing Wikipedia dumps

Wikipedia’s edit history is released periodically in XML format. There are several types of dumps: the one we are interested in is enwiki-$DATE-stub-meta-history.xml.gz, which contains the full edit history (metadata only).

The key elements in these dumps are revision nodes: for each revision, we will use its unique ID (to later retrieve its content with Wikipedia’s APIs), the IP address of the contributor and the timestamp to allow analyzing edits by date ranges. The XML for these nodes looks like this:

<revision>
  <id>186146</id>
  <parentid>178538</parentid>
  <timestamp>2002-08-28T05:53:36Z</timestamp>
  <contributor>
    <ip>216.235.32.129</ip>
  </contributor>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text bytes="9959" id="186146" />
  <sha1>mau31t4o85dh1ksu79r1xvgs2dujr8c</sha1>
</revision>

You’ll need to use an event-based XML parser, as the file is too big to fit in memory. We can iterate over the dump and extract a list of edits where the IP belongs to one of the ranges that we defined in the previous step. I discarded revisions where the contributor field contains a username or an IPv6 address.
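A sketch of that event-based extraction using the JDK’s built-in StAX parser, keeping only the fields discussed above (my code was not necessarily structured like this; field and record names are mine):

```java
import javax.xml.stream.*;
import java.util.*;

public class DumpParser {
    record Edit(String revisionId, String ip, String timestamp) {}

    // Stream over the dump, emitting one Edit per anonymous revision.
    static List<Edit> parse(java.io.Reader xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            List<Edit> edits = new ArrayList<>();
            String id = null, ip = null, ts = null;
            boolean inContributor = false;
            while (r.hasNext()) {
                int ev = r.next();
                if (ev == XMLStreamConstants.START_ELEMENT) {
                    switch (r.getLocalName()) {
                        case "contributor" -> inContributor = true;
                        // First <id> under <revision> is the revision ID.
                        case "id" -> { if (!inContributor && id == null) id = r.getElementText(); }
                        case "ip" -> ip = r.getElementText();
                        case "timestamp" -> ts = r.getElementText();
                    }
                } else if (ev == XMLStreamConstants.END_ELEMENT) {
                    switch (r.getLocalName()) {
                        case "contributor" -> inContributor = false;
                        case "revision" -> {
                            if (ip != null) edits.add(new Edit(id, ip, ts)); // no <ip> = logged-in user
                            id = ip = ts = null;
                        }
                    }
                }
            }
            return edits;
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }
}
```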

3. Calling the Wikipedia APIs

We still need to get the actual content of the diff. The request to Wikipedia’s APIs looks like this:

https://en.wikipedia.org/w/api.php?action=compare&fromrev=27272&torelative=prev&prop=diff|size|ids|title&format=json

fromrev is the ID of the revision we are interested in. Then, we can either specify another revision ID that we want to compare our revision with (using torev) or, if we want to compare with the previous revision, use torelative=prev. This way we do not have the hassle of retrieving the ID of the previous revision ourselves when parsing the XML dump. The prop parameter simply defines the fields we want in our output. This is what you get as a response.

The diff field contains an HTML-formatted version of the text, in which additions and deletions are marked by specific tags. It is basically what is shown if you open the diff in the Wikipedia comparison tool. After a few HTML parsing steps, we know what was actually added or removed. This is not strictly a necessary step, as we could have simply indexed the full text of the diff, but doing so allows us to perform full-text search specifically for added or removed text.
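MediaWiki marks inline changes in the diff HTML with `<ins>`/`<del>` elements carrying the `diffchange` class; a regex-based sketch of extracting them (the real code parses the HTML properly rather than with a regex):

```java
import java.util.*;
import java.util.regex.*;

public class DiffParser {
    private static final Pattern CHANGE = Pattern.compile(
            "<(ins|del)[^>]*class=\"[^\"]*diffchange[^\"]*\"[^>]*>(.*?)</\\1>",
            Pattern.DOTALL);

    // Returns the added and removed text fragments found in a diff HTML blob.
    static Map<String, List<String>> changes(String diffHtml) {
        Map<String, List<String>> out =
                Map.of("added", new ArrayList<>(), "removed", new ArrayList<>());
        Matcher m = CHANGE.matcher(diffHtml);
        while (m.find())
            out.get(m.group(1).equals("ins") ? "added" : "removed")
               .add(m.group(2).replaceAll("<[^>]+>", "").trim());
        return out;
    }
}
```

The “added” and “removed” lists are then what gets indexed separately for full-text search.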

4. Putting it all together

At this point we have a huge dump of JSON objects containing detailed information about every edit. We know who made it, what changed, and when. We have technically extracted all the information we need, we just need an efficient way of exploring it.

This last part is quite boring and just revolves around processing the diffs, computing stats, storing the info in some sort of database, and building an interface on top of it. To save development time, I decided to use the stack I’m most familiar with: a Java backend, MongoDB for storing stuff and Lucene to enable full-text search. Since I’m not really fond of working with Javascript, everything is rendered on the server side and JS usage is limited to where it’s absolutely necessary.

5. The final result

If you want to take a look, you can see the tool in action here.

It allows you to:

  • Look at the history of edits of a specific page/organization;
  • Look at how edits for a page/organization are distributed over time, and filter for a specific month;
  • Full-text search for removed/added text;
  • Upvote interesting edits (anonymously, no need to sign up). The most voted edits are shown on the home page; hopefully something interesting will get to the top.

5.1 A note on full-text search

Using Lucene allows us to perform some advanced queries to bring up more relevant results. Here’s a non-exhaustive list of the possibilities:

  • "exact phrase search": use quotes to look for an exact phrase;
  • +wikipedia -google: use + and - operators to include/exclude particular words;
  • wiki*: use wildcard operators;
  • "wikipedia google"~N: proximity search: look for two words at maximum distance N (in number of words).

6. Conclusions

From my own exploration, most of the stuff you can find is just vandalism or trolling, and I highly doubt that you will find anything of relevance. Some of the edits are really hilarious, though!

Overall, this was a really fun project to implement and I hope you’ll have some fun exploring the system and sharing your finds with everybody else.

See you!