Steganography: Hiding Data Inside Data

Author: Ryan Gibson
Quantitative Analyst | Computer Scientist

What is steganography?

In short, steganography is the art of concealing information within another, non-secret message, much like the use of invisible ink on a seemingly innocuous letter.[1] The idea is that you could pass the message through many untrusted carriers, such as the internet, without arousing suspicion from most observers.

A message is hidden inside an image and sent over the internet, confusing an onlooker. The message is then recovered from the image by the recipient.
Alice transfers a hidden message to Bob over the internet within a seemingly unremarkable image.

In today’s digital age, you may be surprised as to how much data can be crammed into a file without changing it much at all. For example, below are three tiny 220x220 photos of a flower, but

  • One contains the entire, uncompressed text of the United States Constitution.
  • Another contains the entire text of Shakespeare’s Macbeth, which runs to a little under 20k words.
Three almost-identical images of a pink flower against the blue sky
These tiny flowers look practically identical, but two of them hide thousands of words of information. Can you tell which is which?

We’ll delve into how this is possible in How does (digital) steganography work?, but first let’s explore some real-world examples.

Real-world examples of steganography

We’ll start with some physical, non-digital instances.

The EURion constellation

One of the simplest methods to hide data is to overlay a pattern in the hopes that it can be recovered later.

For example, many banknotes worldwide contain a precise arrangement of circles designed to allow printers and imaging software to combat counterfeiting operations. The pattern has never been officially publicized, but it is informally called the “EURion constellation” and has been integrated into the currencies of at least 60 or so countries.

If you happen to have a scanner and some cash on hand, you can try copying one of these banknotes. Depending on the model and brand of the scanner, it might refuse to copy or intentionally corrupt the print by adding stripes across the bill! The one that I own tends to forcibly stop the print halfway through.

The EURion constellation and its presence on an American $20 bill.
Left: the specific pattern of circles in the EURion constellation. Middle: A portion of the back of an American $20 bill. Right: Same as the middle, but with the various constellations highlighted in green.
Three more banknotes with patterns of circles included in the design, in backgrounds and as music notes.
Several other examples of the EURion constellation on British, German, and Euro banknotes. Some are more creative with their inclusion into the design than others.

Printer “Machine Identification Codes”

In another covert application of steganography, many color printers use tiny yellow dots that are invisible to the naked eye to overlay a tracking watermark. These encode the serial number of the printer and some date and time information across every printed page.

This is also rumored to be one of the reasons why some printers refuse to print black-and-white documents when they are running low on color ink.

The existence of this technology remained unknown to the public for around two decades as it was developed under secret agreements with various national governments to enhance their forensic tracing capabilities. As a result, it’s been used to track down counterfeiters and whistleblowers across the world.

A close image of printed text under white and blue light.
An image of text printed from a LaserJet printer. Blue light makes the Machine Identification Code visible, consisting of scattered yellow dots that are ~0.1mm wide.

Steganography in video games

Game developers also use steganography to identify the author of screenshots or gameplay videos, especially when they include cheating, abuse, or unauthorized use of private servers.

In the 2000s, Blizzard implemented very faint watermarks on screenshots of World of Warcraft which contained repeating patterns of dots across the entire screen. These patterns, developed by Digimarc, encoded various details of the user’s account and the server that they were logged into. Like the other examples above, this screenshot tagging remained entirely secret for the first few years of its existence.

A series of rectangles, filled with random looking dots, regularly spaced across two otherwise blank images.
Two examples of the watermarks used in World of Warcraft, heavily post-processed to reveal the hidden pattern.

Similarly, Microsoft encoded hardware information in the user interface of the Xbox 360’s early builds. Each console’s animations were unique, which allowed the company to crack down on potential leakers. At the time, the employees were under NDAs and would be subject to civil penalties for disclosing nonpublic information about the console’s development.

A tweet that reads "One of the most fun jobs I ever had was figuring out how to embed the serial number of your Xbox 360 into rings emanating from the bottom right, so we could track and identify leaks", followed by a screenshot of an Xbox 360 menu.

How does (digital) steganography work?

In the realm of digital steganography, there are many different techniques, but one of the simplest is “Least Significant Bit” (LSB) steganography.

Basically, the method takes advantage of the fact that most data formats encode information in binary numbers, and the least significant bits of these have the smallest impact on the overall value. By replacing these unimportant bits with a secondary message, we can hide data without making any apparent changes to the file’s original appearance or meaning.

For example, a common image encoding is to store how much red, green, and blue (RGB) is in each pixel with one byte for each color. These values range from 0 to 255 and we can usually change them slightly without most people noticing. Human senses are just far too imprecise to tell the difference, especially when you’re not looking for it!

We can hide the binary string 101 in the LSBs of a 0 red, 163 green, and 233 blue pixel. Afterwards, it becomes 1 red, 162 green, and 233 blue.
An example of how we can replace the LSBs of an image’s pixels to encode a hidden message. The resulting change is nigh-impossible to visually detect.
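The replacement shown in the figure can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the images in this post: the names `embed_bits` and `extract_bits` are hypothetical, and the cover data is assumed to be a flat list of 8-bit channel values.

```python
def embed_bits(channels, bits):
    """Return a copy of `channels` whose leading values carry `bits` in their LSBs."""
    if len(bits) > len(channels):
        raise ValueError("not enough cover data for the message")
    stego = list(channels)
    for i, bit in enumerate(bits):
        # Clear the least significant bit, then set it to the message bit.
        stego[i] = (stego[i] & ~1) | bit
    return stego


def extract_bits(channels, count):
    """Read back the least significant bit of the first `count` values."""
    return [value & 1 for value in channels[:count]]


pixel = [0, 163, 233]                  # red, green, blue
hidden = embed_bits(pixel, [1, 0, 1])  # hide the binary string 101
print(hidden)                          # [1, 162, 233]
print(extract_bits(hidden, 3))         # [1, 0, 1]
```

Each channel moves by at most 1 out of 255, and the blue value does not change at all because its LSB already matched the message bit.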

However, this hidden data can be easily exposed in a “visual attack” where we inspect the LSBs of the image. For instance, if we perform this attack on the three flowers shown at the start of this post, the differences become obvious.

A grid of three flower images on top of three black-and-white noisy images.
Top: the three flower images from the start of this blog post. Bottom: a visualization of the least significant bits of each image.
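One way to carry out such a visual attack is to keep only the lowest bit of every channel and stretch it to full brightness. The sketch below assumes the image is already loaded as nested lists of RGB tuples; the name `lsb_plane` is hypothetical, and the exact rendering used for the figure above may differ.

```python
def lsb_plane(pixels):
    """Map every channel value to 0 or 255 according to its least significant bit."""
    return [
        [tuple(255 * (channel & 1) for channel in pixel) for pixel in row]
        for row in pixels
    ]


# A tiny 2x2 "image" of (red, green, blue) tuples, purely for illustration.
image = [
    [(0, 163, 233), (12, 200, 254)],
    [(1, 162, 233), (13, 201, 255)],
]
for row in lsb_plane(image):
    print(row)
```

In an untouched photo the resulting plane tends to follow the picture's structure, whereas embedded data usually looks like noise.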

The original image is on the right, and you can faintly see the flower’s outline in its LSBs. In contrast,

  • The one on the left appears completely random[2] since it contained the contents of Macbeth.
  • The one in the middle contained the uncompressed text of the U.S. Constitution and you can visually confirm that the data only takes up the first ~3/4 of the image.

In general, steganographic techniques and their adversarial “steganalysis” counterparts are constantly evolving. More advanced algorithms than this one minimize changes to the original image’s statistics and can only be detected with much more sophisticated methods.

On the other hand, this simple technique lets us store a considerable amount of data! This is a direct consequence of the use of binary encoding since the last bit in each byte can only change the color by 1/255 (~0.4%) despite taking up 1/8th (12.5%) of the data itself.

Indeed, in the flower images above we’ve replaced a whole 25% of the actual image data but only altered around 1% of the color information. There is a significant trade-off between the amount of hidden data and the impact on the visual quality of the image.[3]
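As a rough check on how much room this leaves, the raw capacity of an image can be estimated directly. The sketch assumes three 8-bit color channels per pixel and ignores any header or bookkeeping bytes a real tool might add; `lsb_capacity_bytes` is a hypothetical helper, and the choice of 2 LSBs is inferred from the 25% figure above.

```python
def lsb_capacity_bytes(width, height, num_lsb, channels=3):
    """Number of whole bytes that fit in the `num_lsb` lowest bits of every channel."""
    return width * height * channels * num_lsb // 8


# A 220x220 RGB image with 2 LSBs replaced per channel.
print(lsb_capacity_bytes(220, 220, 2))  # 36300
```

That is roughly 36 KB of hidden payload in an image of only 48,400 pixels.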

A grid of eight test images that get increasingly noisy as the number of LSBs increase.
1 LSB uses 12.5% of data and ~0.4% of color. 2 LSBs use 25.0% of data and ~1.2% of color.
3 LSBs use 37.5% of data and ~2.7% of color. 4 LSBs use 50.0% of data and ~5.9% of color.
5 LSBs use 62.5% of data and ~12.2% of color. 6 LSBs use 75.0% of data and ~24.7% of color.
7 LSBs use 87.5% of data and ~49.8% of color. 8 LSBs use 100.0% of data and ~100.0% of color.
A demonstration of the trade-off between the number of LSBs used to hide data in an image and the corresponding loss of the original color information.
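The percentages in this grid follow from two simple ratios, assuming 8 bits per color channel: replacing k bits uses k/8 of the data, and the largest possible change to a channel value is (2^k - 1) out of 255. A short sketch, with `tradeoff` as a hypothetical helper name, recomputes them:

```python
def tradeoff(num_lsb, bits_per_channel=8):
    """Return (fraction of data replaced, largest possible fractional color change)."""
    data_fraction = num_lsb / bits_per_channel
    max_value = 2 ** bits_per_channel - 1        # 255 for 8-bit channels
    color_fraction = (2 ** num_lsb - 1) / max_value
    return data_fraction, color_fraction


for k in range(1, 9):
    data, color = tradeoff(k)
    print(f"{k} LSBs use {data:.1%} of data and ~{color:.1%} of color")
```

The data share grows linearly with the number of bits while the color impact roughly doubles with each one, which is why the first couple of bits are so cheap.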

More creative steganography techniques

While we’ve provided a reasonable introduction to the basic ideas, there is an abundance of more interesting methods, so we’ll briefly mention some of them here.

  • Text steganography: Messages can be hidden within the formatting, whitespace, or invisible characters of a text itself. Some more intriguing techniques use specific sentence structures or grammatical constructs to impart information. Think of the stereotypical scenario in which you suspect something is amiss when a friend texts you in a particularly unusual writing style.
  • Spread Spectrum: These techniques spread hidden data over a wide range of frequencies, but at a lower amplitude, effectively concealing the covert message beneath the natural noise of the transmission medium. Similar variants are also applicable to images and videos.
  • Audio steganography: In addition to the usual binary techniques, fine manipulation of echoes, harmonics, or the underlying frequency bands can be used to store information.
  • Networking steganography: Many protocols can be manipulated to convey information through calculated usage of (perhaps nonstandard) features, slight manipulation of timing delays between packets, or intentional corruptions that would appear to be typical transmission errors.
  • EOF steganography: End of file markers or headers can be manipulated to hide data outside the intended scope of a file. While not strictly steganography in the traditional sense, it has been repeatedly used in malware and hacking operations, so it is worth mentioning.[4]
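To make the text steganography idea concrete, here is a minimal sketch that hides a message in invisible characters appended to a cover sentence. The scheme is an illustrative assumption rather than a standard: it uses the zero-width space (U+200B) for a 0 bit and the zero-width non-joiner (U+200C) for a 1 bit, and the function names are hypothetical.

```python
ZERO = "\u200b"  # zero-width space, stands for a 0 bit (assumed convention)
ONE = "\u200c"   # zero-width non-joiner, stands for a 1 bit (assumed convention)


def hide_in_text(cover, secret):
    """Append `secret` to `cover` as a run of invisible characters."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    return cover + "".join(ONE if bit == "1" else ZERO for bit in bits)


def reveal_from_text(stego):
    """Collect the invisible characters and decode them back into text."""
    bits = "".join("1" if ch == ONE else "0" for ch in stego if ch in (ZERO, ONE))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


stego = hide_in_text("Nothing to see here.", "hi")
print(stego)                    # displays like the cover text in most renderers
print(reveal_from_text(stego))  # hi
```

This only works if the cover text does not already contain those two characters, and many platforms strip or normalize them, so it is fragile in practice.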

If these topics sound interesting to you, I highly recommend searching the internet and exploring any new techniques that come to mind!

See also and references

  • My Python package, stego-lsb, which I used to generate the steganographed images in this post. It also supports sound files and arbitrary sequences of binary data.
  • A forum post containing details of the steganographic methods used in World of Warcraft.
  • A Hacker News thread discussing the tracking methods used in Xbox 360 NDA beta builds.
  • A Computerphile video on steganographic techniques in images, which includes a discussion of a method for JPEG images that is robust to simple visual attacks.

More generally, consider the following Wikipedia articles.


  1. In fact, steganography comes from the Greek word “steganographia”, which literally means something akin to “hidden writing”.

  2. Obviously, it’s not actually random since this is just compressed English text. In practice, the hidden data should probably be encrypted in some way that increases the apparent randomness. Otherwise, steganography simply becomes an exercise in security through obscurity.

  3. If the transmission channel is noisy, a certain amount of error correction would also need to be included, which will necessarily decrease the amount of data available for use. This includes electrical interference on the wire, image compression, audio being played through physical loudspeakers rather than in a perfect digital medium, etc.

  4. This is sometimes referred to as stegomalware.