A Beginner's Guide to IPFS Content Addressing

When comparing the centralized web to the decentralized web, a crucial distinction is how data is identified and retrieved. To visualize this difference, consider two different methods used to provide directions on how to locate a food item in a store.

One method would provide directional instructions, such as:

"Go to the Whole Foods Market at 828 Broadway in New York City, navigate to aisle number 2, walk 10 feet down the aisle, turn left, then locate the yellow pasta box 16 inches from the floor to the right of the 3rd shelf."

This method is referred to as location addressing, because it tells you where to go to find the item. An alternative method is content addressing, which identifies the item by its contents, such as:

“Purchase a box of Barilla® Orzo Pasta 1 Lb. Box, UPC 00076808513981”

If the goal is to purchase this exact food item, which of these descriptions do you find clearer? Which method gives you more confidence that you’ve found the exact item intended?

Location addressing refers to identifying data by its location, typically controlled by a specific entity. In this scenario, you are given specific instructions about where to look for the food item. This is a common way of identifying data on the centralized web.

Content addressing, on the other hand, provides a unique identifier for the data derived from its content. In this example, the exact food item’s name and its unique UPC value were provided to make it clear exactly which item was intended. This approach allows us to retrieve things from multiple sources, such as another grocery store, rather than relying on a single location or entity. This is how data is identified on the decentralized web.

Location Addressing on the Centralized Web

On the centralized web, URLs are the primary means of addressing data. They allow us to create links and connect information on the web, which is crucial for its functionality. However, URLs rely on the location where data is stored, rather than its content.

Although URLs are widely used and many are familiar with how to use them, the contents of a file or piece of data cannot be verified or related to its URL. For example, the URL https://www.puppies.com/husky.jpg may lead users to believe that this URL represents an image of a husky, but there is no way that this can be verified using the URL by itself. The file could be anything.

URLs also indicate which authority is hosting the data, based on the URL’s domain name. Data accessed through location addressing must use a centralized authority to host and store the data. Certain assumptions can be made about these authorities or domains, such as their credibility or the type of data they host, but users cannot be sure of these factors.

On the traditional web, users typically access content by visiting a specific domain, such as puppies.com, to find the file stored at a certain URL, like husky.jpg. However, if the domain becomes unavailable, users lose access to that image.

Relying on centralized authorities also comes with another risk: reliability. Data stored with centralized providers is vulnerable to widespread outages, natural disasters, or something as minor as a broken network cable. That’s because the data is often siloed in one geographic location and can only be accessed through its location URL.

Ultimately, location-based addressing on the centralized web comes with several limitations. A user who wants a particular file, such as a picture of a dog, cannot derive its URL from the content alone. They also cannot judge the reliability of the domain hosting it, cannot verify that the stored content is exactly what they expect, and may lose access to the data at any time.

Due to the widespread reliance on central authorities and the risk of human error when tasked to accurately label content, it's easy for malicious actors to deceive users about what's really at a particular URL.

Additionally, it's not uncommon for hundreds or thousands of people to store the exact same image of a dog, each using a different filename and domain, which can result in a lot of duplication. Even on personal computers, users are likely to have saved identical files with slightly different names or versions. The web is a jumbled collection of data saved multiple times at different URLs, and it's difficult to determine which items are duplicates.

Content Addressing on the Decentralized Web

As previously discussed, the centralized web relies on trusted central authorities to host data that is accessed through location-based URLs. However, there's an alternative: the decentralized web. On the decentralized web, users can all host each other's data and use a different form of linking that is more secure and verifiable, and that helps prevent mass duplication.

One of the most important tools on the decentralized web is cryptographic hashing, which enables the method of data storage and retrieval known as content addressing. It also frees users from relying on central authorities to host data and keep it online. Hashing takes data of any size and type and returns a single, fixed-size "hash" that represents it. Although these hashes may not be very descriptive, they're more secure because:

  • Cryptographic hashes are derived from the content’s data itself, meaning that everyone using the same algorithm on the same data will arrive at the same hash. This allows us to confirm that two pieces of data are identical simply by comparing their hashes.
  • Cryptographic hashes are unique for all practical purposes. If someone modifies a piece of data (or any metadata that is hashed along with it), even by a single character, the resulting hash will be completely different, allowing us to easily detect any changes.
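
The two properties above can be checked with a few lines of Python. This is a minimal sketch that assumes SHA-256 as the hashing algorithm; the sample byte strings and the helper name sha256_hex are made up for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Illustrative sample data (not real image bytes).
original = b"A photo of a husky"
same_copy = b"A photo of a husky"
modified = b"A photo of a husky!"

# Identical content always produces the identical hash.
print(sha256_hex(original) == sha256_hex(same_copy))  # True

# A one-character change produces a different hash.
print(sha256_hex(original) == sha256_hex(modified))  # False

# The digest is fixed-size regardless of input size: 64 hex characters.
print(len(sha256_hex(b"")), len(sha256_hex(b"x" * 1_000_000)))  # 64 64
```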

On the decentralized web, users all participate in hosting each other's data, and content addressing empowers us to have confidence in the shared information. Even if users lack knowledge about the hosts, hashes provide a safeguard against malicious actors that may deceive users regarding the file contents. This is why cryptographic hashing is a critical component of the decentralized web.

Using Content Addressing

Data retrieval works differently on the decentralized web than on the centralized web. To retrieve a particular photo of a cute pet, users request it by its content address, known as a CID, and they don't need to rely on a specific domain.

A CID (Content Identifier) is a specific form of content address used on the decentralized web. Although it was originally created for IPFS, several other protocols and information systems use the CID standard. A CID combines a cryptographic hash of the content with a codec identifier, which tells software how to interpret the encoded data. Codecs encode and decode data in specific formats.
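
To make the "hash plus codec" structure concrete, here is a rough sketch that assembles a version 1 CID by hand using only the Python standard library. It assumes the "raw" codec (code 0x55), SHA-256 (multihash code 0x12), and base32 text encoding with the "b" prefix; the function name raw_cidv1 is hypothetical. CIDs produced by real IPFS tools may differ for the same bytes, because those tools can split files into chunks and wrap them in additional data structures before hashing.

```python
import base64
import hashlib

# All four codes below are smaller than 0x80, so each fits in a single byte.
CID_VERSION = 0x01    # CID version 1
RAW_CODEC = 0x55      # codec identifier for "raw" bytes
SHA2_256 = 0x12       # hash function identifier for SHA-256
DIGEST_LENGTH = 0x20  # SHA-256 digests are 32 bytes long


def raw_cidv1(data: bytes) -> str:
    """Build a CIDv1 string for data treated as one raw block (illustrative only)."""
    digest = hashlib.sha256(data).digest()
    # Hash part: which hash function, how long the digest is, then the digest.
    multihash = bytes([SHA2_256, DIGEST_LENGTH]) + digest
    # Full binary CID: version, codec, then the hash part.
    cid_bytes = bytes([CID_VERSION, RAW_CODEC]) + multihash
    # Text form: lowercase base32 without padding, prefixed with "b".
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


cid = raw_cidv1(b"hello world")
print(cid)
print(cid.startswith("bafkrei"))  # True: fixed prefix for version 1, raw codec, SHA-256
```

Because the version, codec, and hash function codes sit at the front of the identifier, every CID built this way starts with the same few characters, and only the digest portion varies with the content.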

Users can also upload their own data files to IPFS and receive each file’s CID in return. They can then share that CID with someone else, who can query IPFS for it.

When a user asks for a CID, the network is searched for peers that can provide the matching content. When a peer returns the content, the user can be sure it's the correct file by checking that it hashes to the value in the CID. Even if one data host is offline, the file can still be retrieved from another peer that has it.
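
The retrieval flow described above can be sketched with an in-memory toy model. Everything here is an assumption made for illustration: the Peer class, the fetch function, and the use of a plain SHA-256 hex digest in place of a real CID. Actual IPFS peer discovery and data transfer are considerably more involved.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def content_id(data: bytes) -> str:
    """Toy stand-in for a CID: the SHA-256 hex digest of the data."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Peer:
    """Hypothetical peer holding blocks keyed by content identifier."""
    name: str
    online: bool = True
    blocks: Dict[str, bytes] = field(default_factory=dict)


def fetch(cid: str, peers: List[Peer]) -> Optional[bytes]:
    """Return the first response whose hash matches the requested identifier."""
    for peer in peers:
        if not peer.online:
            continue  # an offline host is skipped; other peers may still serve the data
        data = peer.blocks.get(cid)
        if data is None:
            continue  # this peer does not have the content
        if content_id(data) == cid:
            return data  # verified: the bytes hash to the identifier we asked for
        # Otherwise the peer returned something else, so the response is discarded.
    return None


photo = b"bytes of a husky photo"
cid = content_id(photo)

peers = [
    Peer("offline-host", online=False, blocks={cid: photo}),
    Peer("tampering-host", blocks={cid: b"not the photo"}),
    Peer("honest-host", blocks={cid: photo}),
]

print(fetch(cid, peers) == photo)  # True: served by the honest host
```

The requester never has to trust any individual host in this model. A response is accepted only if its hash matches the identifier that was requested.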

Many formats and protocols already employ content addressing, but they may differ in how they interpret data and which cryptographic algorithm they employ for content hashing. CIDs enable us to establish a universal identifier for data stored using any of these systems.

IPFS Pinning

To ensure that the content behind a CID remains available, it must be pinned to at least one IPFS node on the network.

IPFS pinning refers to the process of storing a file or folder within an IPFS node’s permanent storage instead of in the node’s cache storage. Unless a file is pinned, it’s stored in cache storage that is periodically cleared by the node’s garbage collection process.
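
The difference between pinned and cached content can be illustrated with a small toy model. The ToyNode class and its method names are hypothetical and do not correspond to any real IPFS API; the sketch only shows the rule that garbage collection removes everything that is not pinned.

```python
import hashlib
from typing import Dict, List, Set


class ToyNode:
    """Hypothetical, simplified model of a node's block store and pin set."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}  # identifier -> stored bytes
        self.pins: Set[str] = set()         # identifiers protected from cleanup

    def add(self, data: bytes, pin: bool = False) -> str:
        """Store data and return its identifier (a SHA-256 hex digest here)."""
        cid = hashlib.sha256(data).hexdigest()
        self.blocks[cid] = data
        if pin:
            self.pins.add(cid)
        return cid

    def collect_garbage(self) -> List[str]:
        """Remove every block that is not pinned and return the removed identifiers."""
        removed = [cid for cid in self.blocks if cid not in self.pins]
        for cid in removed:
            del self.blocks[cid]
        return removed


node = ToyNode()
pinned_cid = node.add(b"important file", pin=True)
cached_cid = node.add(b"temporarily cached file")

node.collect_garbage()

print(pinned_cid in node.blocks)  # True: pinned content survives
print(cached_cid in node.blocks)  # False: unpinned content was cleared
```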

Filebase is a geo-redundant IPFS pinning service and decentralized storage provider. When a file is uploaded to an IPFS bucket on Filebase, it is automatically pinned to the IPFS network with 3 copies, each stored on an IPFS node in a different geographic region.

You can sign up for a free Filebase account to get started with your IPFS journey today.

If you have any questions, please join our Discord server, or send us an email at hello@filebase.com.