Notice: This repo is old, and is not actively being updated. However, if you're keen to edit, please do so!
Helping me understand what IPFS is and how it works
This is a document where I will define and explain terms that I don't understand about IPFS. This is not necessarily the best place for answering questions about IPFS; if you have questions that you want answered by knowledgeable people, I'd suggest either ipfs/ipfs, ipfs/faq, ipfs/support, or ipfs/go-ipfs. This document is for me; it is where I will ask questions I do not know the answer to, and search to answer them on my own, providing non-technical answers as much as possible.
My goal is to eventually make a textbook on what IPFS is, how crypto works, what the distributed net would look like, and how to get caught up and learn enough to contribute on one's own.
tl;dr: Explain IPFS like I am 5.
If you have questions about IPFS, too, and you want a place to get a really in-depth answer about what things are, add those questions here by opening an issue. If you know answers and don't mind me (or other contributors!) asking long, exhaustive questions about what the words you use mean, please answer questions!
We will eventually need to split this into chapters and sections. For now, feel free to dive in. Work should mainly happen in the issues; once an issue is answered, summarize it and add it to this README in a PR.
Note: This is very much a work in progress! Currently switching from PR to Issue question and answer.
Descriptive text from the homepage
The InterPlanetary File System (IPFS) is a new hypermedia distribution protocol, addressed by content and identities. IPFS enables the creation of completely distributed applications. It aims to make the web faster, safer, and more open.
IPFS is an open source project developed by the team at Interplanetary Networks and many contributors from the open source community.
IPFS is a peer-to-peer distributed file system that seeks to connect all computing devices with the same system of files. In some ways, IPFS is similar to the Web, but IPFS could be seen as a single BitTorrent swarm, exchanging objects within one Git repository. In other words, IPFS provides a high throughput content-addressed block storage model, with content-addressed hyperlinks. This forms a generalized Merkle DAG, a data structure upon which one can build versioned file systems, blockchains, and even a Permanent Web. IPFS combines a distributed hashtable, an incentivized block exchange, and a self-certifying namespace. IPFS has no single point of failure, and nodes do not need to trust each other.
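To make "content-addressed hyperlinks" and "Merkle DAG" a little more concrete, here is a minimal Python sketch. It is not how IPFS actually serializes or hashes objects (real IPFS uses its own formats and multihash-based identifiers); the `put` helper, the node layout with `data` and `links`, and the file names are all assumptions made up for illustration. The idea it shows: each node is stored under the hash of its own content, and a parent links to its children by their hashes, so changing one file changes the root while unchanged files are shared.

```python
import hashlib
import json

def put(store, node):
    """Store a node under the SHA-256 of its serialized form and return that hash.
    (Hypothetical helper; real IPFS uses different encodings and identifiers.)"""
    data = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

store = {}

# Version 1: a directory node linking to two file nodes by their hashes.
hello_v1 = put(store, {"data": "hello", "links": {}})
readme = put(store, {"data": "docs", "links": {}})
root_v1 = put(store, {"data": "", "links": {"hello.txt": hello_v1, "README": readme}})

# Version 2: only hello.txt changes; README is reused because its hash is the same.
hello_v2 = put(store, {"data": "hello world", "links": {}})
root_v2 = put(store, {"data": "", "links": {"hello.txt": hello_v2, "README": readme}})

print(root_v1 != root_v2)  # the root hash changes when any linked content changes
print(len(store))          # README is stored only once across both versions
```

This is roughly why such a structure suits versioned file systems: two versions that share most of their content also share most of their stored nodes.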
Answered in the ipfs/ipfs repository, here:
The original name was GFS, which stood for the Global File System and seemed more accurate than GitFS. But that exact name was already taken. So I switched Global for Galactic, in an homage to Licklider's Intergalactic Computer Network, and because peer-to-peer systems look like galaxies to me. But GFS caused confusion with yet another GFS, even though that one is not even open source.
By popular demand (there were votes and a pronouncement of "GFS is dead. Long live IPFS!" and everything), I switched it to IPFS - InterPlanetary File System, which has several nice properties:
Hypermedia is a non-linear (as in, it can be consumed in any order) medium of information similar to "multimedia", but including hyperlinks. An example is the World Wide Web (Internet + DNS + URLs + Servers) which can offer audio, video, text, graphics and hyperlinks (URLs), just like IPFS can.
A Protocol is a set of rules, typically used to solve one or more problems. In Information Technology, the word Protocol is used to refer to the set of rules, its technical specification and its implementation. An implementation of a protocol is a program or part of a program which conforms to the specification of the protocol. A program may implement multiple protocols, and each protocol has a different purpose.
A distribution protocol is a protocol which solves the problem of distributing data. A networking protocol can be generic or specialize in different use cases or for different kinds of applications.
Let's take a step back.
First, let's take a look at classic URLs, like http://google.com. This URL tells your computer to use the HTTP protocol to access google.com. Your computer then uses the DNS protocol to figure out where google.com is. When it's done, it creates a connection via the Internet to google.com's IP address. This means that URLs on the Internet address content by location: if the content is moved to another location, the connection won't work and the URL will be useless.
IPFS addresses data by content which means that IPFS URIs (like
/ipfs/QmTkzDwWqPbnAh5YiV5VwcTLnGdwSNsNTn2aDxdXBFca7D/example) tell your computer what object you want and not where to get it. IPFS then figures out how to get the object on its own. This has many advantages:
IPFS can also address data by identity, using IPNS. This means that each computer running IPFS has an identity. Everything that they do is signed using their credentials. IPNS makes it possible to have dynamic content in IPFS, for example a web application which can be updated without changing its address!
This works because the application can be referenced with an IPNS address, which tells your computer the identity of the publisher of the application. IPFS will then figure out which IPFS address the owner of that identity (a.k.a. the publisher) has associated with it. The associated address can only be changed by the publisher. If a bad guy tries to impersonate a machine he doesn't own, your computer will not trust him: when he tells your computer the wrong IPFS address for an IPNS name, that message won't be signed with the publisher's private key, because only the publisher has that. Using public key cryptography, your computer can verify the authenticity of messages coming from other computers, so as long as the publisher's private keys aren't stolen, you can trust that the address really came from the publisher :)
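The sign-and-verify idea can be sketched in Python with a toy "textbook RSA" key. This is only an illustration: the key is far too small to be secure, there is no padding, and it is not the record format or key type that IPNS really uses. The constants and the `digest`, `sign` and `verify` names are assumptions invented for this sketch, and the forged address is a made-up string. It needs Python 3.8 or newer for the modular inverse.

```python
import hashlib

# Toy RSA key built from two well-known Mersenne primes.
# Assumption for illustration only: much too small and too simple for real use.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, known only to the publisher

def digest(message: bytes) -> int:
    """Hash the message and reduce it to a number smaller than N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes, private_exponent: int) -> int:
    """Only the holder of the private exponent can produce this value."""
    return pow(digest(message), private_exponent, N)

def verify(message: bytes, signature: int) -> bool:
    """Anyone can check a signature using only the public values N and E."""
    return pow(signature, E, N) == digest(message)

# The publisher signs the address their name should point to.
record = b"/ipfs/QmTkzDwWqPbnAh5YiV5VwcTLnGdwSNsNTn2aDxdXBFca7D"
signature = sign(record, D)

# An impostor swaps in another address but cannot make a matching signature.
forged = b"/ipfs/some-other-address-chosen-by-the-bad-guy"

print(verify(record, signature))
print(verify(forged, signature))
```

The point of the sketch is the asymmetry: checking needs only public information, while producing a valid signature needs the private key.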
IPFS can be understood as a protocol, a set of protocols or their implementation. So yes, IPFS is also a suite of tools that enable machines and users to take advantage of the protocol.
IPFS can interface with HTTP because each node can expose an HTTP gateway. Now you can tell your machine where the gateway is using HTTP (for example http://gateway.ipfs.io) then you tell the gateway what object you want to get from IPFS or IPNS (for example http://gateway.ipfs.io/ipfs/QmTkzDwWqPbnAh5YiV5VwcTLnGdwSNsNTn2aDxdXBFca7D/example). This way your machine contacts the gateway (at gateway.ipfs.io) and then the machine running the gateway can fetch the data from IPFS for you and reply over HTTP. Of course the path from your machine to the gateway still uses HTTP and not IPFS, but this allows compatibility with applications and systems that don't support IPFS yet.
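As a small illustration of how a gateway address is put together, the Python sketch below joins a gateway location and an IPFS or IPNS path into one HTTP URL. The `gateway_url` function is a hypothetical helper written for this document, not part of any IPFS tool, and it does not contact the network.

```python
def gateway_url(gateway: str, path: str) -> str:
    """Join a gateway base URL with an /ipfs/... or /ipns/... path.
    (Hypothetical helper for illustration only.)"""
    if not path.startswith(("/ipfs/", "/ipns/")):
        raise ValueError("path must start with /ipfs/ or /ipns/")
    # Drop a trailing slash on the gateway so we don't end up with '//'.
    return gateway.rstrip("/") + path

url = gateway_url(
    "http://gateway.ipfs.io",
    "/ipfs/QmTkzDwWqPbnAh5YiV5VwcTLnGdwSNsNTn2aDxdXBFca7D/example",
)
print(url)
```

The first half of the URL says where the gateway is (location), and the second half says what you want (content or identity).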
Gateways could be built for any protocol, but since pretty much everything can interface over HTTP, HTTP is enough for now.
A distributed application is a network application that uses distributed networking to store information and communicate with other instances of the application and thus doesn't require any centralized system or infrastructure to work. An example of a popular distributed application is any BitTorrent client such as qBittorrent or Vuze.
Imagine a classroom with twenty students, their laptops and the professor, who wants to share a 50 MB Video with the class. The professor has many options:
Options 1 and 3 are too cumbersome, but option 2 is so slow! Imagine 20 students downloading a 50 MB file at the same time from Dropbox or YouTube or anywhere else on the Web!
The good news is that IPFS solves all this.
Also when you download a file from the Web, IPFS can and will download it from multiple sources at the same time to speed up the download as much as possible.
IPFS uses content based addressing (you can find out what that means in the appropriate answer), which in short means that IPFS links tell your computer what data you want, and not where it is. When you download the data from another node, your computer can calculate the address of the downloaded data on its own, so that if it doesn't match what you requested, it knows it's not authentic.
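Here is a minimal Python sketch of that check. It assumes a plain SHA-256 hex digest as the "address", which is a simplification (real IPFS addresses are encoded differently), and the function names are made up for illustration.

```python
import hashlib

def address_of(data: bytes) -> str:
    """Toy address: the SHA-256 hex digest of the content (an assumption,
    not the real IPFS address format)."""
    return hashlib.sha256(data).hexdigest()

def accept_if_authentic(requested_address: str, received: bytes) -> bool:
    """Recompute the address locally and compare it with what was asked for."""
    return address_of(received) == requested_address

# You know the address of the data you want (for example from a link).
address = address_of(b"a cat picture")

# Two different nodes answer your request.
honest_reply = b"a cat picture"
tampered_reply = b"a dog picture"

print(accept_if_authentic(address, honest_reply))    # matches the request
print(accept_if_authentic(address, tampered_reply))  # does not match, so reject it
```

Note that the check never asks who sent the data. It only looks at the data itself, which is why it doesn't matter which node you downloaded it from.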
IPNS records are also signed using public key cryptography, so that your computer can verify on its own that an IPFS publication was actually published by the node with the ID written in the publication.
In short: no one can tamper with the data, and no one can impersonate anyone else unless the victim's private keys were stolen.
Anyone (in this case Bob) can serve and host his websites/files/data using IPFS, even on an old computer or a slow and limited connection. Once many people have local copies of the data on their nodes, they will help seed the data so that other people can download it faster, even if Bob has shut down his node.
Distributed computing or networking refers to a network or computer system where data is distributed on multiple machines.
This contrasts with centralized networking, which uses a single server or a small group of centralized servers. All data goes through them, never directly between clients. This makes it easier to control authenticity and the network in general, but it also exposes a central point of failure (the servers): when they are taken down, the network/application becomes useless.
Peer to Peer (p2p) is a way to organize data exchange in network applications. It lets every node in the network act as both a client and a server. Peer to peer is found in networks that are at least partly distributed, because it makes no sense in a fully centralized network. Peer to peer applications are complicated because they have to solve many problems:
privacy and authenticity are both solved in IPFS and many other networks using public key cryptography. TODO: how does ipfs handle flooding?
IPFS is peer to peer because every node can use other nodes to get data while also serving data to the network. Of course, a machine may only be used to hold copies of data so that availability is not an issue, but the same machine could also be used by a simple user, enjoying cat pictures and other important content over IPFS, while temporarily keeping a cached copy of his favorite celebrity blog that he read earlier, so that it can be served to other machines.
IPFS is also fully distributed because there is no central point of failure. The only non distributed part of the system is the bootstrap node list. The bootstrap node list is the list of nodes that IPFS connects to when it first starts, because it doesn't know anyone else yet! Then, nodes exchange their peer list, and more and more machines are discovered and connected together, strengthening the network. IPFS can remember nodes it connected to in the past, and the bootstrap list can be extended or changed, so it's almost impossible that your node can't get online. IPFS is also able to automatically find other computers running a node in local networks, so it works even WITHOUT the internet!
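The "start from the bootstrap list, then keep exchanging peer lists" idea can be sketched as a simple simulation in Python. This is not how IPFS discovers peers internally; the `discover` function, the peer names and the `peer_lists` table (who each node already knows) are all invented for illustration.

```python
from collections import deque

def discover(bootstrap, peer_lists):
    """Start from the bootstrap nodes and keep asking every newly found
    node for the peers it knows, until nothing new turns up."""
    known = set(bootstrap)
    to_ask = deque(bootstrap)
    while to_ask:
        peer = to_ask.popleft()
        for other in peer_lists.get(peer, []):
            if other not in known:
                known.add(other)
                to_ask.append(other)
    return known

# Hypothetical network: each node and the peers it would tell us about.
peer_lists = {
    "bootstrap-1": ["alice", "bob"],
    "alice": ["carol"],
    "bob": ["alice", "dave"],
    "carol": [],
    "dave": ["erin"],
    "zed": ["alice"],  # knows alice, but nobody we can reach knows zed
}

found = discover(["bootstrap-1"], peer_lists)
print(sorted(found))
```

In this toy network a single bootstrap node is enough to find five other peers, but "zed" stays undiscovered because no reachable node mentions it.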
IPFS will announce its presence over its local network, so that other computers running IPFS can receive the message and connect to your machine. This means that IPFS is able to automatically build a local network if the computers are all in the same LAN and can communicate. So if a file is requested and someone in the LAN has it, it will still work. Services such as an instant messaging service (think WhatsApp) built on IPFS will still let you chat with people on your LAN if the Internet is not available.
This also means that IPFS could be adapted to run over any kind of communication channel, for example Bluetooth.
The Internet does make it possible for every computer to communicate (sometimes not directly, due to issues such as NAT), but that doesn't mean they automatically do. On the Web, your computer only connects to another if you tell it to, so if you want to see a website you have to tell your computer where the website is.
IPFS is built on top of the Internet and uses it to automatically find other computers running IPFS and connect with them. When you need something, you don't tell IPFS where it is; you tell it what you want by providing the object hash (think of it like a fingerprint of the file/folder/app you want). IPFS then asks its peers until it finds one that has it, and retrieves it for you. This has many advantages, better detailed in the answer to the What does addressed by content and identities mean? question.
TODO: what does system of files mean?
A blockchain is a specific kind of distributed database, which could also be built over IPFS. You can think of a blockchain as a huge stack of papers, one on top of another. When you want to make a change to the database, you just add a paper on top that describes the changes you are making.
This system is useful in distributed networks that share a database, like the Bitcoin network. You can't double spend bitcoins (obviously), but why is that? How is it possible? If they're digital, why can't we copy them and make more?
There are no bitcoins, actually. There is only a blockchain. When Joe places a signed document on top of the blockchain saying that he wants to move 50 bitcoins to Amazon's wallet to buy a new video game, it won't work: all the nodes in the network check whether the transaction is valid and, by looking at all the other documents, it's clear that Joe only has 10 bitcoins, based on how many were added to and removed from his wallet. So the network agrees that the document is not valid and won't accept it into their copies of the blockchain, invalidating the transaction. Meanwhile, Alice successfully moves some funds, so all the other nodes see that her transaction is valid and copy it on top of their blockchains.
Thus, the chain grows, and the network keeps functioning, even though nodes can't trust each other. No one can impersonate you on the blockchain, because transactions where you send money have to be signed by your wallet using public key cryptography.
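The "stack of papers" and the balance check from the Joe and Alice story can be sketched in Python as below. This is a toy ledger, not Bitcoin: signatures, mining and the network are left out, the `mint` sender is a made-up way to give accounts their starting coins, Alice's amounts are invented, and all function names are assumptions for illustration.

```python
import hashlib
import json

def block_hash(block) -> str:
    """Fingerprint of one 'paper' in the stack."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def balance(chain, account) -> int:
    """Replay every paper in the stack to work out what an account holds."""
    total = 0
    for block in chain:
        if block["to"] == account:
            total += block["amount"]
        if block["from"] == account:
            total -= block["amount"]
    return total

def try_append(chain, sender, receiver, amount) -> bool:
    """Add a paper on top only if the sender can afford the transfer.
    'mint' is a hypothetical source of new coins for this sketch."""
    if sender != "mint" and balance(chain, sender) < amount:
        return False
    previous = block_hash(chain[-1]) if chain else None
    chain.append({"from": sender, "to": receiver, "amount": amount, "prev": previous})
    return True

chain = []
try_append(chain, "mint", "joe", 10)
try_append(chain, "mint", "alice", 30)

joe_ok = try_append(chain, "joe", "amazon", 50)      # Joe only has 10
alice_ok = try_append(chain, "alice", "amazon", 20)  # Alice can afford this

print(joe_ok, alice_ok, len(chain), balance(chain, "amazon"))
```

Each paper also records the fingerprint of the paper below it (`prev`), so nobody can quietly swap out an old paper without changing everything stacked on top of it.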
Since IPFS is content addressed (see what does content-addressed mean? answer), every copy of the same file produces the same address when added to IPFS. So as long as there is at least one little computer with one copy of the file and a connection to other computers running IPFS, the file will be available. If Joe is hosting his blog from home and the electric company shuts down his electricity because Joe is broke, his fans will still be able to view the website (unless he didn't use IPFS. Way to go, Joe) because someone, somewhere will have a copy of it since they viewed it recently and the address will be the same because the content is the same.
A hashtable is a table with two columns and a row for every object. The first column is the identifier (usually a hash, but not always), used to identify the data in the second column. Using this hashtable, nodes can find the correct object when other nodes give them the identifier.
A hash is a little piece of data, like
QmTkzDwWqPbnAh5YiV5VwcTLnGdwSNsNTn2aDxdXBFca7D, that is calculated mathematically from another piece of data, and it represents its fingerprint.
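A tiny Python sketch of both ideas, the fingerprint and the two-column table, is shown below. It assumes SHA-256 hex digests as the hashes, which look different from the `Qm...` strings IPFS uses, and it keeps the whole table on one machine, whereas a distributed hashtable spreads the rows across many nodes. The `fingerprint` name and the sample objects are made up for illustration.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Calculate a fixed-size fingerprint from any piece of data."""
    return hashlib.sha256(data).hexdigest()

# The table: first column is the hash, second column is the object.
table = {}
for obj in (b"cat picture", b"blog post"):
    table[fingerprint(obj)] = obj

# Someone gives us only the identifier; we look up the matching object.
key = fingerprint(b"cat picture")
print(table[key])

# The same data always gives the same fingerprint; a tiny change gives another one.
print(fingerprint(b"cat picture") == fingerprint(b"cat picture"))
print(fingerprint(b"cat picture") == fingerprint(b"cat picturf"))
```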
Using public key cryptography, nodes can emit signed messages so that other nodes can be certain that those messages weren't tampered with (modifying a signed message makes signature verification fail). This means that nodes can trust data, even if it's coming from someone that is not the original source, as long as it has a valid signature.
A point of failure in a network is a machine that, when shut down, malfunctioning or compromised in any way, gravely cripples the network, often rendering it totally useless until the point of failure works as it should again. Imagine Twitter's servers shutting down right now. In a moment, nobody would be able to read tweets, write tweets or use Twitter in any way!
IPFS doesn't have central points of failure because it's decentralized (you can learn more in the answers to questions related to decentralization). The only point of failure in IPFS is the bootstrap nodes: they are the first nodes that your computer connects to when running IPFS. However, once your computer finds more nodes by talking with the bootstrap nodes, it will remember them, so that even if the bootstrap nodes are not available (almost impossible, since they are many independent machines across the globe), the network still works for you (but new users wouldn't be able to join). This risk can be reduced by using a bigger bootstrap node list and hardening the bootstrap nodes against network limits, attacks and vulnerabilities.
A node is a machine running IPFS. It could be a standard desktop computer, a laptop, a mobile device, anything that can run programs and connect to a network really. Nodes never trust each other unless there is mathematical proof that a message, statement or other information is to be trusted: you can learn more by searching mentions of public key cryptography in this textbook and read the contextual answer and question. Anyway, using public key cryptography and cryptographic hashes, your node can prove the authenticity and validity of data received from other nodes.