A chart is a set of coordinates
When you make a chart, you start with an empty two-dimensional space with a vertical dimension (y) and a horizontal dimension (x). You also have a data source. Your job is to translate the data into distances and plot the data points so that their relative distances are preserved. You plot the data points using pairs of (x, y) coordinates:
(This is data on food availability from the US Department of Agriculture. You can get the file here.)
In this example there is only one dimension, so we’ll plot the data along one of the axes and assign zero to the other dimension. We’ll use the y-axis and set every x-coordinate to zero.
This is a small dataset, and you may be tempted to believe that there isn’t much to gain with a chart. Maybe not, but this example clearly shows one of the key differences between charts and tables. In the table the data is sorted by meat source (red meat, poultry, fish), while in the chart the data itself is the sorting key, and you see a high-availability group (beef, chicken and pork) and a low-availability group (fish & shellfish, turkey and veal & lamb).
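As a rough sketch of that translation, the Python snippet below turns a small table into (x, y) pairs with x fixed at zero, then lists the points in the order they would appear along the y-axis. The numbers are hypothetical placeholders, not the USDA figures, and the name `to_origin_chart` is made up for this illustration.

```python
# Hypothetical availability values, listed in table order (red meat, poultry, fish).
# These are NOT the USDA figures; they only illustrate the idea.
availability = {
    "Beef": 60.0,
    "Pork": 48.0,
    "Veal & lamb": 1.5,
    "Chicken": 58.0,
    "Turkey": 13.0,
    "Fish & shellfish": 16.0,
}


def to_origin_chart(data):
    """Map each value to an (x, y) pair: x is fixed at 0, y carries the value."""
    points = [(label, (0.0, value)) for label, value in data.items()]
    # Sorting by y reproduces the order in which the points sit on the axis.
    return sorted(points, key=lambda item: item[1][1], reverse=True)


points = to_origin_chart(availability)
for label, (x, y) in points:
    print(f"{label:<18} x={x:.0f}  y={y:.1f}")
```

The table keeps its own order, while the list of points is ordered by the data. That is the difference between the two views described above.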
We have a basic visual translation of our data set. We’ll call it the origin chart. Where do we go from here? Obviously we need to annotate the chart: identify the units, the source, the year, add a title. And since we have a lot of available space, we could add more data (for example, add data for the last 50 years). Let’s assume that there is no more data, though.
A chart type is a specific set of transformations we make to the origin chart. These are design changes, not changes to the data itself, but they may have a profound impact on the way we see the chart and the insights we gain from it.
A well-known set of transformations is the column chart. If we have a qualitative scale (nominal, ordinal…) we can spread the data points evenly along the x-axis and draw a rectangle between each data point and the x-axis. This is a design option, since there is no x-coordinate. If we have a quantitative scale, data points along the x-axis may not be evenly spaced (depending on the data):
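One way to picture the column chart transformation for a qualitative scale is as plain geometry. The sketch below is only an illustration: the function name `column_rectangles`, the default width and the sample values are all assumptions. Note that the x positions come from the order of the categories, not from the data.

```python
def column_rectangles(labels, values, width=0.6):
    """Build one rectangle per category, evenly spaced along the x-axis."""
    rects = []
    for position, (label, value) in enumerate(zip(labels, values)):
        rects.append({
            "label": label,
            "left": position - width / 2,  # spacing is a design choice, not data
            "bottom": 0.0,                 # every rectangle starts at the x-axis
            "width": width,
            "height": float(value),        # only the height carries the data
        })
    return rects


# Hypothetical values, for illustration only.
labels = ["Beef", "Pork", "Chicken"]
values = [60.0, 48.0, 58.0]
for rect in column_rectangles(labels, values):
    print(rect)
```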
Do you think that this improves the insights we get from the data? I don’t think so. The chart now takes up much more space to present the same data, and it uses the same sorting key as the table. If you want a different sorting key you must sort the table itself. You need to sort it to confirm that beef has the highest availability. That’s easy to see in the origin chart but harder to check in the column chart.
We can keep making changes to the origin chart to get other chart types. Stack the data points, add rectangles and you get a stacked bar chart. Bend the bars and you get a doughnut chart:
Remove the hole and you get a pie chart.
By definition, a point has no dimensions. When you replace it with a rectangle you add one dimension (a rectangle in a bar chart is just a thick line). But you can add one more dimension and now you have areas like the treemap below (made with ManyEyes):
And if you add one more dimension you get a volume. Both areas and volumes maintain the relative sizes.
Problem is, it’s harder to compare them. Let me give you an example. Which one of the rectangles on the top of the treemap is bigger? How much bigger? Now try it with a bar chart. How long does it take you to tell the difference?
This is one of the best-known Tufte principles: “The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data”. In this case, you have a single series and you are changing both height and width, so it’s almost impossible to evaluate the data points correctly.
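A little arithmetic shows why areas are harder to read than lengths. The sketch below assumes the simplest case, where each value is drawn as a square whose area is proportional to the value. Real treemap cells have varying proportions, which makes the comparison harder still. The helper name `side_scale` is made up for this example.

```python
import math


def side_scale(value_ratio):
    """Factor by which each side of a square grows when its area grows by value_ratio."""
    if value_ratio <= 0:
        raise ValueError("value_ratio must be positive")
    return math.sqrt(value_ratio)


# Compare how the same change in value shows up as a length and as a square side.
for ratio in (1.25, 2.0, 4.0):
    print(f"value x{ratio:.2f} -> bar length x{ratio:.2f}, "
          f"square side x{side_scale(ratio):.2f}")
```

In a bar the whole difference shows up in one length. In a square it is split between two, so each side changes less than the value does.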
A chart is a transcription
A chart is a visual representation of an underlying table, a transcription between two sign systems. Each system has its strengths, but the meaning must be kept. This doesn’t happen in this column chart:
If you do not read the labels you will assume that pork availability is less than half that of beef or chicken. Now check the labels: pork availability is 80% of beef. The error is due to the wrong format of the y-axis. We’ll discuss this in the chapter about scales.
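To see how an axis setting can produce this kind of error, here is a small sketch. It assumes the distortion comes from a y-axis that does not start at zero, which is a common cause but is not confirmed for this particular chart. The values and the axis minimum of 65 are hypothetical, chosen only so that pork is 80% of beef as in the labels.

```python
def drawn_ratio(value, reference, axis_min=0.0):
    """Ratio of two column heights as drawn when the y-axis starts at axis_min."""
    if axis_min >= min(value, reference):
        raise ValueError("axis_min must be below both values")
    return (value - axis_min) / (reference - axis_min)


beef, pork = 100.0, 80.0  # hypothetical values: pork is 80% of beef

honest = drawn_ratio(pork, beef)                  # axis starts at zero
truncated = drawn_ratio(pork, beef, axis_min=65)  # hypothetical truncated axis

print(f"axis from 0:  pork column is {honest:.0%} of the beef column")
print(f"axis from 65: pork column is {truncated:.0%} of the beef column")
```

With the axis starting at zero the columns keep the 80% ratio. With the truncated axis the pork column is drawn at less than half the height of the beef column, even though the data has not changed.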
There is never a direct and exact transcription. Something is always lost. That’s not a problem in data visualization, because usually benefits clearly outweigh the costs. But I would argue that the more transformations you apply to the origin chart the riskier your transcription is.
But why do we need this transcription in the first place? If you spend enough time analyzing a table you’ll find correlations and spot outliers. However, you’ll see them in a split second when you look at the chart. So, the chart is much more efficient because our brain is much better at processing visual data.
The whole point of using charts is to discover patterns that make sense to you. This is important. There are obvious patterns that everyone can see, but not everyone can read them and understand what they mean.
A chart is a compression algorithm
Suppose you have a large dataset and you are unable to make sense of it without some tools. The tools you are going to choose are designed to simplify the table and find the key messages.
You know that this dataset contains answers to many questions. Unfortunately, you can’t simply funnel those records into your brain and let it analyze them. There is not a large enough channel to do that. So we assume that we must compress the data somehow.
You can take a radical approach and calculate averages for each field. It may work. Suppose your dataset contains a list of one million women and their age when their first baby was born. I would have to check, but I would say that this metric is much better at describing African women than US women (I’m assuming a higher variation in the US).
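The point about variation can be illustrated with two made-up samples that share the same average. This is only a sketch: the ages below are invented and do not describe any real population. It uses `mean` and `pstdev` from Python’s standard `statistics` module.

```python
from statistics import mean, pstdev

# Two invented samples of "age at first birth" with the same average.
low_variation = [21, 22, 22, 23, 22]
high_variation = [16, 19, 22, 25, 28]

for name, sample in (("low variation", low_variation),
                     ("high variation", high_variation)):
    # The mean compresses the sample to one number; the spread shows what was lost.
    print(f"{name:<15} mean={mean(sample):.1f}  spread={pstdev(sample):.2f}")
```

Both samples compress to the same average, but that single number describes the first sample far better than the second.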
When you apply a compression algorithm you lose information. This image:
is a low resolution version of this image:
It’s the same picture, but the image above is so compressed that much of the information is lost. You can’t even say what the picture is about. But you can store thousands of pictures like this on your pen drive. On the other hand, you can only store a few pictures like the one below, but it is much richer and you have a much better idea of what the picture is about. So there is a trade-off between noise (what you are willing to lose) and signal (what you deem important and want to keep).
When you choose a tool to simplify your dataset you choose the best tool for the task at hand. It doesn’t matter if it’s a chart or some descriptive statistics. That’s why a question like “a table or a chart?” doesn’t make much sense without knowing what they will be used for.
Please note that I’m not saying that you just need to find the right tool and then funnel the result into your brain. It doesn’t work like that. Your brain is not a passive receiver: it is full of knowledge, memories of past experiences, cultural values. It actively adds new meanings to the original message. Your message lies in the eye of the beholder.
Takeaways
- A chart is a visual representation of distances between data points.
- A pattern is the way we group or connect data points.
- A chart type is a set of transformations we apply to this basic layout.
- These are (optional) graphic design transformations: we can use them to improve pattern discovery or user engagement.
- A chart is a transcription from one sign system into another: something is lost and something is gained, but it’s your job to ensure that the message is not corrupted in the process.
- A chart improves the signal-to-noise ratio; signal and noise are defined by the task at hand.
- A chart is always subjective: it reflects how you and the reader see the world.
Please share your comments below.
__________
I’m writing these pages to create a consistent approach to data visualization for Excel users.
What a great start. I’m looking forward to the next posts already.
Thanks Jamie, it will be available soon.
Good start.
Absolutely fantastic idea, and I feel very impatient for the future posts. There’s no doubt it will be a first-class reference for everyone who is into Excel and charting.
Awesome start, Jorge..
Great one! Nice job!
Very nice. Looking forward to subsequent chapters.
Thank you.
Very good. Very clear!!
excellent introduction. i look forward to future “pages.”
btw, i am guessing that you used voice recognition software to dictate this page. there are at least two typos (“a don’t think so”, “dim important”) that are homonymic.
Fantastic first piece. Can’t wait for the final version 🙂
Writing a book is a great experience and a great way to pass on knowledge to others.
All the best
Bernard
Thanks Marc-Paul. Fixed!
Hey Bernard, long time no see! Thanks for stopping by!
Great article and an excellent beginning of your New Year’s resolution!
I believe your mission is spot on! If you want to understand data visualization – and the majority of us on the ‘invisible side’ really need to – such a resource on the ground principles for transforming data into meaningful information is a great help, and Excel is the perfect tool to start with – right here at our fingertips.
Looking forward to the next articles – 2012 will be a great year!
There is a real need for this book. My question is why introduce another term for what you call origin chart. In The Elements of Graphing Data, Bill Cleveland calls them one-dimensional scatter plots. In Creating More Effective Graphs, I follow Cleveland’s terminology and also mention the term strip plot which is used by some software such as R and S-Plus. In Information Graphics: A Comprehensive Illustrated Reference, Robert Harris calls them one-axis data distribution graphs as well as one-dimensional data distribution graphs. Do we really need another term?
Thanks for your comment, Naomi. I’m generally comfortable with current terms and I don’t want to confuse things by giving new names to existing charts. “One-dimensional scatter plot” defines a chart type. With “origin chart” (or “designless (?) chart” or “pre-chart”) I mean what you get when you add the data points to the Cartesian plane. You can turn the “origin chart” into a one-dimensional scatter plot or into a different chart type, but that’s a second step.
We’ll see in future pages why this is important. Here I just wanted to make users aware of this fundamental level where the chart is coming to life but it is nothing more than distances between data points.
Hi Jorge
Thanks for introducing me to the ‘art of chart’. Your articles always make me reconsider what it means to present information. In the end it’s all about telling a story quickly. This dissection of the mechanics of a chart really helps.
Thanks – Adam (dimodelo.com)
Hi Jorge, a great idea. I basically try to do the same thing on my blog, http://www.visualquest.in. Can I just suggest that you keep it simple (the language, I mean)? Warren Buffett had the right idea when he said to write as if you are writing to your sister, and he offered his own sister, Doris, to write to if you do not have one. See A Plain English Handbook: How to create clear SEC disclosure documents, available at http://www.sec.gov/pdf/handbook.pdf
Do I follow my own suggestion? Sometimes laziness gets the better of me.