How to encrypt your emails with PGP keys

by Moritz Zajonz

Important Notice: Considering the recent disclosure of vulnerabilities in popular e-mail clients like Mozilla Thunderbird, we decided to delete this post. The current PGP implementations in email clients have vulnerabilities that haven’t been fixed yet and will take time to fix. For more information about the technical side, visit efail.de, and for a detailed explanation, read the post by the Electronic Frontier Foundation. Thanks for your interest in this topic! We will update this post when new info is available.

 {Credit for the awesome featured image goes to Phil Ninh}

Adding tooltips to SVG graphics


JavaScript libraries and other tools offer cool ways to visualize data, but sometimes you may want an even more customizable way of presenting a topic on the web. Maybe you already have the perfect graphic, but it’s not interactive yet. In this tutorial, we’ll show you a way to add tooltips to your SVG graphics.

One example of an interactive graphic a library can’t give you is the following site plan of the Düsseldorf Rhine Funfair with its attractions and food stands, which I built for the Rheinische Post. You can find additional information (in German) as tooltips when hovering over some of the attractions.

The basis for the map is a simple image I created with the design program Sketch. The trick I used to add tooltips to some of the map elements doesn’t require much coding and can be used to add tooltips to every website element including divs, text and pictures.

So let’s get right to it!
In this tutorial, we will add tooltips to a very simple example SVG file you can download here. SVG stands for Scalable Vector Graphics. It’s the name of a file format where every image object is described not by the colour of its individual pixels, but by its basic shapes and their attributes. A big advantage SVG graphics have over other formats like PNG or JPEG is that they’re scalable, just like their name says, so they never get blurry, no matter how much you zoom in. One way to create SVG files is to use a design program like Illustrator, Inkscape or Sketch.

Pro tip: RStudio lets you export charts in this format, too. Pimp up boring charts with some creative elements like I did with this stacked bar chart on Germany’s reasons for trading trash:

The following screenshot shows the SVG image we are going to add tooltips to. It contains a big rectangle with a turquoise frame as the background, a red circle, a yellow star and a blue triangle.

[Screenshot: the example SVG image]

If you save the image as an SVG file and open it with a text editor, this is what you’ll get:

This is not what you’d typically expect an image to look like. And this is what makes SVG awesome: It translates your artwork into a snippet of XML code. XML files, or Extensible Markup Language files, look a lot like HTML. They’re not, though: XML was developed as a way to store data, while HTML is responsible for displaying data, mostly on the web. XML doesn’t have predefined tags like <a> or <body>, like HTML does. In XML, you’re entirely free to make up your own tags to describe different parts of your data.

SVG graphics still follow a very specific XML structure, and that structure works like a charm inside an HTML page. Try putting the SVG code into your website’s body. Set the SVG width to 100%, delete the height settings and save the file as index.html. It should look like this:

Et voilà: The static image is ready for the web! Your browser can interpret the XML code of your SVG image easily and convert it back to a beautiful graphic.

 

Adding interactivity

Great! Now, we will use the jQuery plugin Tooltipster to add information to this SVG! The plugin has nice presets and really good documentation. These three steps will lead you to your first interactive SVG:

Step 1: Download the Plugin

Create a new folder, name it svg_interactive (for example) and store the index.html file in it. You can find and download the plugin’s scripts and style sheets on our GitHub repository or on the plugin’s webpage. The main folder contains plenty of subfolders and documents you’ll barely need. For our example, we’ll only use tooltipster.bundle.min.js, tooltipster.bundle.min.css and, for a little bit of extra style, the theme sheet tooltipster-sideTip-punk.min.css. You can either save the whole tooltipster folder in svg_interactive and then set the correct paths to those files in your HTML head, or store only the three files you’ll actually use in your folder, as we did in the screenshot.

Step 2: Include the scripts

Set the paths to the style sheets and the JS script in the head of your webpage. Don’t forget to include jQuery, since Tooltipster depends on it:

Step 3: Add information to the image elements

Next, we’ll add a small snippet of JavaScript code to the head manually. You could also save these few lines in an extra .js file and include them like the others, but for now, we’ll just keep it as an inline script.

This snippet defines a tooltip class with the theme tooltipster-punk and specifies that the content will be HTML. If you add

to an SVG element it will get a tooltipster-punk themed tooltip with HTML styled text. And that’s basically it!

For example, try:

Now, if you want to change the tooltip’s style, you can save some CSS settings in a new stylesheet and include it in the webpage’s head. You can change the font family, the color of the popup background, the popup arrow or the default pink border-bottom, for example, like this:

Save the file as style.css and include it in the .html file. You can also change the trigger for opening and closing the tooltips in the manually added JavaScript. You can set it to hover or click, or optimize it for mobile devices with a customized trigger like this:
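As a rough sketch of such a customized trigger, assuming Tooltipster 4 and its trigger: 'custom' mode with the triggerOpen and triggerClose options: the helper name buildTooltipOptions is made up for illustration, and the .tooltip selector simply follows the tooltip class mentioned above. It is not necessarily the exact configuration used for the example.

```javascript
// Sketch only. `buildTooltipOptions` is a hypothetical helper, not part of Tooltipster.
// The option names assume Tooltipster 4's `trigger: 'custom'` mode.
function buildTooltipOptions() {
  return {
    theme: 'tooltipster-punk', // the theme used in this tutorial
    contentAsHTML: true,       // interpret the tooltip content as HTML
    trigger: 'custom',         // replace the 'hover' / 'click' presets
    triggerOpen: {
      mouseenter: true,        // mouse: open when the pointer enters the element
      touchstart: true         // touch screens: open when the element is touched
    },
    triggerClose: {
      mouseleave: true,        // mouse: close when the pointer leaves
      click: true,             // mouse: close on a click
      tap: true                // touch screens: close on a tap
    }
  };
}

// Only runs in a browser where jQuery and Tooltipster have been loaded.
if (typeof jQuery !== 'undefined' && jQuery.fn && jQuery.fn.tooltipster) {
  jQuery(document).ready(function () {
    jQuery('.tooltip').tooltipster(buildTooltipOptions());
  });
}
```

Keeping the options in one function makes it easy to try different combinations of open and close events without touching the rest of the page.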

And this is the result: Click on it, hover over the symbols or try it on your smartphone!

So be creative with your data visualizations, even if complex JavaScript libraries aren’t your thing.

As always, you can get the entire code for this example on our GitHub page. If you have any questions or feedback, feel free to leave a comment or mail us!

JavaScript: Coding marker maps with Leaflet.js


Showing locations on a map can be pretty cool to provide some context for your story or to give your reader an overview of where the story takes place. A good way to build a simple, yet responsive and professional looking map is to use the JavaScript library Leaflet.

In an earlier post, “Your first choropleth map“, we used Leaflet as well, but coded the map using the Leaflet R package, which works like a wrapper to translate the more common Leaflet functions into R syntax. It’s very useful if you’re more used to R syntax and don’t want to learn JavaScript anytime soon. But using the original JS library and coding the map with JavaScript will give us way more freedom when customizing the map, which is why we’ll try it that way today.

As an example, let’s start with a map of the locations of some data journalism newsrooms in the German-speaking area. As always, you can find all the code for this tutorial on our GitHub page.

This is what the finished map will look like:

HTML, CSS & a little JavaScript: The Basics (Part II)


Part I

This is part two of our tutorial on HTML, CSS and a little bit of JavaScript. In the last part, we learned about the basic functions of those three languages and have gotten to know a few useful HTML commands. If you’ve already read part one or you know all of that stuff anyway, this is the perfect spot for you to continue – by learning about CSS and how to implement JavaScript libraries into your webpage. Let’s get right to it!

CSS

CSS is short for Cascading Style Sheets. It’s a neat way to tell your browser how it should make your webpage look. You can style your HTML code in different ways.

For small adjustments, it might be enough to use inline style. With this method, you specify the style adjustments for specific elements right in the element’s tag. For example, we talked about the <span> element in part one. This is how you would use <span> and some inline style to colour part of your sentence:

Here, style is used as an attribute. You can specify multiple things in one inline style attribute, like specifying style="color:red;text-align:right" to change text alignment and color in one go, but make sure to put quotes around the entire thing or it won’t work.

 Of course, there are a lot more possible settings than just the text color:

  • You’ve just learned what color  does. You can choose from the default color names or specify by hexadecimal code or rgb color code.
  • background-color  sets the background color of the current element. That could be some text or an entire paragraph, a section defined by a div or even the whole body of the document.
  • font-family  specifies the font for an element.
  • font-size  specifies the size of the current text element.
  • text-align  lets you choose to align the text left, right, centered or justified. initial chooses the default value and inherit aligns the text according to the alignment of its parent element.

There’s more, but let’s just work with that for now. Try using inline style to change the above properties in the HTML document that you created in part one of this tutorial.

If you want to define a lot of properties or edit a lot of elements, inline style might be a little inconvenient. Thankfully, there is a different way: You could use the style tag. Anything you type between <style></style>  in your document will be interpreted as CSS commands. You’ll usually do that in the head, since it counts as meta information. But if you write some CSS code in the head of your document, how is your browser supposed to know which element you’re referring to?

That’s why there are things called IDs and classes. For every HTML element in your document, you can define one (or even multiple) classes and a unique ID. It will look like this:

We just gave the first div the classes red and pushright as well as the ID a1. The second one only has the class red, and the third one doesn’t have any class or ID at all. Classes are usually used for a larger number of elements that should have the same CSS properties. An ID should be unique within a page, which makes it a useful way to single out one specific element you want to manipulate.

Let’s use CSS in our <style>  tags to turn the text color of anything with the red class to red and to align right any element with the class pushright. The element with the ID a1 should be in bold print. Also, all divs without any classes should have the text color purple and be aligned left. It would look something like this:

If you copy and paste the above text into the <head> of your document, you should see the changes we wanted to apply the next time you refresh the file in the browser.

There’s one mystery though. The first style rule is supposed to apply to all <div>  elements and change their text color to purple. But two of our divs are red anyway. Of course, that’s because they have the red class, and apparently, the style rules for classes are stronger than those for general elements if there’s a contradiction.

In reality, it’s a little more complicated than that. There are whole tutorials centered around the so-called specificity rules of CSS that determine which style rules get applied and what place in the CSS hierarchy they occupy. If some rules don’t get applied even though you think they should, it’s probably a specificity issue. As a basic rule, though, we can remember this hierarchy, in order from highest to lowest priority:

  1. Inline style
  2. IDs
  3. classes
  4. general elements
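The ranking above can be turned into a small scoring sketch. The helper names specificity, compareSpecificity and winner are made up for illustration, the parsing only handles plain selectors like div, .red or #a1, and inline style (which beats all of them) is left out. Browsers implement the full rules; this only mirrors the ID > class > element order.

```javascript
// Hypothetical helper: scores a simple selector as [ids, classes, elements].
// It only understands plain selectors such as "div", ".red" or "div#a1.red".
function specificity(selector) {
  const ids = (selector.match(/#[\w-]+/g) || []).length;
  const classes = (selector.match(/\.[\w-]+/g) || []).length;
  // Strip IDs and classes, then count what is left as element names.
  const rest = selector.replace(/[#.][\w-]+/g, ' ');
  const elements = (rest.match(/[a-zA-Z][\w-]*/g) || []).length;
  return [ids, classes, elements];
}

// Compare two scores position by position: IDs first, then classes, then elements.
function compareSpecificity(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Of several conflicting selectors, return the one with the highest score.
function winner(selectors) {
  return selectors.reduce((best, s) =>
    compareSpecificity(specificity(s), specificity(best)) > 0 ? s : best);
}

console.log(specificity('div'));             // [ 0, 0, 1 ]
console.log(specificity('.red'));            // [ 0, 1, 0 ]
console.log(specificity('#a1'));             // [ 1, 0, 0 ]
console.log(winner(['div', '.red']));        // .red
console.log(winner(['div', '.red', '#a1'])); // #a1
```

This is why the divs with the red class stay red even though a rule for all divs asks for purple: the class selector scores higher than the element selector.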

Try to play around with these CSS settings a little. In the meantime, we’ll take a quick look at how to incorporate JavaScript libraries into your webpage and how to manage the files that make up your site.

JavaScript libraries

JavaScript is quite a powerful language that lets you manipulate pretty much everything about your webpage. It might get complicated pretty quickly though if you want to create more complex elements. Thankfully, you don’t have to do all the work yourself. There are lots of JavaScript libraries available online, bundles of functions that serve a specific purpose. If you want to code interactive maps for a webpage, you might be interested in Leaflet, for example. Highcharts or the very powerful D3 library are wonderful tools for creating all kinds of data visualizations. To use the functions provided in those libraries, you’ll have to import them into your current document.

Most libraries even provide tutorials on how to do that. The only things you actually need are the files where the functions of the library are defined. You’ll find the links to those on the webpages of the libraries. Once you know where the files are, you have two options for importing them. You can use a hosted version that the creators of the library have uploaded to their servers. In that case, just import them by typing

for CSS stylesheets or

for JavaScript files into the head of your HTML document. That’s all there is to it!

If you want to be on the safe side, you can also download the files from the links provided and save them on your own computer. If you do that, just replace the URL in the above commands with the path to the files on your computer. Not that scary, right?

File management: External files and file paths

While we’re on the topic of importing external scripts, let’s take a quick look at file management before we call it a day. Until now, we’ve coded our entire webpage in a single document. We’ve embedded CSS code with the <style> tags and JavaScript with the <script> tags. Once you build more complex pages, that might mean your document will get pretty lengthy and confusing. There’s an easy way out, though: Just save your CSS code as a .css file and your JavaScript bits as a .js file. It’s helpful to create a new folder to keep all the files in one place. The HTML document then serves as the hub that combines all the files into one page, which is why it’s customary to name your main HTML document index.html.


Then, you can refer to the files in the same way we explained above for libraries – after all, they’re nothing but scripts and style documents themselves.

If you put the links to the files in the same place we put the scripts and style rules before, there should be no difference between separate and joint files.

When specifying the file paths, you don’t have to use the complete paths. It’s possible to do so, but it’s not advisable: if you move your folder or decide to deploy your page to a server instead of keeping it on your computer where only you can see it, the paths change and your index.html won’t be able to locate the files it needs. Instead, you can use the relative file path, meaning the path relative to the current location of your index.html. So if you keep all your files in one folder, it should be enough to specify only the file name. That will make the browser look in the current directory. If your files are in a subfolder, just specify the path from index.html to the file with "subfolder1/subfolder2/filename.css", for example. If the file is located in a directory above, ".." is what you need to type to get to the parent directory of the current one. So if (for some reason) your files are organized like this:

[Image: example folder structure]

Then what you need to type to get from your index.html file to your script.js file is:
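How ".." and subfolder paths resolve can also be checked with a small sketch. The folder layout below is hypothetical (it is not the one from the image), and resolve is a made-up helper name; it relies on the standard URL constructor, which resolves a relative path against a base much like the browser does for src and href.

```javascript
// Hypothetical layout, chosen only for illustration:
//   /site/pages/index.html
//   /site/js/script.js
const page = 'file:///site/pages/index.html';

// Resolve a relative path against the location of the page.
function resolve(relativePath, base) {
  return new URL(relativePath, base).pathname;
}

console.log(resolve('style.css', page));       // /site/pages/style.css
console.log(resolve('css/style.css', page));   // /site/pages/css/style.css
console.log(resolve('../js/script.js', page)); // /site/js/script.js
```

A bare file name stays in the folder of index.html, a subfolder path goes down from there, and each ".." climbs one level up before the rest of the path is followed.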

 

That’s all, folks!

That’s it for today. We hope you got a vague idea of what to google if you want to learn about web development. This is a wonderful step on your journey towards becoming a proficient web developer – or, at least, towards understanding what the heck those proficient web developers are talking about. We’re going to dive a little deeper into the world of HTML, CSS and JavaScript in our next tutorials. You’re going to get to know some interesting libraries and tutorials on how to build webpages and interactive graphics yourself. So we hope you stick around for a while! In the meantime, feel free to comment or join our slack team if you have any questions or if you just want to say hi. See you soon!


 

{Credits for the awesome featured image go to Phil Ninh}

HTML, CSS & JavaScript: The Basics (Part I)


Part II

Becoming a proficient web developer is hard — but understanding the basics isn’t. So this is what we’ll do today. By the end of this tutorial, you should have an idea of what people mean when they talk about HTML, CSS and JavaScript.

In this part, we’ll talk about the purpose of those three and learn a bit of basic HTML. In part two, we’ll learn a little more about CSS and JavaScript, especially about the use of JavaScript libraries, and how to combine all three to build a website. So let’s do this!

HTML, CSS and JavaScript are what most webpages are made of. They work together, each with a specific role. While you write a webpage, it will basically just look like a text document. You could write it in any text editor of your choice, although it’s probably a good idea to use a code editor like Atom or Sublime Text that will help you with formatting and code completion.

Once you’ve written the page, your browser is where the magic happens. It interprets the code you wrote in your .html, .css and .js text files and constructs a fully functioning, beautiful website from it – given there’s no errors in your code, of course. But that’s what we’re here for today.

HTML, or HyperText Markup Language, is the backbone of any webpage. It determines its basic structure and content. However, if you only use HTML, your webpage will end up looking like the very first pages from the early days of the web. No pretty formatting, no interactive elements. You’ll see it sometimes when your internet connection is very slow and your browser won’t load the page properly. It’s okay, but we can do better. That’s what CSS is for.

CSS is short for Cascading Style Sheets. This is the code that tells your browser how your webpage should look: What fonts and colors to use, what size your text should be, etc. But a pretty-looking website is not all you could want. You might want to interact with the webpage, click things, move stuff around and have it respond to your actions. HTML and CSS can’t do that very well, but JavaScript can.

JavaScript tells your browser how the page should behave. It’s what you write interactive graphics with, or pretty much any interactive element on your website. Compared to the other two, it’s a pretty complex language that you probably won’t learn entirely in a week or two. But that’s okay. For now, it’s enough to know what it does.

So if you only remember one thing from this tutorial, let it be this summary of what makes up a webpage:

HTML: Structure/content
CSS: Style
JavaScript: Behaviour

If your mind has a little more room left right now, then let’s take a closer look at how to build a webpage. As mentioned, webpages are nothing but text documents that your browser knows how to interpret. To write these documents, you can use any code editor of your choice. I use Atom, which works very nicely as long as you don’t try to view giant datasets.

We’re going to look at some useful commands in this tutorial. But of course, there’s lots more to discover. If you want to look up more stuff, W3Schools is a good place to start. You’ll find references for HTML, CSS, JavaScript and more. There are also some good MOOCs (Massive Open Online Courses) for web development. Take a look over at Coursera or Codecademy, for example.

It’s always a good idea to look at the code of other webpages and see how they work. To do so, right-click anywhere on a website. The details are different for each browser, but there will be an option like “inspect element” or “show source code”. If you click it, you should see a lot of code. This is the website as your browser sees it. Try to understand what’s going on. This is a good source of inspiration for your own pages as well.

While coding your own projects, it might also be helpful to take a look at the console of your browser. You can find it by clicking “inspect element” (or whatever corresponds to that in your browser) and choosing the tab named “console” in the window that opens. The console will show you an error message if something isn’t right with your page. That message can help you identify the problem, or at least give you something to google if you don’t understand the problem right away.

But you’ll get to that later. First, let’s get to know the basics.

HTML

Let’s start with some HTML code. If you want to code along, open up a new document in the code editor of your choice. Copy the code below into the file and save it as, for example, index.html (the .html part is the important one).

This is the structure each HTML document follows. The first thing you’ll notice is lots of angle bracket action. The words in between the angle brackets are keywords we call tag names. Together with the bracket, they make up a tag that serves a specific function. HTML content is always placed between a start tag and an end tag that begins with a slash, like this:

Most tags accept attributes that specify how they should behave:

We’ll get to know some examples of that soon.

Any HTML document starts with the declaration <!DOCTYPE html> that tells the browser what kind of document to expect (html in this case, duh). Then, there’s the <html>  tag. It doesn’t close until the very end of the page and makes it extra clear that anything in between will, in fact, be HTML code.

The two basic parts of your page are the <head>  and the <body> . The head contains all the meta information you want or need to make your website function properly. It can be used, for example, to give your website a name ( <title>), load necessary CSS information ( <style>) or JavaScript sources ( <script>), which we’ll get to later on. You can also set lots of other stuff like the language or the character encoding in the head.

Anything in the head will not be directly visible on the webpage, so if you want to finally add some content, take a look at the body of your page. This is where you put your actual content. There are quite a few options as to how to write your text and what media to incorporate into your page. Let’s take a quick look at a few important tags for that. If you want, copy them into your freshly made HTML document and play around with them a little. To see what they do, just save the file and open it in your browser (you can leave it open in your code editor as well, of course).

 

Formatting
  • <p>  stands for “paragraph”. Use it to write some important stuff onto your website, and don’t forget to close your paragraph with </p> . End tags are important for almost all the tags we’ll discuss here.
  • <h1> ... <h6>  are headings. The higher the number, the smaller the font of your heading will be.
  • <b>  or <strong>  is used to print text in bold. Both tags do essentially the same thing.
  • <i>  or <em>  are both used to write in italics.
  • <u> underlines text, but you can forget about that one instantly. On the web, it’s pretty much never a good idea to underline text that’s not a link, and those are underlined by default.
  • <br>  creates a line break. This is one of very few so-called void tags that don’t need a closing tag.
Lists
  • <ul>  creates an unordered list like the one you’re reading right now.
  • <ol>  creates an ordered list. It can be modified with some attributes. For example,  <ol start=343 reversed>  will start the list at number 343 instead of 1 and count down from there instead of up. Here you can find a list of all possible attributes.
  • <li>  creates a list element to populate the unordered or ordered lists you created.
  • Your shopping list in HTML might look something like this:
Media
  • Links: <a href="URL" target="_blank">Link text</a> The “a” stands for anchor, and is used to create a link. Put the URL you want to refer to in the href attribute. target="_blank" makes the link open in a new tab. If you leave it out, the link opens in the current tab.
  • Images: <img src="file path/URL" height="550" width="100%"> The img tag is another void tag, so you don’t need to type </img>  after it. Try experimenting with the width and height settings. If you specify them as just a number, the unit will default to pixels. By specifying percentages, you can scale the image according to the screen it’s viewed on. With the above specifications, for example, our picture will span the entire available width and be 550 pixels high. Note that if you specify both width and height, you might skew the proportions of your image. That is why sometimes, it’s better to only specify either the width or the height.
  • Audio: <audio src="file path/URL" controls autoplay loop> Sorry, your browser can’t handle this </audio> The audio tag lets you incorporate an audio file into your website. It’s probably a good idea to type in controls as an attribute so the user can pause and unpause the audio. Autoplay and loop are pretty self-explanatory. Better think twice about whether you actually want your audio to play instantly and on a loop, though. The text between the opening and closing tag is only displayed if the browser can’t handle the audio tag, to inform users that there should, in fact, be something here.
  • Video: <video src="file path/URL" controls autoplay loop> Sorry, your browser can’t handle this </video>  The video tag works pretty much the same as the audio tag, just, well, with videos. You can also specify width and height like with an image.

Inline Frames

An inline frame, or iframe for short, works pretty much like the video or audio tags, but it’s much more powerful. It lets you display any content, even a whole other website, inside the frame you specify. It’s kind of like a window to another page. It’s very useful for easily displaying graphics or web apps and such, but it can be a bit tricky to make it properly responsive. So if you’ve got the chance, always try to natively integrate your pretty interactive graphics into your webpage. But if you don’t have the time or resources, an iframe will do just fine. Take a look at its possible attributes here.

Structure
  • There are some HTML tags that don’t appear to be doing anything at first sight. They’re structural elements designed to let you style the layout of your page.
  • <div>content</div>  A div, short for division, is one of those elements. It’s used to define a separate section in your HTML document. If you don’t use CSS or JavaScript to tell the browser what to do with it, you won’t even notice it’s there. But if used properly, divs can be a great layout tool.
  • <span>  is like the div’s little sibling. While a div can be used to style entire sections of a text, move them or give them a specific format, <span> is more commonly used for styling smaller elements, like if you want to give part of your sentence a different color.

If you’ve followed this tutorial and tried out some of the options we’ve discussed so far, you should have something that resembles a website from the early 2000s by now. That’s pretty cool. Let’s see if we can make it even cooler. If you want to learn about styling your website with CSS and making it interactive with JavaScript, please continue with part two of our tutorial. See you there!


{Credits for the awesome featured image go to Phil Ninh}

R: Your first web application with shiny


Data driven journalism doesn’t necessarily involve user interaction. The analysis and its results may be enough to write a dashing article without ever mentioning a number. But let’s face it: We love to interact with data visualizations! To build those, some basic knowledge of JavaScript and HTML is usually required.
What? Your only coding skills are a bit of R? No problemo! What if I told you there was a way to interactively show users your most interesting R-results in a fancy web app?

Shiny to the rescue

Shiny is a highly customizable web application framework that turns your analysis into an interactive web app. No HTML, no JavaScript, no CSS required, although you can use them to extend your app. Also, the layout is responsive (although it’s not perfect for every phone).

In this tutorial, we will learn step by step how to code the Shiny app on Germany’s air pollutant emissions that you can see below.

Similarity and distance in data: Part 2


Part 1 | Code

In part one of this tutorial, you learned about what distance and similarity mean for data and how to measure it. Now, let’s see how we can implement distance measures in R. We’re going to look at the built-in dist() function and visualize similarities with a ggplot2 tile plot, also called a heatmap.

Implementation in R: the dist() function

The simplest way to do distance measures in R is the dist() function. It works with matrices as well as data frames and has options for a lot of the measures we’ve gotten to know in the last part.

The crucial argument here is method. It has six options, although two of them are special cases of a third, as you’ll see:

  • "euclidean": The Euclidean distance.
  • "maximum": The maximum distance, i.e. the largest difference in any single dimension.
  • "manhattan": The Manhattan or city block distance.
  • "canberra": The Canberra distance, a weighted version of the Manhattan distance in which each absolute difference is divided by the sum of the absolute values of the two entries.
  • "binary": The Jaccard distance.
  • "minkowski": Also called the Lp norm. The generalized version of Euclidean and Manhattan distance, controlled by the argument p. Returns the Manhattan distance for p = 1 and the Euclidean distance for p = 2.
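To see how these measures relate, here is a minimal sketch written in JavaScript, the language of our web tutorials. It is not R’s actual implementation. The function names and the two example vectors are made up for illustration, and the Canberra formula assumes the usual definition, the sum of |x - y| / (|x| + |y|), with 0/0 terms skipped.

```javascript
// Sum of absolute differences.
const manhattan = (x, y) =>
  x.reduce((acc, xi, i) => acc + Math.abs(xi - y[i]), 0);

// Square root of the sum of squared differences.
const euclidean = (x, y) =>
  Math.sqrt(x.reduce((acc, xi, i) => acc + (xi - y[i]) ** 2, 0));

// Largest absolute difference in any single dimension.
const maximum = (x, y) =>
  x.reduce((acc, xi, i) => Math.max(acc, Math.abs(xi - y[i])), 0);

// Manhattan distance with every term scaled by |x_i| + |y_i|.
const canberra = (x, y) =>
  x.reduce((acc, xi, i) => {
    const denom = Math.abs(xi) + Math.abs(y[i]);
    return denom === 0 ? acc : acc + Math.abs(xi - y[i]) / denom;
  }, 0);

// Generalized form: (sum of |x_i - y_i|^p)^(1/p).
const minkowski = (x, y, p) =>
  Math.pow(x.reduce((acc, xi, i) => acc + Math.pow(Math.abs(xi - y[i]), p), 0), 1 / p);

const a = [1, 2]; // made-up example vectors
const b = [4, 6];

console.log(manhattan(a, b)); // 7
console.log(euclidean(a, b)); // 5
console.log(maximum(a, b));   // 4
console.log(canberra(a, b));  // differs from the Manhattan distance
console.log(Math.abs(minkowski(a, b, 1) - manhattan(a, b)) < 1e-9); // true
console.log(Math.abs(minkowski(a, b, 2) - euclidean(a, b)) < 1e-9); // true
```

With p = 1 the Minkowski distance matches the Manhattan distance, and with p = 2 it matches the Euclidean distance, while the Canberra distance gives a different value because of its weighting.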

We’re going to be working with the Jaccard distance in this lecture, but it works just as well for the other distance measures.

Download today’s dataset on similarities between right wing parties in Europe. It’s in the .Rdata file format, so you can load it into R with the load() function.

It contains the data frame values, which holds data on which European right wing parties agree with which right wing policies. The columns represent parties, while the rows represent political views. The entries are ones and zeros: one if the party agrees with the idea, zero if it doesn’t. This means we’re working with a binary or Boolean matrix (data frame, to be exact, but you get the idea). If you remember what we talked about in part one of this tutorial, you’ll realize this is a perfect situation for the Jaccard distance measure.
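To make the measure concrete, here is a small sketch of the Jaccard distance for two columns of ones and zeros, written in JavaScript rather than R. The two party vectors are invented for illustration and are not taken from the dataset, the function names are made up, and returning 0 for two all-zero vectors is a convention chosen here.

```javascript
// Jaccard distance between two 0/1 vectors of equal length:
// positions where exactly one vector has a 1, divided by
// positions where at least one vector has a 1.
function jaccardDistance(a, b) {
  let onlyOne = 0;
  let atLeastOne = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === 1 || b[i] === 1) atLeastOne++;
    if (a[i] !== b[i]) onlyOne++;
  }
  // Convention chosen for this sketch: two all-zero vectors have distance 0.
  return atLeastOne === 0 ? 0 : onlyOne / atLeastOne;
}

// Similarity is the inverse of the distance.
const jaccardSimilarity = (a, b) => 1 - jaccardDistance(a, b);

// Invented example: agreement of two parties with five policies.
const partyA = [1, 1, 0, 1, 0];
const partyB = [1, 0, 0, 1, 1];

console.log(jaccardDistance(partyA, partyB));   // 0.5
console.log(jaccardSimilarity(partyA, partyB)); // 0.5
```

The third policy, which neither party supports, does not count at all. That is what makes the Jaccard distance a good fit for this kind of data: only ideas that at least one of the two parties agrees with enter the comparison.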

Since we want to visualize the similarities between the different parties, we want to calculate the distances over the columns of our dataset. This is a very important distinction, since some distance functions will calculate over rows by default, some over columns.

The dist() function works on rows. Since there’s no argument to switch to columns, we’ll have to transpose our dataset. Thankfully, this is pretty easy for data frames in R. We’ll just use t():

Note that with the default settings for diag and upper, the resulting “dist” object will have a triangle shape. That’s because we’re calculating the distance of every party to every other party, so the full matrix would be symmetric and one triangle already holds all the information. For our visualization, though, we want the full matrix. So to prepare for visualization, we’ll have to do two things:

  • Add the diagonal and the upper triangle to make a complete rectangle shape.
  • Convert back from a dist object to a data frame so ggplot can work with the data.

Also, remember how we wanted to visualize the similarity between the parties, not their distance? And remember how distance and similarity metrics are inverse to each other? Once we’ve converted back to a data frame, we can simply use 1 - jacc to get the Jaccard similarities from the Jaccard distances the dist() function returns.
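One way to do all of this, sketched on a made-up toy data frame standing in for values. Going through as.matrix() is an assumption on our part; it fills in the diagonal and the upper triangle in one go.

```r
# Hypothetical stand-in for the real dataset: columns are parties, rows are policies
values <- data.frame(
  party_a = c(1, 1, 0, 1),
  party_b = c(1, 0, 0, 1),
  party_c = c(0, 1, 1, 0),
  row.names = c("policy_1", "policy_2", "policy_3", "policy_4")
)

jacc <- dist(t(values), method = "binary")

# as.matrix() expands the triangle into the full symmetric matrix,
# as.data.frame() converts it into something ggplot can work with
jacc <- as.data.frame(as.matrix(jacc))

# Similarity is one minus distance
jaccsim <- 1 - jacc
jaccsim
```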

If everything went according to plan, View(jaccsim) should show a symmetric data frame with values between zero and one, the diagonal consisting of all ones.

From here, let’s start preparing the dataset for ggplot visualization. For more info on how to work with ggplot, check out our tutorial, if you haven’t already.

Melting the data

If you’ve followed our tutorial on the tidy data principles ggplot is built on, you’ll remember how we need to convert our data to the specific tidy format ggplot works with. To melt our data frame into a thin, long one instead of the rectangle shape it has right now, we’ll need to add a column containing the party names currently stored as row names, so the melting function will know what to melt on. Once we’ve done that, we can use melt() from the package reshape2 to convert our data.
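A minimal sketch of the melting step, using a small made-up jaccsim with three hypothetical parties. The column name names is chosen to match the aes(names, variable) call used for plotting later; the reshape2 package has to be installed.

```r
# Hypothetical similarity data frame: symmetric, ones on the diagonal
jaccsim <- data.frame(
  party_a = c(1, 2/3, 0.25),
  party_b = c(2/3, 1, 0),
  party_c = c(0.25, 0, 1),
  row.names = c("party_a", "party_b", "party_c")
)

# Store the row names in a regular column so melt() can use them as id
jaccsim$names <- rownames(jaccsim)

# One row per pair of parties: columns names, variable and value
jaccsim_m <- reshape2::melt(jaccsim, id.vars = "names")
head(jaccsim_m)
```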

Notice how we used the double colon “::” to specify to which package the function melt() belongs? This is a convenient alternative to loading an entire package if you only want to use one or two functions from it. As long as the package is installed, it will work just as well as library(reshape2).

The only thing left to do before we can start plotting is to make sure the parties are going to be in the right order when we plot them. When working with qualitative data, ggplot works with factors and plots the elements on the axes in the order of their factor levels. So we’ll make sure the levels are in the right order by specifying them explicitly:
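As a sketch, with a made-up long data frame in place of the melted data (the party names and values are hypothetical):

```r
parties <- c("party_a", "party_b", "party_c")

# Hypothetical melted data: one row per pair of parties
jaccsim_m <- expand.grid(names = parties, variable = parties,
                         stringsAsFactors = FALSE)
jaccsim_m$value <- c(1, 2/3, 0.25,
                     2/3, 1, 0,
                     0.25, 0, 1)

# Fix the order of the levels explicitly; one axis gets the reversed order
jaccsim_m$names    <- factor(jaccsim_m$names, levels = parties)
jaccsim_m$variable <- factor(jaccsim_m$variable, levels = rev(parties))

levels(jaccsim_m$names)
levels(jaccsim_m$variable)
```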

The second argument to factor() specifies the levels and their order. Since we want to plot the similarities of each party with every other, we’re going to have party names on both x and y axes. By specifying one of the axes to be in reverse order with the rev() function, we make sure our plot looks nice and clean, as we’ll see in the next step: The actual visualization.

Visualization: Tile plot

There are lots of ways to visualize similarity in data. When dealing with very small datasets like this one, one way to do it is using a heat map in tile format. At least that’s what I did, so that’s what you’re learning today. For each combination of two parties, there’s going to be a tile. The color of the tile shows the level of similarity between them: The more intense the color, the higher the similarity. The code we’re going to use is adapted from this blog post, which is really worth checking out.

First, we’re going to build the basic plot structure:
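A sketch of what that could look like on made-up data. The party names, the values and the exact colors ("white" and "steelblue") are assumptions for illustration.

```r
library(ggplot2)

parties <- c("party_a", "party_b", "party_c")

# Hypothetical melted similarity data: one row per pair of parties
jaccsim_m <- expand.grid(names = parties, variable = parties,
                         stringsAsFactors = FALSE)
jaccsim_m$value <- c(1, 2/3, 0.25,
                     2/3, 1, 0,
                     0.25, 0, 1)
jaccsim_m$names    <- factor(jaccsim_m$names, levels = parties)
jaccsim_m$variable <- factor(jaccsim_m$variable, levels = rev(parties))

sim <- ggplot(jaccsim_m, aes(names, variable)) +
  # One tile per pair of parties, filled by similarity, with white borders
  geom_tile(aes(fill = value), colour = "white") +
  # Gradient from white (low similarity) to blue (high similarity)
  scale_fill_gradient(low = "white", high = "steelblue")

sim
```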

Remember to load the ggplot2 package before you start plotting. We’re going to specify our x and y axes to be the two factors containing the party names with aes(names, variable). With geom_tile(), we define the basic structure of our plot: A set of tiles. They’re going to be filled according to the Jaccard similarities stored in the column value (aes(fill=value)). Their basic color is defined to be white, but we’ll create a gradient of blues with scale_fill_gradient(). Try different color schemes if you like.

With these three basic set-up functions, you’re going to end up with something like this if you take a look at sim:

[Figure: basic tile plot of the Jaccard similarities between the parties]

Not too bad, right? Notice how the diagonal of the tile matrix has the darkest possible blue. This makes sense, since those are the tiles comparing one party to itself. The lighter the color, the lower the similarity between the parties.

But this plot doesn’t look as pretty as we’d like it to yet. The labels are too small, the axis labels aren’t necessary, the signature grey ggplot background isn’t visually appealing in this case and the legend doesn’t look as nice as it could.

Thankfully, ggplot2 lets us edit all of that. Add these settings to your plot with the + operator and see what they do:
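Here’s a sketch of such a set of settings, again on made-up data. The concrete values (base_size = 16, a rotation angle of 330 degrees) are our own picks for illustration, so feel free to change them.

```r
library(ggplot2)

parties <- c("party_a", "party_b", "party_c")

# Hypothetical melted similarity data: one row per pair of parties
jaccsim_m <- expand.grid(names = parties, variable = parties,
                         stringsAsFactors = FALSE)
jaccsim_m$value <- c(1, 2/3, 0.25, 2/3, 1, 0, 0.25, 0, 1)
jaccsim_m$names    <- factor(jaccsim_m$names, levels = parties)
jaccsim_m$variable <- factor(jaccsim_m$variable, levels = rev(parties))

sim <- ggplot(jaccsim_m, aes(names, variable)) +
  geom_tile(aes(fill = value), colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue")

sim <- sim +
  theme_light(base_size = 16) +                   # clean theme, bigger text
  labs(x = "", y = "") +                          # no axis labels
  scale_x_discrete(expand = c(0, 0)) +            # no padding around the tiles
  scale_y_discrete(expand = c(0, 0)) +
  guides(fill = guide_colourbar(title = NULL)) +  # no legend title
  theme(axis.ticks = element_blank(),             # no axis ticks
        axis.text.x = element_text(angle = 330, hjust = 0))  # rotated x text

sim
```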

theme_light() is a standard theme with a clean look to it that fits our needs for this plot. The base_size argument lets us modify the text size of every text element in our plot. The default is 11 pt in recent ggplot2 versions (older ones used 12), but we want something a bit bigger for our plot. We don’t need any axis labels, so we’ll just pass the labs() function two empty strings.

The expand argument in the next two functions adds some space between the axes and our tiles, which we don’t want in this case. We’re going to set the argument to zero to make our plot look even cleaner. Also, we’re going to delete the legend title in the guides() function and remove the axis ticks with theme(). The text on the x axis looks a bit packed right now, so we’re going to rotate it a bit to give it more space. If everything worked out, your finished plot should look like this:

[Figure: finished tile plot with light theme, larger text, no axis labels and rotated x axis text]

That’s better, isn’t it? Play around with the settings a little if you like. Maybe change the text size, the legend title or the rotation angle of the x axis text.

Anyway, though: You did it! Yay! This is, of course, only one way to visualize similarities. I’m sure there are lots of other cool alternatives. If you find your own, leave a link in the comments, we’d love to hear about it. Until then: Experiment a little with similarity measures and ggplot options. See you in our next tutorial, our next meeting or on Slack if you want to keep up with all of the hot Journocode gossip. Have fun!

Part 1 | Code

Similarity and distance in data: Part 1

Part 2

In your work, you might encounter a situation where you want to analyze how similar your data points are to each other. Depending on the structure of your data though, “similar” may mean very different things. For example, if you’re working with records containing real-valued vectors, the notion of similarity has to be different than, say, for character strings or even whole documents. That’s why there’s a small collection of similarity measures to choose from, each tailored to different types of data and different purposes.

Before we get to know some of them, though, let’s think about what we’d expect such a measure to do. It’s easily done: If two objects are similar, the measure should be high (maximum for two perfectly similar objects). If they’re dissimilar, the value of the similarity measure should be low, so it should either converge to zero or to a negative number. We can, of course, set other expectations, but this is the bare minimum any measure of similarity should satisfy.

The more distant, the less similar

Because of these properties, similarity measures are often obtained by simply using the inverse of a distance metric. The intuition behind this is that the further apart two objects are, the more dissimilar they are and the bigger the “distance” between them is. The more similar the objects are, the closer they are and the smaller the distance between them is. This is why in this tutorial, we’ll take a look at different ways to measure the elusive concept of a “distance” between two points of data.

Distance measures should have a few specific properties. They might sound a little math-y, but we’ll concentrate on the relatively straightforward concepts behind them:

d(x,y) \geq 0
The distance of two objects x and y can’t be less than zero.

d(x,y) = 0 \iff x = y
Two perfectly similar objects have distance zero.

d(x,y) = d(y,x)
The distance between x and y is the same as between y and x — it doesn’t matter which way you go.

d(x,z) \leq d(x,y) + d(y,z)
If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. This is called the triangle inequality.

Now that we got that out of the way, let’s look at a few distance measures. Again, if it sounds too mathematical, just take a deep breath and focus on the concepts. Or just skip the math altogether and look at how to implement and visualize distance measures in R, which we’ll focus on in the second part of this tutorial.

Euclidean or Non-Euclidean?

There are two major classes of distance measures we can distinguish: Euclidean ones and Non-Euclidean ones. You should choose the appropriate one according to whether or not your data can be represented as points in a Euclidean space. A Euclidean space is any space that has some number of real-valued dimensions where points can be located. Your common two-dimensional or three-dimensional coordinate systems are examples of such spaces.

The important thing is that it has to be possible to define an average over the data points for it to be a Euclidean space. So if you’re working with vectors that have real-valued components you can compute an average over, then voilà, you’re working in a Euclidean space.

We’re going to look more closely at a few distance measures, Euclidean ones as well as Non-Euclidean ones:

Euclidean distance

This is pretty much the most common distance measure. It’s so common, in fact, that it’s often called the Euclidean distance, even though there are many Euclidean distance measures, as we just learned. It’s defined as

\sqrt{\sum\limits_{i=1}^n (x_i - y_i)^2}

This Euclidean distance adds up all the squared differences between corresponding coordinates and takes the square root of the result. Remember the Pythagorean theorem? If you look closely, the Euclidean distance is just that theorem solved for the hypotenuse, which is, in this case, the distance between x and y. The Euclidean distance is pretty solid: It’s bigger for larger distances, and smaller for closer data points. It can get arbitrarily large and is only zero if the data points are all exactly the same. That’s fine though. If you take a look at the requirements we set for a distance function, that’s exactly what we want.

Manhattan distance

\sum\limits_{i=1}^n |x_i - y_i|

Also known as city block distance, taxicab metric or snake distance, this is definitely the distance measure with the coolest name(s). Incidentally, they’re also pretty descriptive: The Manhattan distance is the shortest distance a car would have to drive in a city block structure to get from x to y. Since it takes the absolute distances in each dimension before we sum them up, the Manhattan distance will always be greater than or equal to the Euclidean distance, which we can imagine as the straight-line distance between the two points.

Maximum distance

The maximum distance looks at the distance of two points in each dimension and selects the biggest one. This one is pretty straightforward, but we can express it as a fancy formula anyway:

\max_{i}(|x_i - y_i|)

L-Norm / Minkowski distance

The L-Norm is the generalized version of the aforementioned distance measures. It is defined as

(\sum\limits_{i=1}^n |x_i - y_i|^p)^{\frac{1}{p}}

If p is equal to 2, we get the Euclidean distance, which is why it’s also called the L2-Norm. p = 1 returns the Manhattan distance or L1-Norm, and p = \infty yields the maximum distance.
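To see the special cases in action, here’s a small sketch. The function name minkowski() and the two example vectors are made up for illustration; the body is just the formula above written in R.

```r
# L-norm (Minkowski distance) between two numeric vectors of equal length
minkowski <- function(x, y, p) {
  sum(abs(x - y)^p)^(1 / p)
}

x <- c(0, 3, 1)
y <- c(4, 0, 1)

minkowski(x, y, p = 1)   # Manhattan distance: 4 + 3 + 0 = 7
minkowski(x, y, p = 2)   # Euclidean distance: sqrt(16 + 9) = 5
minkowski(x, y, p = 50)  # a large p gets very close to the maximum distance
max(abs(x - y))          # maximum distance: 4
```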

To sum up Euclidean distance measures, let’s take a look at how they work in a simple two-dimensional space. The maximum distance is equal to the biggest distance in any dimension. In this case, that’s the difference between the x values of points p and q, which is 8. The Manhattan distance sums up the distances in each dimension, so it’s 8 + 3 = 11 in this case.

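[Figure: points p and q in a two-dimensional coordinate system, with the Euclidean distance between them drawn as an orange line]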

What would the Euclidean distance, symbolized by the orange line, be? Visualized like this, it’s pretty obvious how we can use the Pythagorean formula to get the result:

d_E(p,q)^2 = | p_x - q_x |^2 + | p_y - q_y |^2 = 8^2 + 3^2

\iff d_E(p,q) = \sqrt{8^2 + 3^2} = \sqrt{73} \approx 8.5
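If you’d rather let R do the arithmetic, the built-in dist() function covers all of these measures. This is just a sketch: the coordinates below are made up so that the two points are 8 apart in x and 3 apart in y, like in the example.

```r
# Hypothetical coordinates: 8 apart in x, 3 apart in y
point_p <- c(x = 1, y = 2)
point_q <- c(x = 9, y = 5)
pts <- rbind(p = point_p, q = point_q)

dist(pts, method = "maximum")           # 8
dist(pts, method = "manhattan")         # 8 + 3 = 11
dist(pts, method = "euclidean")         # sqrt(73), roughly 8.5
dist(pts, method = "minkowski", p = 1)  # same as the Manhattan distance
```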

Amazing what can be done with a little trigonometry, right? Take a deep breath, because there’s more! Let’s look at some Non-Euclidean distance measures to make sure we can satisfy all our similarity measuring needs.

Cosine distance and similarity

The Cosine distance is defined by the angle between two vectors. As we know from basic linear algebra, the dot product of two vectors is defined by

x \cdot y = \|x\| \|y\| \cos{\theta}

where \theta is the angle between the two vectors. The smaller the angle is, the closer to 1 the cosine of the angle is, and the bigger the angle, the closer it is to -1. If you take a look at what we expected from a similarity measure, then the cosine meets our demands rather well. After all, if the angle between two vectors is very small, that means they’re very close together, and therefore more similar. So we’ll just solve the above equation for the cosine and define the cosine similarity to be equal to

\cos{\theta} = \frac{x \cdot y}{\|x\| \|y\|}

If we need to construct a distance measure from here, we can just take the inverse, as we learned before. So the cosine distance is defined as

1 - \cos{\theta}

Since we’re talking about vectors, it might be easy to assume this is also a Euclidean distance measure, and that may be right. If the vectors in question have real-valued components, the cosine distance is Euclidean. But if the components have to be, say, integers, we can’t compute an average over the points, because we might get a non-integer result. Also, the cosine distance as such doesn’t satisfy the triangle inequality unless we alter it a bit. The cosine similarity, though, is a nice and efficient way to determine similarity in all kinds of multi-dimensional, numeric data.
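As a quick illustration, the cosine similarity and distance fit into a few lines of R. The function names and example vectors are made up for this sketch.

```r
# Cosine similarity: dot product divided by the product of the vector lengths
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Cosine distance: one minus the similarity
cosine_distance <- function(x, y) {
  1 - cosine_similarity(x, y)
}

x <- c(1, 0)

cosine_similarity(x, c(2, 0))   # same direction: 1
cosine_similarity(x, c(0, 1))   # right angle: 0
cosine_similarity(x, c(-1, 0))  # opposite direction: -1
cosine_distance(x, c(2, 0))     # same direction: distance 0
```

Note that the length of the vectors doesn’t matter here, only the direction they point in.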

Jaccard distance and similarity

Like with the cosine distance and similarity, the Jaccard distance is defined as one minus the Jaccard similarity. The Jaccard similarity uses a different approach to similarity than the measures we’ve seen so far. To compare two objects, it looks at the elements they have in common (the intersection) and divides it by the number of elements the two objects have in total (the union). Written out as a formula, that definition looks like this

\frac{|X \cap Y|}{|X \cup Y|}

\cap is the mathematical sign for intersection, \cup means union, and the vertical bars stand for the number of elements in a set. With this definition, the similarity is only equal to one if all elements are the same and only becomes zero if the two objects have no elements in common. Perfect for a similarity measure, but the wrong way around for a distance measure. This is easily solved by defining the Jaccard distance to be

1 - \frac{|X \cap Y|}{|X \cup Y|}

As an example, let’s compare the two sentences “Yesterday, the warm weather was perfect for my cat” and “My cat liked the warm weather yesterday”. Let’s call them X and Y. We could, of course, have used numbers or a mix of both as well, the Jaccard similarity doesn’t care.

The sentences have 6 words in common and 10 unique words in total. So the Jaccard similarity between them is 6/10 = 0.6 = 60 %. Their Jaccard distance is 1 – 0.6 = 0.4 = 40%.

A nice way to represent objects you want to compute the Jaccard similarity of is in the form of a Boolean matrix, a matrix with only ones and zeroes. The columns of our matrix symbolize the objects we want to find the similarity of and our rows are the unique elements of both objects — in this case, the words. One means the word is present in the object, zero means it isn’t. To compute the Jaccard similarity over the columns, all we have to do is count the rows where both objects have ones (6) and divide it by the total number of rows (10).

words      X  Y
yesterday  1  1
the        1  1
warm       1  1
weather    1  1
was        1  0
perfect    1  0
for        1  0
my         1  1
cat        1  1
liked      0  1

 

We don’t have to stop at single sentences, though. The Jaccard similarity is an efficient way to compute similarity over entire documents — a lot of documents if necessary. Our corresponding Boolean matrix will get very big, of course, but since the formula is relatively simple, it scales rather well to large datasets.
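The sentence example from above can be reproduced in a few lines of R. This is only a sketch: tokenize() and jaccard_similarity() are helper names we made up, and the tokenizer is deliberately simple (lower case, punctuation removed, split on whitespace).

```r
# Split a sentence into its set of unique lower-case words
tokenize <- function(s) {
  unique(strsplit(tolower(gsub("[[:punct:]]", "", s)), "\\s+")[[1]])
}

# Size of the intersection divided by the size of the union
jaccard_similarity <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

x <- tokenize("Yesterday, the warm weather was perfect for my cat")
y <- tokenize("My cat liked the warm weather yesterday")

jaccard_similarity(x, y)      # 6 / 10 = 0.6
1 - jaccard_similarity(x, y)  # Jaccard distance: 0.4

# The same thing via a Boolean matrix: rows are words, columns are sentences
words <- union(x, y)
m <- cbind(X = as.integer(words %in% x), Y = as.integer(words %in% y))
rownames(m) <- words
sum(m[, "X"] & m[, "Y"]) / nrow(m)  # rows with two ones / all rows = 0.6
```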

Edit distance

Lastly, let’s think about how to measure the similarity of two character strings. One way to do that is the edit distance. The edit distance is simply the minimum number of inserts and deletes needed to get from one string to the other.

Let’s say we have the words “knock” and “flocks”. To get from one to the other, we have to delete two letters (k, n) and insert three (f, l, s):

knock → nock → ock → lock → flock → flocks

So the edit distance between them is five. The edit distance is a proper distance measure since it satisfies all four requirements we set at the beginning of this lesson.

  • The distance of two objects x and y can’t be less than zero. There’s no way to do a negative number of edits, so that’s true.
  • Two perfectly similar objects have distance zero. We don’t need any edits to transform a word into itself.
  • The distance between x and y is the same as between y and x. Every insert into one word is equal to a delete from the other, so the paths you take are always inverse and have the same number of steps.
  • If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. Changing from word x to word y before you change to z is one way to go from x to z. The direct way might be shorter, but it can never be longer than the detour.
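For the curious, here’s a sketch of how this insert-and-delete edit distance could be computed. It relies on a standard trick: every letter that isn’t part of the longest common subsequence of the two words has to be either deleted or inserted. The function name edit_distance() is made up for this example.

```r
# Edit distance with inserts and deletes only, via the longest common subsequence
edit_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # lcs[i + 1, j + 1]: length of the longest common subsequence of
  # the first i letters of a and the first j letters of b
  lcs <- matrix(0, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      if (x[i] == y[j]) {
        lcs[i + 1, j + 1] <- lcs[i, j] + 1
      } else {
        lcs[i + 1, j + 1] <- max(lcs[i, j + 1], lcs[i + 1, j])
      }
    }
  }

  # Letters outside the common subsequence are deleted from a or inserted from b
  n + m - 2 * lcs[n + 1, m + 1]
}

edit_distance("knock", "flocks")  # common subsequence "ock": 5 + 6 - 2 * 3 = 5
edit_distance("knock", "knock")   # 0
```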

 

Congratulations! You made it through all of the math and learned a lot about some ways to measure distance and similarity in your data. In Part two of this lesson, we’re going to leave the theory behind us. We’ll take a look at how to actually compute these distance measures in R and think about how to visualize similarity in data.

Part 2