JournoCon 2018: A Recap of our first data-driven conference


On March 24th, we hosted JournoCon 2018, a data journalism bootcamp-slash-conference for students, freelance journalists and anyone wanting to learn more about the field. To those who were there: Thank you for making it an amazing day! To those who weren’t: Here’s a recap.

For beginners, it’s not always easy to get into data-driven work. You need skills you usually don’t learn in journalism school, and in many newsrooms, there’s no infrastructure in place for journalists to augment their skills and connect with designers and developers. Still, in our experience, what most journalists do have is motivation. What they generally do not have are

  1. time
  2. money.

So, when planning our first ddj conference, we wanted to make it affordable, even for students and trainees. You can’t teach everything there is to know about data-driven journalism in a day, of course. But we wanted to give learners a starting point, an overview of the steps involved. Basically: If people knew what to google when they got home, we would have done our job for the day. The result – after three months of planning – was JournoCon 2018: one day, 60 participants, 7 squirrels and 5 amazing speakers from data journalism teams around Germany. We held it in German, so the slides and resources we link to at the bottom aren’t available in English at the moment. But we’ll have more events, no doubt. For now, those of you who don’t speak German will have to make do with this recap.

The conference was structured around the data-driven workflow:

This is the simplified version, of course, but it’ll work for our purposes. So, we organized the conference into five blocks, corresponding to the five main steps in working with data. For each block, there were two tracks:

  1. a keynote track with two half-hour segments: talks, discussions and presentations on aspects of the workflow step
  2. a workshop track with one hour-long course for those who wanted to try a hands-on approach to ddj.

Thanks to our amazing host, the Infographics Group, the conference took place in a beautiful space in the heart of Berlin. We were also happy to have some of our other supporters present at the conference. For our introduction, we decided to play a little warm-up game to get talking about data-driven journalism.

Finding ideas

For our first segment, we invited leaders from German ddj teams to talk about their process of coming up with ideas. Did you know that, in many cases, data journalism teams are also the ones bringing new technologies to the rest of their newsrooms? Thanks so much to Christina Elmer (Spiegel Online), Götz Gringmuth (rbb) and Michael Kreil (Infographics Group) for participating.

Researching data

Michael Hörz, data journalist at rbb and ddj lecturer, held the first workshop of the day: For some hands-on data research, he gave an overview of data sources available in Germany (You can find his slides here). In the meantime, our Sophie Rotgeri gave a short introduction to data sources in the keynote track as well, supplemented by an introduction to scraping from our hacker-in-chief, Sakander Zirai. If you want to try some hands-on scraping yourself, check out our tutorial on how to scrape a website with Python, R and the tool import.io.

Cleaning messy data

First things first: What is clean data, anyway? In his talk, Journocode squirrel Moritz Zajonz explained what “readable” means to a machine versus a human, and why CSVs and Excel often don’t get along. And cleaning data can be especially tricky when dealing with large amounts of it, as Vanessa Wormer, leader of the Süddeutsche Zeitung data team, knows from her work on the Panama Papers and the Paradise Papers. In her keynote, she presented the most helpful tools to deal with big datasets. For those who wanted to dig into data right away, Kira Schacht held a workshop on how to use Open Refine for data cleaning.

Analyzing data

When Journocode started out, our main goal was to teach each other about data analysis in the statistical programming language R. We haven’t forgotten our roots, so of course, the JournoCon featured an R crash course held by our founder Marie-Louise Timcke. We will publish the course as a tutorial on our website soon, so stay tuned! Until then, you can find all of our previous R tutorials here.

But R is far from the only thing journos can learn about analyzing data: In the keynote track, Moritz Zajonz gave an overview of the key lessons every journalist should know about statistics, and our computer scientists Elena Erdmann and Sakander Zirai talked about how to use machine learning methods in a journalistic way.

 

Data visualization

Finally, we were glad to have Lisa Charlotte Rost on board with a workshop on Datawrapper, a data visualization tool designed to be used in the newsroom. As for talks, we wrapped up the day with a look into the hottest trends in dataviz: Phillip Hafellner of Infographics Group talked all things AR, VR and Gamification. While opening the first beers of the evening, we ended on a little contest: In “painting by numbers”, our participants had five minutes to create a visualization for a dataset they had just learned about. And they came up with some amazing ideas. Take a look:

 

To JournoCon 2019… and beyond!

It was great fun seeing so many motivated people come together. All in all, we’re definitely doing this again! We also got some really helpful feedback from attendees after the conference. We’re going to take all of it to heart for JournoCon 2019, the main points being:

  • more space for workshops
  • longer and more pauses
  • more time for networking. Turns out, journalists really like talking to each other.

Some people asked for more in-depth ddj talks, others were glad we kept it simple, still others thought some points were too specific already. We definitely want to keep the JournoCon beginner-friendly, but we’d also love to give more advanced people the opportunity to talk shop. Maybe that’s going to mean a longer conference, maybe it means more tracks tailored to different levels of expertise. Maybe we’re going to organize more mini-conferences with more in-depth scopes – we don’t know yet. What we do know is that we’re only just getting started.

So, folks: See you for #JournoCon19 – at the latest!

More Info

You can find our conference pad here. It contains links to all our presentations, as well as further resources for many of the conference’s talks and workshops. For more impressions from the conference, be sure to check out #JournoCon18 on Twitter. The complete program and more information on the speakers can also be found on the conference website.

On the website, you’ll also find more info on our supporters, without which the JournoCon 2018 would not have been possible. Special thanks go to Infographics Group Berlin for providing location and logistics.

 

JournoCon 2018: A data-driven conference


We are excited to announce our first one-day data journalism boot camp/conference in Berlin! If you’re a German-speaking person interested in data-driven stuff, be sure to check it out! We offer workshops, talks and discussions around every step of the data-driven workflow.

Note: Dear English-speaking person! We’re sorry, but this event will be held in German. Stay tuned for our next events, though. There will definitely be some in English!


HTML, CSS & a little JavaScript: The Basics (Part II)


Part I

This is part two of our tutorial on HTML, CSS and a little bit of JavaScript. In the last part, we learned about the basic functions of those three languages and got to know a few useful HTML commands. If you’ve already read part one or you know all of that stuff anyway, this is the perfect spot for you to continue – by learning about CSS and how to implement JavaScript libraries into your webpage. Let’s get right to it!

CSS

CSS is short for Cascading Style Sheets. It’s a neat way to tell your browser how it should make your webpage look. You can style your HTML code in different ways.

For small adjustments, it might be enough to use inline style. With this method, you specify the style adjustments for specific elements right in the element’s tag. For example, we talked about the <span> element in part one. This is how you would use <span> and some inline style to colour part of your sentence:
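A minimal sketch – the sentence text is made up:

```html
<p>This is <span style="color:red">the colourful part</span> of the sentence.</p>
```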

Here, style works like an attribute. You can specify multiple things in one inline style attribute, like style="color:red;text-align:right" to change text alignment and color in one go – just make sure to put quotes around the entire thing or it won’t work.

 Of course, there are a lot more possible settings than just the text color:

  • You’ve just learned what color  does. You can choose from the default color names or specify by hexadecimal code or rgb color code.
  • background-color  sets the background color of the current element. That could be some text or an entire paragraph, a section defined by a div or even the whole body of the document.
  • font-family  specifies the font for an element.
  • font-size  specifies the size of the current text element.
  • text-align  lets you choose to align the text left, right, centered or justified. initial chooses the default value and inherit aligns the text according to the alignment of its parent element.

There’s more, but let’s just work with that for now. Try using inline style to change the above properties in the HTML document you created in part one of this tutorial.

If you want to define a lot of properties or edit a lot of elements, inline style might be a little inconvenient. Thankfully, there is a different way: You could use the style tag. Anything you type between <style></style>  in your document will be interpreted as CSS commands. You’ll usually do that in the head, since it counts as meta information. But if you write some CSS code in the head of your document, how is your browser supposed to know which element you’re referring to?

That’s why there are things called IDs and classes. For every HTML element in your document, you can define one (or even multiple) classes and a unique ID. It will look like this:
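A sketch with made-up content, matching the description that follows:

```html
<div class="red pushright" id="a1">First div</div>
<div class="red">Second div</div>
<div>Third div</div>
```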

We just gave the first div the classes red and pushright as well as the ID a1. The second one only has the class red, and the third one doesn’t have any class or ID at all. Classes are usually used for a larger number of elements that should share the same CSS properties. IDs are unique and a useful way to single out one specific element you want to manipulate.

Let’s use CSS in our <style>  tags to turn the text color of anything with the red class to red and to align right any element with the class pushright. The element with the ID a1 should be in bold print. Also, all divs without any classes should have the text color purple and be aligned left. It would look something like this:
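One way those rules could be written:

```html
<style>
  div        { color: purple; text-align: left; }  /* divs without further rules */
  .red       { color: red; }
  .pushright { text-align: right; }
  #a1        { font-weight: bold; }
</style>
```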

If you copy and paste the above text into the <head> of your document, you should see the changes we wanted to apply the next time you refresh the file in the browser.

There’s one mystery though. The first style rule is supposed to apply to all <div>  elements and change their text color to purple. But two of our divs are red anyway. Of course, that’s because they have the red class, and apparently, the style rules for classes are stronger than those for general elements if there’s a contradiction.

In reality, it’s a little more complicated than that. There are whole tutorials centered around the so-called specificity rules of CSS that determine which style rules get applied and what place in the CSS hierarchy they occupy. If some rules don’t get applied even though you think they should, it’s probably a specificity error. As a basic rule, though, we can remember this hierarchy, in order from highest to lowest priority:

  1. Inline style
  2. IDs
  3. classes
  4. general elements

Try to play around with these CSS settings a little. In the meantime, we’ll take a quick look at how to incorporate JavaScript libraries into your webpage and how to manage the files that make up your site.

JavaScript libraries

JavaScript is quite a powerful language that lets you manipulate pretty much everything about your webpage. It might get complicated pretty quickly though if you want to create more complex elements. Thankfully, you don’t have to do all the work yourself. There are lots of JavaScript libraries available online, bundles of functions that serve a specific purpose. If you want to code interactive maps for a webpage, you might be interested in Leaflet, for example. Highcharts or the very powerful D3 library are wonderful tools for creating all kinds of data visualizations. To use the functions provided in those libraries, you’ll have to import them into your current document.

Most libraries even provide tutorials on how to do that. All you actually need are the files where the library’s functions are defined; you’ll find links to those on the library’s webpage. Once you know where the files are, you have two options for importing them. You can either use a hosted version that the creators of the library have uploaded to their servers. In that case, just import them in the head of your HTML document – a <link> tag for CSS stylesheets and a <script> tag for JavaScript files, as sketched below. That’s all there is to it!
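For example, with placeholder URLs standing in for the real file locations:

```html
<link rel="stylesheet" href="https://example.com/library.css">
<script src="https://example.com/library.js"></script>
```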

If you want to be on the safe side, you can also download the files from the links provided and save them on your own computer. If you do that, just replace the URL in the above commands with the path to the files on your computer. Not that scary, right?

File management: External files and file paths

While we’re at the topic of importing external scripts, let’s take a quick look at file management before we call it a day. Until now, we’ve coded our entire webpage in a single document. We’ve embedded CSS code with the <style> tags and JavaScript with the <script> tags. Once you build more complex pages, that might mean your document will get pretty lengthy and confusing. There’s an easy way out, though: Just save your CSS code as a .css file and your JavaScript bits as a .js file. It’s helpful to create a new folder to keep all the files in one place. The HTML document then serves as the hub that combines all the files into one page, which is why it’s customary to name your main HTML document index.html.

[Image: htmlfiles1 – a project folder holding index.html plus separate .css and .js files]

Then, you can refer to the files in the same way we explained above for libraries – after all, they’re nothing but scripts and style documents themselves.
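Assuming you named your files style.css and script.js and kept them next to index.html, that could look like this:

```html
<link rel="stylesheet" href="style.css">
<script src="script.js"></script>
```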

If you put the links to the files in the same place we put the scripts and style rules before, there should be no difference between separate and joint files.

When specifying the file paths, you don’t have to use complete paths. It’s possible, but not advisable: if you move your folder or decide to deploy your page to a server instead of keeping it on your computer where only you can see it, the paths change and your index.html won’t be able to locate the files it needs. Instead, use relative file paths, meaning the path relative to the current location of your index.html. So if you keep all your files in one folder, it’s enough to specify just the file name – that makes the browser look in the current directory. If your files are in a subfolder, specify the path from index.html to the file, for example "subfolder1/subfolder2/filename.css". If the file is located in a directory above, ".." is what you need to type to get to the parent directory of the current one. So if (for some reason) your files are organized like this:

[Image: htmlfiles2 – an example folder structure with index.html and script.js in different directories]

Then what you need to type to get from your index.html file to your script.js file is:
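The folder diagram above isn’t reproduced here, so this is only a hypothetical example: if script.js sits in a folder called scripts next to the folder that contains index.html, you would go up one level and then back down:

```html
<script src="../scripts/script.js"></script>
```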

 

That’s all, folks!

That’s it for today. We hope you got a vague idea of what to google if you want to learn about web development. This is a wonderful step on your journey towards becoming a proficient web developer – or, at least, towards understanding what the heck those proficient web developers are talking about. We’re going to dive a little deeper into the world of HTML, CSS and JavaScript in our next tutorials. You’re going to get to know some interesting libraries and tutorials on how to build webpages and interactive graphics yourself. So we hope you stick around for a while! In the meantime, feel free to comment or join our slack team if you have any questions or if you just want to say hi. See you soon!

Part 1

 

{Credits for the awesome featured image go to Phil Ninh}

HTML, CSS & JavaScript: The Basics (Part I)


Part II

Becoming a proficient web developer is hard — but understanding the basics isn’t. So this is what we’ll do today. By the end of this tutorial, you should have an idea of what people mean when they talk about HTML, CSS and JavaScript.

In this part, we’ll talk about the purpose of those three and learn a bit of basic HTML. In part two, we’ll learn a little more about CSS and JavaScript, especially about the use of JavaScript libraries, and how to combine all three to build a website. So let’s do this!

HTML, CSS and JavaScript are what most webpages are made of. They work together, each with a specific role. While you write a webpage, it will basically just look like a text document. You could write it in any text editor of your choice, although it’s probably a good idea to use a code editor like Atom or Sublime Text that will help you with formatting and code completion.

Once you’ve written the page, your browser is where the magic happens. It interprets the code you wrote in your .html, .css and .js text files and constructs a fully functioning, beautiful website from it – given there are no errors in your code, of course. But that’s what we’re here for today.

HTML, or HyperText Markup Language, is the backbone of any webpage. It determines its basic structure and content. However, if you only use HTML, your webpage will end up looking like the very first pages from the early days of the web. No pretty formatting, no interactive elements. You’ll see it sometimes when your internet connection is very slow and your browser won’t load the page properly. It’s okay, but we can do better. That’s what CSS is for.

CSS is short for Cascading Style Sheets. This is the code that tells your browser how your webpage should look: what fonts and colors to use, what size your text should be, etc. But a pretty-looking website is not all you could want. You might want to interact with the webpage, click things, move stuff around and have it respond to your actions. HTML and CSS can’t do that very well, but JavaScript can.

JavaScript tells your browser how the page should behave. It’s what you write interactive graphics with, or pretty much any interactive element on your website. Compared to the other two, it’s a pretty complex language that you probably won’t learn entirely in a week or two. But that’s okay. For now, it’s enough to know what it does.

So if you only remember one thing from this tutorial, let it be this summary of what makes up a webpage:

HTML: Structure/content
CSS: Style
JavaScript: Behaviour

If your mind has a little more room left right now, then let’s take a closer look at how to build a webpage. As mentioned, webpages are nothing but text documents that your browser knows how to interpret. To write these documents, you can use any code editor of your choice. I use Atom, which works very nicely as long as you don’t try to view giant datasets.

We’re going to look at some useful commands in this tutorial. But of course, there’s lots more to discover. If you want to look up more stuff, W3schools is a good place to start. You’ll find references for HTML, CSS, JavaScript and more. There are also some good MOOCs (Massive Open Online Courses) for web development. Take a look over at Coursera or Codecademy, for example.

It’s always a good idea to look at the code of other webpages and see how they work. To do so, right click anywhere on a website. The details are different for each browser, but there will be an option like “inspect element” or “show source code”. If you click it, you should see a lot of code. This is the website as your browser sees it. Try to understand what’s going on. This is a good source of inspiration for your own pages as well.

While coding your own projects, it can also be helpful to take a look at the console of your browser. You can find it by clicking “inspect element” (or whatever corresponds to that in your browser) and choosing the tab named “console” in the window that opens. The console will show you an error message if something isn’t right with your page. That message can help you identify the problem, or at least give you something to google if you don’t understand the problem right away.

But you’ll get to that later. First, let’s get to know the basics.

HTML

Let’s start with some HTML code. If you want to code along, open up a new document in the code editor of your choice. Copy the code below into the file and save it as, for example, index.html (the .html part is the important one).
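A minimal sketch of such a skeleton – the title text is made up:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>My first webpage</title>
  </head>
  <body>
  </body>
</html>
```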

This is the structure each HTML document follows. The first thing you’ll notice is lots of angle bracket action. The words in between the angle brackets are keywords we call tag names. Together with the brackets, they make up a tag that serves a specific function. HTML content is always placed between a start tag and an end tag that begins with a slash, like this:
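In general terms:

```html
<tagname>Content goes here</tagname>
```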

Most tags accept attributes that specify how they should behave:
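Schematically, with a made-up attribute name and value:

```html
<tagname attribute="value">Content goes here</tagname>
```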

We’ll get to know some examples of that soon.

Any HTML document starts with the declaration <!DOCTYPE html> that tells the browser what kind of document to expect (html in this case, duh). Then, there’s the <html>  tag. It doesn’t close until the very end of the page and makes it extra clear that anything in between will, in fact, be HTML code.

The two basic parts of your page are the <head>  and the <body> . The head contains all the meta information you want or need to make your website function properly. It can be used, for example, to give your website a name ( <title>), load necessary CSS information ( <style>) or JavaScript sources ( <script>), which we’ll get to later on. You can also set lots of other stuff like the language or the character encoding in the head.

Anything in the head will not be directly visible on the webpage, so if you want to finally add some content, take a look at the body of your page. This is where you put your actual content. There are quite a few options as to how to write your text and what media to incorporate into your page. Let’s take a quick look at a few important tags for that. If you want, copy them into your freshly made HTML document and play around with them a little. To see what they do, just save the file and open it in your browser (you can leave it open in your code editor as well, of course).

 

Formatting
  • <p>  stands for “paragraph”. Use it to write some important stuff onto your website, and don’t forget to close your paragraph with </p> . End tags are important for almost all the tags we’ll discuss here.
  • <h1> ... <h6>  are headings. The higher the number, the smaller the font of your heading will be.
  • <b> or <strong> is used to make your text bold. Both tags do essentially the same thing.
  • <i> or <em> are both used for italics.
  • <u> underlines text, but you can forget about that one instantly. On the web, it’s pretty much never a good idea to underline text that’s not a link, and those are underlined by default.
  • <br>  creates a line break. This is one of very few so-called void tags that don’t need a closing tag.
Lists
  • <ul>  creates an unordered list like the one you’re reading right now.
  • <ol>  creates an ordered list. It can be modified with some attributes. For example,  <ol start=343 reversed>  will start the list at number 343 instead of 1 and count down from there instead of up. Here you can find a list of all possible attributes.
  • <li>  creates a list element to populate the unordered or ordered lists you created.
  • Your shopping list in HTML might look something like this:
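A minimal sketch – the items are made up:

```html
<ul>
  <li>Milk</li>
  <li>Bread</li>
  <li>Coffee</li>
</ul>
```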
Media
  • Links: <a href="URL" target="_blank">Link text</a>  The “a” stands for anchor and is used to create a link. Put the URL you want to refer to in the href attribute. target="_blank" makes the link open in a new tab. If you leave it out, it will default to opening the link in the current tab.
  • Images: <img src="file path/URL" height="550" width="100%">  The img tag is another void tag, so you don’t need to type </img> after it. Try experimenting with the width and height settings. If you specify them as just a number, the unit will default to pixels. By specifying percentages, you can scale the image according to the screen it’s viewed on. With the above specifications, for example, our picture will span the entire width of the screen and be 550 pixels high. Note that if you specify both width and height, you might skew the proportions of your image. That is why it’s sometimes better to specify only the width or only the height.
  • Audio: <audio src="file path/URL" controls autoplay loop>Sorry, your browser can’t handle this</audio>  The audio tag lets you incorporate any audio file into your website. It’s probably a good idea to add controls as an attribute so the user can pause and unpause the audio. autoplay and loop are pretty self-explanatory – better think twice about whether you actually want your audio to play instantly and on a loop, though. The text between the opening and closing tag is only displayed if the browser can’t load the audio file, to inform users that there should, in fact, be something here.
  • Video: <video src="file path/URL" controls autoplay loop>Sorry, your browser can’t handle this</video>  The video tag works pretty much the same as the audio tag, just, well, with videos. You can also specify width and height like with an image.

Inline Frames

An inline frame, or iframe for short, works pretty much like the video or audio tags, but it’s much more powerful. It lets you display any content, even a whole other website, inside the frame you specify. It’s kind of like a window to another page. It’s very useful for easily displaying graphics or web apps, but it can be a bit tricky to make it properly responsive. So if you’ve got the chance, always try to natively integrate your interactive graphics into your webpage. But if you don’t have the time or resources, an iframe will do just fine. Take a look at its possible attributes here.
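A sketch with a placeholder URL:

```html
<iframe src="https://example.com/graphic.html" width="100%" height="400"></iframe>
```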

Structure
  • There’s some HTML tags that don’t appear to be doing anything on first sight. They’re structural elements designed to let you style the layout of your page.
  • <div>content</div>  A div, short for division, is one of those elements. It’s used to define a separate section in your HTML document. If you don’t use CSS or JavaScript to tell the browser what to do with it, you won’t even notice it’s there. But if used properly, divs can be a great layout tool.
  • <span> is like the div’s little sibling. While a div can be used to style entire sections of a text, move them or give them a specific format, <span> is more commonly used for styling smaller elements, like when you want to give part of your sentence a different color.

If you’ve followed this tutorial and tried out some of the options we’ve discussed so far, you should have something that resembles a website from the early 2000s by now. That’s pretty cool. Let’s see if we can make it even cooler. If you want to learn about styling your website with CSS and making it interactive with JavaScript, please continue with part two of our tutorial. See you there!

Part II

{Credits for the awesome featured image go to Phil Ninh}

Similarity and distance in data: Part 2


Part 1 | Code

In part one of this tutorial, you learned about what distance and similarity mean for data and how to measure it. Now, let’s see how we can implement distance measures in R. We’re going to look at the built-in dist() function and visualize similarities with a ggplot2 tile plot, also called a heatmap.

Implementation in R: the dist() function

The simplest way to do distance measures in R is the dist() function. It works with matrices as well as data frames and has options for a lot of the measures we’ve gotten to know in the last part.
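For reference, a call looks roughly like this – the signature below is that of base R’s dist():

```r
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
```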

The crucial argument here is method. It has six options – though some of them overlap, as you’ll see:

  • "euclidean": the Euclidean distance.
  • "maximum": the maximum distance.
  • "manhattan": the Manhattan or city block distance.
  • "canberra": the Canberra distance, a weighted variant of the Manhattan distance.
  • "binary": the Jaccard distance.
  • "minkowski": also called the L-Norm, the generalized version of the Euclidean and Manhattan distances. It returns the Manhattan distance for p = 1 and the Euclidean distance for p = 2.

We’re going to be working with the Jaccard distance in this lecture, but it works just as well for the other distance measures.

Download today’s dataset on similarities between right wing parties in Europe. It’s in the .Rdata file format, so you can load it into R with the load() function.
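Assuming you saved the download as parties.Rdata in your working directory, loading it is a one-liner:

```r
load("parties.Rdata")   # creates the data frame `values` in your workspace
```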

It contains the data frame values, which holds data on which European right-wing parties agree with which right-wing policies. The columns represent parties, while the rows represent political views. The entries are ones and zeros — one if the party agrees with the idea, zero if it doesn’t. This means we’re working with a binary or Boolean matrix (a data frame, to be exact, but you get the idea). If you remember what we talked about in part one of this tutorial, you’ll realize this is a perfect situation for the Jaccard distance measure.

Since we want to visualize the similarities between the different parties, we want to calculate the distances over the columns of our dataset. This is an important distinction, since some distance functions calculate over rows by default, some over columns.

The dist() function works on rows. Since there’s no argument to switch to columns, we’ll have to transpose our dataset. Thankfully, this is pretty easy for data frames in R. We’ll just use t():
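A sketch – we store the result in an object called jacc, the name the next steps refer to:

```r
# transpose so the parties end up in rows, then compute Jaccard distances
jacc <- dist(t(values), method = "binary")   # "binary" = Jaccard
```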

Note that with the default settings for diag and upper, the resulting “dist” object will have a triangle shape: because we’re calculating the distance of every party to every other party, the full matrix would be symmetric, so dist() only keeps the lower triangle. Since we want to visualize our results, though, we need the full matrix. So to prepare for visualization, we’ll have to do two things:

  • Add the diagonal and the upper triangle to make a complete rectangle shape.
  • Convert back from a dist object to a data frame so ggplot can work with the data.

Also, remember how we wanted to visualize the similarity between the parties, not their distance? And remember how distance and similarity metrics are inverse to each other? Once we’ve converted back to a data frame, we can simply use 1 - jacc  to get the Jaccard similarities from the Jaccard distances the dist() function returns.
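Putting the two steps and the subtraction together might look like this:

```r
jacc <- as.matrix(jacc)       # expand to the full symmetric matrix, diagonal included
jacc <- as.data.frame(jacc)   # back to a data frame so ggplot can work with it
jaccsim <- 1 - jacc           # similarities instead of distances
```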

If everything went according to plan, View(jaccsim) should show a symmetric data frame with values between zero and one, the diagonal consisting of all ones.

From here, let’s start preparing the dataset for ggplot visualization. For more info on how to work with ggplot, check out our tutorial, if you haven’t already.

Melting the data

If you’ve followed our tutorial on the tidy data principles ggplot is built on, you’ll remember that we need to convert our data to the specific tidy format ggplot works with. To melt our data frame into a thin, long one instead of the rectangle shape it has right now, we’ll need to add a column containing the party names currently stored as row names, so the melting function knows what to melt on. Once we’ve done that, we can use melt() from the package reshape2 to convert our data:
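A sketch, with molten as our (made-up) name for the long-format data frame:

```r
jaccsim$names <- rownames(jaccsim)                    # party names as a proper column
molten <- reshape2::melt(jaccsim, id.vars = "names")  # long format: names, variable, value
```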

Notice how we used the double colon “::” to specify to which package the function melt() belongs? This is a convenient alternative to loading an entire package if you only want to use one or two functions from it. As long as the package is installed, it will work just as well as library(reshape2).

The only thing left to do before we can start plotting is to make sure the parties are going to be in the right order when we plot them. When working with qualitative data, ggplot works with factors and plots the elements on the axes in the order of their factor levels. So we’ll make sure the levels are in the right order by specifying them explicitly:
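For example, using the column order of the original values data frame:

```r
party_order <- colnames(values)
molten$names    <- factor(molten$names, levels = party_order)
molten$variable <- factor(molten$variable, levels = rev(party_order))
```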

The second argument to factor() specifies the levels and their order. Since we want to plot the similarities of each party with every other, we’re going to have party names on both x and y axes. By specifying one of the axes to be in reverse order with the rev() function, we make sure our plot looks nice and clean, as we’ll see in the next step: The actual visualization.

Visualization: Tile plot

There’s lots of ways to visualize similarity in data. When dealing with very small datasets like this one, one way to do it is using a heat map in tile format. At least that’s what I did, so that’s what you’re learning today. For each combination of two parties, there’s going to be a tile. The color of the tile shows the level of similarity between them: The more intense the color, the higher the similarity. The code we’re going to use is adapted from this blog post, which is really worth checking out.

First, we’re going to build the basic plot structure:
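A sketch of those three set-up functions – the particular shade of blue is our own pick:

```r
library(ggplot2)

sim <- ggplot(molten, aes(names, variable)) +
  geom_tile(aes(fill = value), colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue")
```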

Remember to load the ggplot2 package before you start plotting. We’re going to specify our x and y axes to be the two factors containing the party names with aes(names, variable). With geom_tile(), we define the basic structure of our plot: a set of tiles. They’re going to be filled according to the Jaccard similarities stored in the column value (aes(fill=value)). Their basic color is defined to be white, but we’ll create a gradient of blues with scale_fill_gradient(). Try different color schemes if you like.

With these three basic set-up functions, you’re going to end up with something like this if you take a look at sim:

[Image: sim1 – the basic tile plot]

Not too bad, right? Notice how the diagonal of the tile matrix has the darkest possible blue. This makes sense, since those are the tiles comparing each party to itself. The lighter the color, the lower the similarity between the parties.

But this plot doesn’t look as pretty as we’d like it to yet. The labels are too small, the axis labels aren’t necessary, the signature grey ggplot background isn’t visually appealing in this case and the legend doesn’t look as nice as it could.

Thankfully, ggplot2 lets us edit all of that. Add these settings to your plot with the + operator and see what they do:
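A sketch of those additions – the base size and the rotation angle are our own choices:

```r
sim <- sim +
  theme_light(base_size = 16) +                 # bigger base text size
  labs(x = "", y = "") +                        # no axis labels
  scale_x_discrete(expand = c(0, 0)) +          # no padding between axes and tiles
  scale_y_discrete(expand = c(0, 0)) +
  guides(fill = guide_legend(title = NULL)) +   # drop the legend title
  theme(axis.ticks = element_blank(),
        axis.text.x = element_text(angle = 330, hjust = 0))   # rotate x axis text

sim
```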

theme_light() is a standard theme with a clean look to it that fits our needs for this plot. The base_size argument lets us modify the text size of every text element in our plot. The default is 12px, but we want something a bit bigger for our plot. We don’t need any axis labels, so we’ll just pass the labs() function two empty strings.

The expand argument in the next two functions adds some space between the axes and our tiles, which we don’t want in this case. We’re going to set the argument to zero to make our plot look even cleaner. Also, we’re going to delete the legend title in the guides() function and remove the axis ticks with theme(). The text on the x axis looks a bit packed right now, so we’re going to rotate it a bit to give it more space. If everything worked out, your finished plot should look like this:

[Image: sim2 – the polished tile plot]

That’s better, isn’t it? Play around with the settings a little if you like. Maybe change the text size, the legend title or the rotation angle of the x axis text.

Anyway, though: You did it! Yay! This is, of course, only one way to visualize similarities. I’m sure there’s lots of other cool alternatives. If you find your own, leave a link in the comments, we’d love to hear about it. Until then: Experiment a little with similarity measures and ggplot options. See you in our next tutorial, our next meeting or on slack if you want to keep up with all of the hot Journocode gossip. Have fun!

Part 1 | Code

Similarity and distance in data: Part 1


Part 2

In your work, you might encounter a situation where you want to analyze how similar your data points are to each other. Depending on the structure of your data though, “similar” may mean very different things. For example, if you’re working with records containing real-valued vectors, the notion of similarity has to be different than, say, for character strings or even whole documents. That’s why there’s a small collection of similarity measures to choose from, each tailored to different types of data and different purposes.

Before we get to know some of them, though, let’s think about what we’d expect such a measure to do. That’s easily done: if two objects are similar, the measure should be high (maximal for two perfectly similar objects). If they’re dissimilar, the value of the similarity measure should be low, either converging to zero or to a negative number. We can, of course, set other expectations, but this is the bare minimum any measure of similarity should satisfy.

The more distant, the less similar

Because of these properties, similarity measures are often obtained by simply using the inverse of a distance metric. The intuition behind this is that the further apart two objects are, the more dissimilar they are and the bigger the “distance” between them is. The more similar the objects are, the closer they are and the smaller the distance between them is. This is why, in this tutorial, we’ll take a look at different ways to measure the elusive concept of a “distance” between two points of data.

Distance measures should have a few specific properties. They might sound a little math-y, but we’ll concentrate on the relatively straightforward concepts behind them:

d(x,y) \geq 0
The distance of two objects x and y can’t be less than zero.

d(x,y) = 0 \iff x = y
Two perfectly similar objects have distance zero.

d(x,y) = d(y,x)
The distance between x and y is the same as between y and x — it doesn’t matter which way you go.

d(x,z) \leq d(x,y) + d(y,z)
If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. This is called the triangle inequality.

Now that we got that out of the way, let’s look at a few distance measures. Again, if it sounds too mathematical, just take a deep breath and focus on the concepts. Or just skip the math altogether and look at how to implement and visualize distance measures in R, which we’ll focus on in the second part of this tutorial.

Euclidean or Non-Euclidean?

There are two major classes of distance measures we can distinguish: Euclidean ones and Non-Euclidean ones. You should choose the appropriate one according to whether or not your data can be represented as points in a Euclidean space. A Euclidean space is any space that has some real-valued number of dimensions where points can be located. Your common two-dimensional or three-dimensional coordinate systems are examples for such spaces.

The important thing is that it has to be possible to define an average over the data points for it to be a Euclidean space. So if you’re working with vectors that have real-valued components you can compute an average over, then voilà, you’re working in a Euclidean space.

We’re going to look more closely at a few distance measures, Euclidean ones as well as Non-Euclidean ones:

Euclidean distance

This is pretty much the most common distance measure. It’s so common, in fact, that it’s often called the Euclidean distance, even though there are many Euclidean distance measures, as we just learned. It’s defined as

\sqrt{\sum\limits_{i=1}^n (x_i - y_i)^2}

This Euclidean distance adds up the squared differences between the corresponding coordinates of the two data points and takes the square root of the result. Remember the Pythagorean theorem? If you look closely, the Euclidean distance is just that theorem solved for the hypotenuse — which is, in this case, the distance between x and y. The Euclidean distance is pretty solid: it’s bigger for larger distances, and smaller for closer data points. It can get arbitrarily large and is only zero if the data points are exactly the same. That’s fine though. If you take a look at the requirements we set for a distance function, that’s exactly what we want.

Manhattan distance

\sum\limits_{i=1}^n |x_i - y_i|

Also known as city block distance, taxicab metric or snake distance, this is definitely the distance measure with the coolest names. Incidentally, they’re also pretty descriptive: the Manhattan distance is the shortest distance a car would have to drive in a city block structure to get from x to y. Since it sums the absolute differences in each dimension, the Manhattan distance will always be bigger than or equal to the Euclidean distance, which we can imagine as the straight-line distance between the two points.

Maximum distance

The maximum distance looks at the distance of two points in each dimension and selects the biggest one. This one is pretty straightforward, but we can express it as a fancy formula anyway:

\max_{i}(|x_i - y_i|)

L-Norm / Minkowski distance

The L-Norm is the generalized version of the aforementioned distance measures. It is defined as

(\sum\limits_{i=1}^n |x_i - y_i|^p)^{\frac{1}{p}}

If p is equal to 2, we get the Euclidean distance, which is why it’s also called the L2-Norm. p = 1 returns the Manhattan distance or L1-Norm, and p = \infty equals the maximum distance.

To sum up Euclidean distance measures, let’s take a look at how they work in a simple two-dimensional space. The maximum distance is equal to the biggest distance in any dimension. In this case, that’s the difference between the x values of points p and q, which is 8. The Manhattan distance sums up the distances in each dimension, so it’s 8 + 3 = 11 in this case.

[Image: dist – points p and q in a two-dimensional coordinate system]

What would the Euclidean distance, symbolized by the orange line, be? Visualized like this, it’s pretty obvious how we can use the Pythagorean formula to get the result:

d(p,q)^2 = |p_x - q_x|^2 + |p_y - q_y|^2 = 8^2 + 3^2

d(p,q) = \sqrt{8^2 + 3^2} = \sqrt{73} \approx 8.5

Amazing what can be done with a little trigonometry, right? Take a deep breath, because there’s more! Let’s look at some Non-Euclidean distance measures to make sure we can satisfy all our similarity measuring needs.

Cosine distance and similarity

The Cosine distance is defined by the angle between two vectors. As we know from basic linear algebra, the dot product of two vectors is defined by

x \cdot y = \|x\| \|y\| \cos{\theta}

where \theta is the angle between the two vectors. The smaller the angle, the closer the cosine is to 1; the bigger the angle, the closer it is to -1. If you take a look at what we expected from a similarity measure, the cosine meets our demands rather well. After all, if the angle between two vectors is very small, that means they’re very close together, and therefore more similar. So we’ll just solve the above equation for the cosine and define the cosine similarity to be equal to

\cos{\theta} = \frac{x \cdot y}{\|x\| \|y\|}

If we need to construct a distance measure from here, we can just take the inverse, as we learned before. So the cosine distance is defined as

1 - \cos{\theta}

Since we’re talking about vectors, it might be easy to assume this is also a Euclidean distance measure — and that may be right. If the vectors in question have real-valued components, the cosine distance is Euclidean. But if the vectors have to have, say, integer components, we can’t compute an average over the points without possibly getting a non-integer result. Also, the cosine distance as such doesn’t satisfy the triangle inequality unless we alter it a bit. The cosine similarity, though, is a nice and efficient way to determine similarity in all kinds of multi-dimensional, numeric data.

Jaccard distance and similarity

Like with the cosine distance and similarity, the Jaccard distance is defined as one minus the Jaccard similarity. The Jaccard similarity uses a different approach than the measures we’ve seen so far. To compare two objects, it counts the elements they have in common (the intersection) and divides that by the number of elements the two objects have in total (the union). Written out as a formula, that definition looks like this:

\frac{|X \cap Y|}{|X \cup Y|}

\cap is the mathematical sign for intersection, \cup means union. With this definition, the similarity is only equal to one if all elements are the same and only becomes zero if all elements are different. Perfect for a similarity measure, but the wrong way around for a distance measure. This is easily solved by defining the Jaccard distance to be

1 - \frac{|X \cap Y|}{|X \cup Y|}

As an example, let’s compare the two sentences “Yesterday, the warm weather was perfect for my cat” and “My cat liked the warm weather yesterday”. Let’s call them X and Y. We could, of course, have used numbers or a mix of both as well, the Jaccard similarity doesn’t care.

The sentences have 6 words in common and 10 unique words in total. So the Jaccard similarity between them is 6/10 = 0.6 = 60 %. Their Jaccard distance is 1 – 0.6 = 0.4 = 40%.

A nice way to represent objects you want to compute the Jaccard similarity of is in the form of a Boolean matrix, a matrix with only ones and zeroes. The columns of our matrix symbolize the objects we want to find the similarity of and our rows are the unique elements of both objects — in this case, the words. One means the word is present in the object, zero means it isn’t. To compute the Jaccard similarity over the columns, all we have to do is count the rows where both objects have ones (6) and divide it by the total number of rows (10).

word        X   Y
yesterday   1   1
the         1   1
warm        1   1
weather     1   1
was         1   0
perfect     1   0
for         1   0
my          1   1
cat         1   1
liked       0   1

 

We don’t have to stop at single sentences, though. The Jaccard similarity is an efficient way to compute similarity over entire documents — a lot of documents if necessary. Our corresponding Boolean matrix will get very big, of course, but since the formula is relatively simple, it scales rather well to large datasets.

Edit distance

Lastly, let’s think about how to measure the similarity of two character strings. One way to do that is the edit distance. The edit distance is simply the minimum number of inserts and deletes needed to get from one string to the other.

Let’s say we have the words “knock” and “flocks”. To get from one to the other, we have to delete two letters (k, n) and insert three (f, l, s):

knock → nock → ock → fock → flock → flocks

So the edit distance between them is five. The edit distance is a proper distance measure since it satisfies all four requirements we set at the beginning of this lesson.

  • The distance of two objects x and y can’t be less than zero. There’s no way to do a negative number of edits, so that’s true.
  • Two perfectly similar objects have distance zero. We don’t need any edits to transform a word into itself.
  • The distance between x and y is the same as between y and x. Every insert into one word is equal to a delete from the other, so the paths you take are always inverse and have the same number of steps.
  • If you take a “detour” via y on your way from x to z, your path can’t be shorter than if you had taken the direct route. Changing from word x to word y before you change to z is one way to go from x to z. The direct way might be shorter, but it can never be longer than the detour.

 

Congratulations! You made it through all of the math and learned a lot about some ways to measure distance and similarity in your data. In Part two of this lesson, we’re going to leave the theory behind us. We’ll take a look at how to actually compute these distance measures in R and think about how to visualize similarity in data.

Part 2

Project: Visualizing WhatsApp chat logs – Part 1: Cleaning the data


Part 2 | Code

A few weeks ago, we discovered it’s possible to export WhatsApp conversation logs as a .txt file. It’s quite an interesting piece of data, so we figured, why not analyze it? So here we go: A code-along R project in two steps.

  1. Cleaning the data: That’s what this part is for. We’ll get the .txt file ready to be properly evaluated.
  2. Visualizing the data: That’s what we’ll talk about in part two — creating some interesting visuals for our chat logs.

You can find the entire code for the project on our GitHub page. In this part, we’ll walk you through the process of cleaning a dataset step by step. This is what the final product of part two will look like (of course, yours could be something else entirely – there’s heaps of great material in those logs):

 


Getting the data

First things first: We’ll need some data to work with. WhatsApp has a built-in function for exporting chat logs via e-mail. Just choose any chat you want to analyze. Group chats are especially interesting for this particular visualization, since we’ll take a look at the number and timing of messages. In case you already know how to get the logs, you can just skip this step.

How to get the chat logs

Depending on what kind of phone you have, this might work a little differently. But this is how it works on Android:

While in a chat, tap the three dots in the upper right corner and select “More”, the last option.

[Screenshot: the chat menu with the three dots and the “More” option]

Then, select “E-mail chat”, the second option. It will let you choose an address to send to and voilà, there’s your text file.

[Screenshot: the “E-mail chat” option]

Alternatively, you can also go via the WhatsApp main page. Tap the three dots and select “Settings” > “Chat history” > “Send chat history”. Then, just select the chat you want to export.

Our to-do-list

The .txt file you’ll get is, well, not as difficult to handle as it could be, but it has a few quirks. The basic structure is pretty easy. Every row follows this basic pattern:

<time stamp> – <name>: <text>

Looks alright, doesn’t it? But there are a few issues we’ll run into, especially if we don’t want to analyze just the message count but the content as well. Some of them are easy to correct, like the dash between the time stamp and the name; some are more complicated:

  • The .txt file isn’t formatted like a .csv or a proper table. That is, not every row has the same amount of elements
  • Some rows don’t have a time stamp, but immediately start with text if a message has multiple paragraphs.
  • Some names have one word, some have two, depending on whether they’re saved with or without surname.
  • The time stamp isn’t formatted to be evaluated and spans multiple columns.
  • Names have a colon at the end. That doesn’t look nice in graphics.

Converting and importing the file

Before we can start cleaning in R, we have to tackle the first issue on our to-do-list. If you try to read the text file into R right away, you’ll get an error:

We’ll have to convert it into a proper table structure. There’s multiple ways to do that. We used Excel to convert the file to .csv. If you already have your own favourite way to convert the file, you can do it your way.

Converting text to .csv with Excel

First, obviously, open Excel. Open the .txt file. Remember to switch from “Excel files” to “All file types” in the drop-down menu so your text file is visible. You should be led to the text import wizard.

In the first step, set the file type to “Delimited”.

[Screenshot: text import wizard, step 1]

Then, separate by spaces (remember to un-check “Tab”).

[Screenshot: text import wizard, step 2]

In the last step, you can just leave the data format at “General” and click “Finish”.

[Screenshot: text import wizard, step 3]

The resulting dataset should look somewhat like this:

[Image: csv – the imported chat as a spreadsheet]

Just save this as a .csv file and you should be good to go.

Now that we’ve got a proper .csv file, we can start cleaning it in R. First, read in the file and save it as a variable. For more info on data import in R, check out our previous tutorial.
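A sketch of the import – the file name and separator are assumptions, so adjust them to your export:

```r
chat <- read.csv("whatsapp_chat.csv", sep = ";", header = FALSE,
                 stringsAsFactors = FALSE, fill = TRUE)
```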

Check that you specify the right separator. It’s probably a comma or a semicolon. Just open your file in a text editor and find out.

Regular expressions

To clean up the file, we’ll need to work with regular expressions. They’re used for finding and manipulating patterns in text and have their own syntax. Regex syntax is a little hard to wrap your head around, but there’s lots of reference sheets and expression testers like regexr online that help you translate. In R, you’ll use the grep() function family for text matching and replacement. Let’s try it out. As mentioned on our to-do-list, some rows don’t start with a timestamp. Visually, they’re easy to spot, because they don’t have a number at the beginning.

In regex, the pattern “character string without a digit at the beginning” is translated as "^\D": "^" matches the beginning of a string, "\D" means “anything except a digit”.
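Applied to our chat data frame, that looks like this:

```r
grep("^\\D", chat[, 1])   # indices of rows whose first column doesn't start with a digit
```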

The call to grep("^\\D", chat[,1]) tells R to look in the first column of our chat for rows that fit the regular expression "^\D". The second backslash is an escape character that’s only necessary in R, because the backslash serves other purposes there as well.

We’re not going to get into the details of regular expressions here, that’s a post for another time. Feel free to look them up on your own, though. If you want to analyze text files in your projects, it’s pretty certain you’ll encounter them anyway.

Shifting stampless rows

We’re going to shift the rows without time stamp a bit to the right, so we can copy down the time stamp and name of the sender. First, we’re going to make room at the end of the data frame, in case the stampless rows also happen to be very long messages:
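One way to create that extra room – the number of new columns (five) and their names are assumptions:

```r
extra_cols <- paste0("V", ncol(chat) + 1:5)   # names for five additional columns
chat[extra_cols] <- NA_character_
```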

Then, we’ll move the first five elements of each of those rows to the end of the line, leaving the beginning of the line blank for the time being. We’ll use a for loop that goes through every row without a time stamp and moves its first five elements to the end of the line.
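A sketch of one way to do this, reusing the regular expression from above; here, “the end of the line” simply means the five empty columns we just created:

```r
stampless <- grep("^\\D", chat[, 1])   # rows that don't start with a time stamp
for (i in stampless) {
  n <- ncol(chat)
  chat[i, (n - 4):n] <- chat[i, 1:5]   # move the first five cells to the end
  chat[i, 1:5] <- NA                   # leave room for time stamp and name
}
```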

We’ll write a tutorial on loops and conditional statements soon, but in the meantime, check out this short explanation if you want to know more about loops.

We could copy down the time stamp and name right now, but since there’s still a few issues with the name columns, we’ll sort out those first. Before we do that, though, we’ll just quickly delete any entirely empty rows that might have snuck in. We’ll use the apply() function for that. It’s basically like a loop, just much faster and easier to handle in most cases. The R package swirl contains in-console tutorials and has a great one for the apply() function family as well.
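A minimal sketch using apply() over the rows:

```r
empty <- apply(chat, 1, function(row) all(is.na(row) | row == ""))
chat <- chat[!empty, ]
```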

Cleaning the surname column

Now, some contacts might be saved by first name, some by first and last name, right? So the column containing the surnames also sometimes contains a bit of text. The difference is, the text bit probably won’t end with a colon, the surnames definitely will. We can use regular expressions to filter the surname column accordingly.
Also, some messages aren’t actually chat content, but rather activity notifications like adding new members to a group. They’ll say something like “Marie Timcke added you”. Good thing is: Those messages don’t contain colons either, so we can use the same regular expression for the surnames and the notifications.

The regex we’ll use is “.+:$”. It matches any pattern with one or more characters (“.” for any character, “+” for “one or more”) followed by a literal colon (“:”) and then the end of the line (“$”).
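The original snippet isn’t preserved here, so this is a rough sketch of what the two chunks could look like; the column numbers and the exact shifting logic are assumptions:

```r
# first chunk: keep rows where column 4 or 5 ends in a colon (a name),
# plus the stampless rows we shifted earlier (their first column is NA)
chat <- chat[grepl(".+:$", chat[, 5]) | grepl(".+:$", chat[, 4]) | is.na(chat[, 1]), ]

# second chunk: where column 5 holds message text instead of a surname,
# park that text at the end of the line, as we did with the stampless rows
textrows <- which(!grepl(".+:$", chat[, 5]) & !is.na(chat[, 1]) & chat[, 5] != "")
for (i in textrows) {
  n <- ncol(chat)
  chat[i, n] <- chat[i, 5]
  chat[i, 5] <- NA
}
```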

The first part reduces the chat data frame to all rows that either have a colon at the end of column 5 (grepl(".+:$", chat[,5])) or ("|") at the end of column 4 (grepl(".+:$", chat[,4])). Of course, the stampless rows we just created are kept as well (is.na(chat[,1])). This effectively removes the notifications.

In the second chunk, we move the text parts in the surname column to the end of the line, the same way we shifted the rows without time stamp. By now, our file looks something like this (check yours with View(chat)):

[Image: chat2 – the chat data frame after filtering and shifting]

The bigger part of our work is done. We’ll just format the time stamp in such a way that R can evaluate it and make a few cosmetic adjustments.

Converting the time stamp to date format

For R to convert the first two columns into a format it can work with, we’ll have to help it a bit. First, we’ll copy down time stamps and name from the previous row to all the rows we shifted before.
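A simple loop does the trick. Because it runs from top to bottom, consecutive shifted rows inherit the values step by step (a simplified sketch):

    # fill the blanked-out first cells of shifted rows with the
    # time stamp and sender name from the row above
    for (i in which(is.na(chat[, 1]))) {
      chat[i, 1:5] <- chat[i - 1, 1:5]
    }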

Then, we’ll clean up the first few columns a bit, deleting the column with the dash, merging the time stamp so it’s all in one place and naming the first few columns.
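Assuming the export was read in whitespace-separated, so that the date, the time, the dash and the name parts sit in the first five columns, this step might look roughly like this (the column positions are an assumption about your export):

    chat[, 1] <- paste(chat[, 1], chat[, 2])           # merge date and time into one cell
    chat <- chat[, -c(2, 3)]                           # drop the separate time column and the dash
    names(chat)[1:3] <- c("time", "name", "surname")   # name the first three columns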

Now, we can easily convert the first column into a date format. R has a few different classes for date and time objects. We’ll use the strptime() function, which produces an object of class “POSIXlt”. If you want to know more about dates and times in R, again, the swirl lesson on that topic is great.

We need to tell strptime in which format the date is stored. In our case, it’s
“<day>.<month>.<full year>, <hours>:<minutes>”. In strptime() language, this is written as “%d.%m.%Y, %H:%M”.
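The conversion itself is then a one-liner (assuming the merged time stamp column is called “time”, as above):

    # convert the merged time stamp into a POSIXlt date-time object
    chat$time <- strptime(chat$time, format = "%d.%m.%Y, %H:%M")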

One last cosmetic edit: The names still have colons at the end. This issue is easily solved with — you guessed it — regular expressions! We can use the gsub() function to search and replace patterns. We’ll use it on the “name” and “surname” column by replacing every colon at the end of a line with nothing, like this:
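For example, assuming the columns are named “name” and “surname” as in the step above:

    # replace a colon at the end of the string with nothing
    chat$name    <- gsub(":$", "", chat$name)
    chat$surname <- gsub(":$", "", chat$surname)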

Congratulations, you’ve cleaned up the entire dataset! It should now look like this — no empty lines, no colons or text in the name columns and a wonderfully formatted time stamp.

[Screenshot: the cleaned chat data frame with formatted time stamps and tidy name columns]

Saving the data

Now, the only thing left to do is to save our beautiful, sparkly clean dataset to a new file. If you want to work with the data in a program other than R, you can use, for example, the write.table() or write.csv() function to export your data frame. Since we want to continue working in R for our visualizations, we’ll go with save() for now. It creates an .RData file that can be read back into R easily with the load() function.
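For example (the file name is just a placeholder, use whatever fits your project):

    # save the cleaned data frame as an .RData file ...
    save(chat, file = "whatsapp_chat_clean.RData")
    # ... and read it back in later with:
    # load("whatsapp_chat_clean.RData")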

There you go, all done! Give yourself a big pat on the back, because cleaning data is hard.

If you want to continue right away, check out part two of our WhatsApp project where we visualize the data we just cleaned. If you need help with the cleaning script or have suggestions on how to improve it, write us an e-mail or join our slack team. Our help and discussion channels are open for everyone!

 

Part 2 | Code

 

{Credits for the awesome featured image go to Phil Ninh}

R crash course: Writing functions

R crash course: Writing functions

As you know by now, R is all about functions. In the event that there isn’t one for the exact thing you want to do, you can even write your own! Writing your own functions is a very useful way to automate your work. Once defined, it’s easy to call new functions as often as you need. It’s a good habit to get into when programming with R — and with lots of other languages as well.

To define a function, you use another function, fittingly called function(). Function names follow pretty much the same rules as variable names, so you can call your function anything that would also be acceptable as a variable name.

Let’s try an easy example to see how function definitions work:
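Here’s a sketch of such a minimal function (the name sayit is made up; call yours whatever you like):

    sayit <- function(x) {
      print(x)   # print whatever was passed as x to the console
    }
    sayit("Hello Journocode!")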

A function of questionable usefulness: it essentially does the same thing as print(). It takes an argument called x and prints whatever you pass as x to the console.

Theoretically, you can make your function take as many arguments as you want. Just write them in the parentheses of function(). You can name the arguments however you want, too. Also, your functions will often require more than one line. In that case, just put whatever you want your function to do in curly brackets {}. It will look somewhat like this:
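Based on the walkthrough below, the function in question is squareadd(), which squares its first argument and adds the second:

    squareadd <- function(x, y) {
      x^2 + y   # square the first argument, then add the second
    }
    squareadd(3, 2)   # returns 11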

Let’s mess with that one a bit! Run the following code line by line and try to guess what went wrong.
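A reconstruction of those calls, based on the explanations below:

    squareadd(3)          # error: argument "y" is missing, with no default
    squareadd(3, "two")   # error: non-numeric argument to binary operator
    squareadd(3, two)     # error: object 'two' not found
    two <- 2
    squareadd(3, two)     # works now and returns 11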

Possible errors while writing functions

Errors aren’t just a necessary evil in coding. By making mistakes, you get to know your programming language better and find out what works — and, of course, what doesn’t work. Let’s go through the errors one by one:

  • squareadd(3): You passed the function only one argument (3, which was attributed to the “x” argument) to work with when it expected two values, one for x and one for y.
  • squareadd(3,”two”): Now you passed the function two arguments, but one’s not a number. It’s a character, since it has quotes around it. But R can’t execute the function with a character. After all, what is 3^2 + “two” supposed to mean?
  • squareadd(3,two): No quotes this time in the second argument. Because the “y” argument is not in quotes and not a number either, R assumes it’s a variable or some other object. Problem is: R can’t find an object called two anywhere.
  • After you define the object two to be equal to 2, though, R does find a matching object to use as the argument. So this time around, squareadd(3,two) should return the number 11.

After we change the function definition to include only the “x” argument, the errors we get change a little. Note that there’s still a “y” in the function body.
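The changed definition and the calls discussed below might look like this (again a reconstruction):

    squareadd2 <- function(x) {
      x^2 + y   # y is no longer an argument, but is still used in the body
    }
    squareadd2(3, 2)   # error: unused argument (2)
    squareadd2(3)      # error: object 'y' not found
    y <- 4
    squareadd2(3)      # works now and returns 13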

  • squareadd2(3,2): Other way around this time. Your function expected only one argument, but got two.
  • squareadd2(3): You passed the correct number of arguments, but R can’t find anything to use for the y in the function body, neither inside the function nor in the global environment.
  • This is why, after you defined y to be equal to four in the global environment, squareadd2(3) works fine and will return 13 (since 3^2 + 4 = 13).

Scoping Rules in R

Some of the errors you’ll get, such as those in the last two lines, are due to something called the scoping rules of R. These rules define how R looks for the variables it needs to execute a function. It does that by looking through different environments (sub-spaces of your working environment that have their own variables and object definitions) in a certain order. There are two basic types of scoping:

  • Lexical scoping: Looking for missing objects in the environment where the function was defined.
  • Dynamic scoping: Looking for missing objects in the environment where the function was called.

R uses lexical scoping. So if it doesn’t find the stuff it needs within the function (which, incidentally, has its own little environment), it goes on to look in the environment where the function was defined. In many cases, this will be the global environment, which is what you’re coding in if you’re not inside a specific function. If it doesn’t find what it needs there either, it will continue down the search list of environments. You can take a look at the list by typing search() into your console.

Let’s take a quick look at the difference between dynamic and lexical scoping. Look at the following code and try to guess its output. Execute it in RStudio and see if you’re right.
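Here’s one possible version of such a snippet, consistent with the explanation that follows (the exact wording of the second message is made up):

    a <- TRUE
    istrue <- function() {
      if (a) {
        print("that's right!")
      } else {
        print("nope!")
      }
    }
    check <- function() {
      a <- FALSE   # only changes a inside check()'s own environment
      istrue()
    }
    check()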

The output depends on the scoping rules your programming language uses. As you just learned, R uses lexical scoping. So if you call check(), a is set to FALSE only in the function environment of check(). But since istrue() was defined in the global environment, where a is still equal to TRUE, it will print “that’s right!” to your console. If R used dynamic scoping, it would go with a <- FALSE, since that is what holds in the environment where istrue() was called.

You don’t have to worry too much about the specifics of scoping rules and environments when starting to code, but it’s a useful thing to keep in mind. There’s lots of good info on scoping, searching and environments in R on the web, as well as more tutorials on writing your own functions. We’ll be putting together some resources on our website soon, so stay tuned for that.

But for now — well done! That was a lot of new info to process. print() yourself a “Good job!” to the console before you go on and practice writing some more functions. We’re looking forward to your coding experiences!

Bonus round: Can you count how often the word “function” appears in this text? Guess right and win a complimentary function congratulating you on your newly acquired coding skills.

 

{Credits for the awesome featured image go to Phil Ninh}

R crash course: Workspace, packages and data import

R crash course: Workspace, packages and data import

In this crash course section, we’ll talk about importing all sorts of data into R and installing fancy new packages. We’ll also get to know our way around the workspace.

Your workspace in R is like the desk you work at. It’s where all the data, defined variables and other objects you’re currently working with are stored. Like a desk, you might want to clean it every once in a while and throw out stuff you don’t need any more. There are a few useful commands to help you do that. Take a look and try them out:
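For example:

    ls()                  # list all objects currently in your workspace
    some_number <- 42     # create a throwaway object ...
    rm(some_number)       # ... and remove it again
    rm(list = ls())       # remove everything and start with a clean desk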