Data Munging with R
Dr. Jonathan Carroll
  • MEAP began June 2017
  • Publication in February 2018 (estimated)
  • ISBN 9781617294594
  • 375 pages (estimated)
  • printed in black & white

Data Munging with R shows you how to take raw data and transform it for use in computations, tables, graphs, and more. Whether you already have some programming experience or you're just a spreadsheet whiz looking for a more powerful data manipulation tool, this book will help you get started. You'll discover the ins and outs of using the data-oriented R programming language and its many task-specific packages. With dozens of practical examples to follow, you'll learn to fill in missing values, make predictions, and visualize data as graphs. By the time you're done, you'll be a master munger, with a robust, reproducible workflow and the skills to use data to strengthen your conclusions!

Table of Contents

1. Introducing Data and the R Language

1.1. Data — What, Where, How?

1.1.1. What is Data?

1.1.2. Seeing the World as Data Sources

1.1.3. What You Can Do With Well-Handled Data

1.1.4. Data as an Asset

1.1.5. Reproducible Research and Version Control

1.2. Introducing R

1.2.1. The Origins of R

1.2.2. What It Is and What It Isn’t

1.3. How R Works

1.4. Introducing RStudio

1.4.1. Working with R within RStudio

1.4.2. Inbuilt Packages (Data and Functions)

1.5. In-built Documentation

1.5.1. Vignettes

1.6. Try It Yourself

1.7. Summary

2. Getting to Know R Data Types

2.1. Types of Data

2.1.1. Numbers

2.1.2. Text (Strings)

2.1.3. Categories (Factors)

2.1.4. Dates and Times

2.1.5. Logicals

2.1.6. Missing Values

2.2. Storing Values (Assigning)

2.2.1. Naming Data (Variables)

2.2.2. Unchanging Data

2.2.3. The Assignment Operators (<- vs =)

2.3. Specifying the Data Type

2.4. Telling R to Ignore Something

2.5. Summary

3. I Want To Make New Data Values

3.1. Basic Mathematics

3.2. Operator Precedence

3.3. String Concatenation (Joining)

3.4. Comparisons

3.5. Automatic Conversion (Coercion)

3.6. Try It Yourself

3.7. Summary

4. Understanding the Tools We’ll Use — Functions

4.1. Functions

4.1.1. Under the Hood

4.1.2. Function Template

4.1.3. Arguments

4.1.4. Multiple Arguments

4.1.5. Default Arguments

4.1.6. Argument Name Matching

4.1.7. Partial Matching

4.1.8. Scope

4.2. Packages

4.2.1. How Does R (Not?) Know About This Function?

4.3. Messages, Warnings, and Errors, Oh My!

4.3.1. How To Diagnose Them

4.4. Testing

4.5. Project 4.1 — Generalise a Function

4.6. Try It Yourself

4.7. Summary

5. I Want To Combine Data Values

5.1. Simple Collections

5.1.1. Coercion

5.1.2. Missing Values

5.1.3. Attributes

5.1.4. Names

5.2. Sequences

5.2.1. Vector Math Operations

5.3. Matrices

5.3.1. Indexing

5.4. Lists

5.5. data.frames

5.6. Classes

5.7. tibble

5.7.1. Structures as Function Arguments

5.8. Try It Yourself

5.9. Summary

6. I Want To Select Certain Data Values

6.1. Text Processing

6.1.1. Text Matching

6.1.2. Substrings

6.1.3. Text Substitutions

6.1.4. Regular Expressions

6.2. Selecting Components from Structures

6.2.1. Vectors

6.2.2. Lists

6.2.3. Matrices

6.3. Replacing Values

6.4. data.frames and dplyr

6.4.1. Verbs

6.4.2. Non-Standard Evaluation

6.4.3. Pipes

6.4.4. Subsetting data.frame The Hard Way

6.5. Replacing NA

6.6. Selecting Conditionally

6.7. Summarising Values

6.8. A Worked Example; Excel vs R

6.9. Try It Yourself

6.10. Summary

7. I Want To Do Something With Lots of Data

7.1. Tidy Data Principles

7.1.1. The Working Directory

7.1.2. Stored Data Formats

7.1.3. Reading Data into R

7.1.4. Scraping Data

7.1.5. Inspecting Data

7.1.6. I Have Odd Values In My Data (Sentinel Values)

7.1.7. Converting to Tidy Data

7.2. Merging Data

7.3. Writing Data From R

7.4. Try It Yourself

7.5. Summary

8. I Want To Do Something Conditionally (Control Structures)

8.1. Looping

8.1.1. Vectorisation

8.1.2. Tidy Repetition (Looping with purrr)

8.1.3. for loops

8.1.4. while loops

8.2. Conditional Evaluation

8.2.1. if conditions

8.2.2. ifelse conditions

8.3. Common Mistakes

8.4. Try It Yourself

8.5. Summary

9. I Want To Visualise My Data (Plotting)

9.1. Data Preparation

9.1.1. Tidy Data, Revisited

9.1.2. Importance of Data Types

9.2. ggplot2

9.2.1. General Construction

9.2.2. Adding Points

9.2.3. Style Aesthetics

9.2.4. Adding Lines

9.2.5. Adding Bars

9.2.6. Other Types of Plots

9.2.7. Scales

9.2.8. Facetting

9.2.9. Additional Options

9.3. Plots as Objects

9.4. Base R graphics

9.5. Saving plots

9.6. Try It Yourself

9.7. Summary

10. I Want To Do More With My Data (Extensions)

10.1. Writing Your Own Packages

10.1.1. Creating a Minimal Package

10.1.2. Documentation

10.2. Analysing Your Package

10.2.1. Testing

10.2.2. Profiling

10.3. What To Do Next?

10.3.1. Regression

10.3.2. Clustering

10.3.3. Working With Maps

10.3.4. Interacting With APIs

10.3.5. Sharing Your Package

10.4. More Resources

10.5. Summary

Appendixes

Appendix A: Installing R

Appendix B: Installing RStudio

About the Technology

Data munging, the manipulation of raw data, is a cornerstone of data science. Munging techniques include cleaning, sorting, parsing, filtering, and pretty much anything else you need to make data truly useful. The R language, with its intuitive RStudio environment, is the perfect data munging tool. R provides a rich ecosystem of community-driven packages and utilities for finance and accounting, marketing, web scraping, and all manner of data science tasks. And getting started with R is so easy, even managers have been known to use it for ad hoc data analysis!

What's inside

  • Learning to program
  • Critical R structures and operators
  • Handling R packages
  • Tidying and refining your data
  • Plotting your data

About the reader

If you have beginner programming skills or you're comfortable with writing spreadsheet formulas, you have everything you need to get the most out of this book.

About the author

Dr. Jonathan Carroll holds a PhD in theoretical astrophysics from the University of Adelaide and currently works in statistical modelling. He contributes packages to R, frequently answers questions on Stack Overflow, and is an avid science communicator.


Manning Early Access Program (MEAP): Read chapters as they are written, get the finished eBook as soon as it's ready, and receive the pBook long before it's in bookstores.
MEAP combo $49.99 pBook + eBook + liveBook
MEAP eBook $39.99 pdf + ePub + kindle + liveBook