Exploring the Data Jungle
Finding, Preparing, and Using Real-World Data
Brian Godsey
  • July 2017
  • ISBN 9781617295065
  • 101 pages

Some people like to believe that all data is ready to be used immediately. Not so! Data in the wild is unkempt and unruly, and it's the job of data scientists to clean up raw data into something that's ready to be used. To manage the data jungle, you need the right perspective and the right tools. (There's no point hacking at overgrowth with a spoon, after all!) Do your work well, and you'll create insight from chaos and discover the analytic patterns to set your business on the right track.

Exploring the Data Jungle: Finding, Preparing, and Using Real-World Data is a collection of three hand-picked chapters introducing you to the often-overlooked art of cleaning data. Brian Godsey, author of Think Like a Data Scientist, has selected these chapters to help you navigate data in the wild, process and prepare raw data for machine learning, and visualize results clearly. As you explore the data jungle, you'll discover real-world examples in Python, R, and other languages suitable for data science.

Table of Contents



Data All Around Us: The Virtual Wilderness


3.1 Data as the object of study

3.1.1 The users of computers and the internet became data generators

3.1.2 Data for its own sake

3.1.3 Data scientist as explorer

3.2 Where data might live, and how to interact with it

3.2.1 Flat files

3.2.2 HTML

3.2.3 XML

3.2.4 JSON

3.2.5 Relational databases

3.2.6 Non-relational databases

3.2.7 APIs

3.2.8 Common bad formats

3.2.9 Unusual formats

3.2.10 Deciding which format to use

3.3 Scouting for data

3.3.3 The data you have: is it enough?

3.3.4 Combining data sources

3.3.5 Web scraping

3.3.6 Measuring or collecting things yourself

3.4 Example: microRNA and gene expression




Exploring Data


3.1 Using summary statistics to spot problems

3.1.1 Typical problems revealed by data summaries

3.2 Spotting problems using graphics and visualization

3.2.1 Visually checking distributions for a single variable

3.2.2 Visually checking relationships between two variables

3.3 Summary


Real-world Data


2.1 Getting started: data collection

2.1.1 Which features should be included?

2.1.2 How can we obtain ground truth for the target variable?

2.1.3 How much training data is required?

2.1.4 Is the training set representative enough?

2.2 Preprocessing the data for modeling

2.2.1 Categorical features

2.2.2 Dealing with missing data

2.2.3 Simple feature engineering

2.2.4 Data normalization

2.3 Using data visualization

2.3.1 Mosaic plots

2.3.2 Box plots

2.3.3 Density plots

2.3.4 Scatter plots

2.4 Summary

2.5 Terms from this chapter



About the author

Brian Godsey has worked in software, academia, finance, and defense and has launched several data-centric start-ups.

eBook $0.00 PDF only
