Exploring the Data Jungle
Finding, Preparing, and Using Real-World Data
Brian Godsey
  • July 2017
  • ISBN 9781617295065
  • 101 pages

Some people like to believe that all data is ready to be used immediately. Not so! Data in the wild is hard to track and harder to understand, and the first job of data scientists is to identify and prepare data so it can be used. To find your way through the data jungle successfully, you need the right perspective and guidance. (There's no point hacking at overgrowth with a spoon, after all!) Identify and prepare your data well, and you'll be well set to create insight from chaos and discover the important analytic patterns that set your business on the right track.

Exploring the Data Jungle: Finding, Preparing, and Using Real-World Data is a collection of three hand-picked chapters introducing you to the often-overlooked art of putting unfamiliar data to good use. Brian Godsey, author of Think Like a Data Scientist, has selected these chapters to help you navigate data in the wild and to identify and prepare raw data for analysis, modeling, machine learning, or visualization. As you explore the data jungle, you'll discover real-world examples in Python, R, and other languages suitable for data science.

Table of Contents

Introduction

Data All Around Us: The Virtual Wilderness

3.1 Data as the object of study

3.1.1 The users of computers and the internet became data generators

3.1.2 Data for its own sake

3.1.3 Data scientist as explorer

3.2 Where data might live, and how to interact with it

3.2.1 Flat files

3.2.2 HTML

3.2.3 XML

3.2.4 JSON

3.2.5 Relational databases

3.2.6 Non-relational databases

3.2.7 APIs

3.2.8 Common bad formats

3.2.9 Unusual formats

3.2.10 Deciding which format to use

3.3 Scouting for data

3.3.3 The data you have: is it enough?

3.3.4 Combining data sources

3.3.5 Web scraping

3.3.6 Measuring or collecting things yourself

3.4 Example: microRNA and gene expression

Exercises

Summary

What’s inside

Exploring Data

3.1 Using summary statistics to spot problems

3.1.1 Typical problems revealed by data summaries

3.2 Spotting problems using graphics and visualization

3.2.1 Visually checking distributions for a single variable

3.2.2 Visually checking relationships between two variables

3.3 Summary

What’s inside

Real-world Data

2.1 Getting started: data collection

2.1.1 Which features should be included?

2.1.2 How can we obtain ground truth for the target variable?

2.1.3 How much training data is required?

2.1.4 Is the training set representative enough?

2.2 Preprocessing the data for modeling

2.2.1 Categorical features

2.2.2 Dealing with missing data

2.2.3 Simple feature engineering

2.2.4 Data normalization

2.3 Using data visualization

2.3.1 Mosaic plots

2.3.2 Box plots

2.3.3 Density plots

2.3.4 Scatter plots

2.4 Summary

2.5 Terms from this chapter

What’s inside

Index

About the author

Brian Godsey has worked in software, academia, finance, and defense and has launched several data-centric start-ups.


