Exploring the Data Jungle
Finding, Preparing, and Using Real-World Data
Brian Godsey
  • July 2017
  • ISBN 9781617295065
  • 101 pages

Some people like to believe that all data is ready to be used immediately. Not so! Data in the wild is hard to track and harder to understand, and the first job of a data scientist is to identify and prepare data so it can be used. To find your way through the data jungle successfully, you need the right perspective and guidance. (There's no point hacking at overgrowth with a spoon, after all!) Identify and prepare your data well, and you'll be well placed to create insight from chaos and discover important analytic patterns that set your business on the right track.

Exploring the Data Jungle: Finding, Preparing, and Using Real-World Data is a collection of three hand-picked chapters introducing you to the often-overlooked art of putting unfamiliar data to good use. Brian Godsey, author of Think Like a Data Scientist, has selected these chapters to help you navigate data in the wild and identify and prepare raw data for analysis, modeling, machine learning, or visualization. As you explore the data jungle, you'll discover real-world examples in Python, R, and other languages suitable for data science.

Table of Contents



Data All Around Us: The Virtual Wilderness

3.1 Data as the object of study

3.1.1 The users of computers and the internet became data generators

3.1.2 Data for its own sake

3.1.3 Data scientist as explorer

3.2 Where data might live, and how to interact with it

3.2.1 Flat files

3.2.2 HTML

3.2.3 XML

3.2.4 JSON

3.2.5 Relational databases

3.2.6 Non-relational databases

3.2.7 APIs

3.2.8 Common bad formats

3.2.9 Unusual formats

3.2.10 Deciding which format to use

3.3 Scouting for data

3.3.3 The data you have: is it enough?

3.3.4 Combining data sources

3.3.5 Web scraping

3.3.6 Measuring or collecting things yourself

3.4 Example: microRNA and gene expression



Exploring Data

3.1 Using summary statistics to spot problems

3.1.1 Typical problems revealed by data summaries

3.2 Spotting problems using graphics and visualization

3.2.1 Visually checking distributions for a single variable

3.2.2 Visually checking relationships between two variables

3.3 Summary

Real-world Data

2.1 Getting started: data collection

2.1.1 Which features should be included?

2.1.2 How can we obtain ground truth for the target variable?

2.1.3 How much training data is required?

2.1.4 Is the training set representative enough?

2.2 Preprocessing the data for modeling

2.2.1 Categorical features

2.2.2 Dealing with missing data

2.2.3 Simple feature engineering

2.2.4 Data normalization

2.3 Using data visualization

2.3.1 Mosaic plots

2.3.2 Box plots

2.3.3 Density plots

2.3.4 Scatter plots

2.4 Summary

2.5 Terms from this chapter

About the author

Brian Godsey has worked in software, academia, finance, and defense and has launched several data-centric start-ups.
