Think Like a Data Scientist
Tackle the data science process step-by-step
Brian Godsey
  • March 2017
  • ISBN 9781633430273
  • 328 pages
  • printed in black & white

Explains difficult concepts and techniques concisely and approachably.

Jenice Tom, CVS Health

Think Like a Data Scientist presents a step-by-step approach to data science, combining analytic, programming, and business perspectives into easy-to-digest techniques and thought processes for solving real-world data-centric problems.

Table of Contents

Part 1: Preparing and gathering data and knowledge

1. Philosophies of data science

1.1. Data science and this book

1.2. Awareness is valuable

1.3. Developer vs. Data Scientist

1.4. Do I need to be a software developer?

1.5. Do I need to know statistics?

1.6. Priorities: knowledge 1st, technology 2nd, opinions 3rd

1.7. Best practices

1.7.1. Documentation

1.7.2. Code repositories and versioning

1.7.3. Code organization

1.7.4. Ask questions

1.7.5. Stay close to the data

1.8. Reading this book: how I discuss concepts

1.9. Summary

2. Setting goals by asking good questions

2.1. Listening to the customer

2.1.1. Resolving wishes and pragmatism

2.1.2. The customer is probably not a data scientist

2.1.3. Asking very specific questions to uncover fact, not opinions

2.1.4. Suggesting deliverables: guess and check

2.1.5. Iterate your ideas based on knowledge, not wishes

2.2. Ask good questions of the data

2.2.1. Good questions are concrete in their assumptions

2.2.2. Good answers: measurable success without too much cost

2.3. Answering the question using data

2.3.1. Are the data relevant and sufficient?

2.3.2. Has someone done this before?

2.3.3. Figuring out what data and software you could use

2.3.4. Anticipate obstacles to getting everything you want

2.4. Setting goals

2.4.1. What is possible?

2.4.2. What is valuable?

2.4.3. What is efficient?

2.5. Planning: be flexible

2.6. Exercises

2.7. Summary

3. Data all around us: the virtual wilderness

3.1. Data as the object of study

3.1.1. The users of computers and the Internet became data generators

3.1.2. Data for its own sake

3.1.3. Data scientist as explorer

3.2. Where data might live, and how to interact with it

3.2.1. "Flat" files

3.2.2. HTML

3.2.3. XML

3.2.4. JSON

3.2.5. Relational databases

3.2.6. Non-relational databases

3.2.7. APIs

3.2.8. Common bad formats

3.2.9. Unusual formats

3.2.10. Which format to use

3.3. Scouting for data

3.3.3. The data you have: is it enough?

3.3.4. Combining data sources

3.3.5. Web scraping

3.3.6. Measuring or collecting things yourself

3.4. Example: microRNA and gene expression

3.5. Exercises

3.6. Summary

4. Data wrangling: from capture to domestication

4.1. Case study: best all-time performances in track and field

4.1.1. Common heuristic comparisons

4.1.2. IAAF Scoring Tables

4.1.3. Comparing performances using all data available

4.2. Getting ready to wrangle

4.2.1. Some types of messy data

4.2.2. Pretend you are an algorithm

4.2.3. Keep imagining; what are the possible obstacles and uncertainties?

4.2.4. Look at the end of the data and the file

4.2.5. Make a plan

4.3. Techniques and tools

4.3.1. File format converters

4.3.2. Proprietary data wranglers

4.3.3. Scripting: use the plan, but then guess-and-check

4.4. Common pitfalls

4.4.1. Watch out for Windows/Mac/Linux problems

4.4.2. Escape characters

4.4.3. The Outliers

4.4.4. Horror stories around the Wranglers' Campfire

4.5. Exercises

4.6. Summary

5. Data assessment: poking and prodding

5.1. Example: the Enron email data set

5.2. Descriptive statistics

5.2.1. Stay close to the data

5.2.2. Common descriptive statistics

5.2.3. Choosing specific statistics to calculate

5.2.4. Make tables or graphs where appropriate

5.3. Check assumptions about the data

5.3.1. Assumptions about the contents of the data

5.3.2. Assumptions about the distribution of the data

5.3.3. A handy trick for uncovering your assumptions

5.4. Looking for something specific

5.4.1. Find a few examples

5.4.2. Characterize the examples: what makes them different?

5.4.3. Data snooping (or not)

5.5. Rough statistical analysis

5.5.1. Dumb it down

5.5.2. Take a subset of the data

5.5.3. Increasing sophistication: does it improve results?

5.6. Exercises

5.7. Summary

Part 2: Building a product with software and statistics

6. Developing a plan

6.1. What have you learned?

6.1.1. Examples

6.1.2. Evaluating what you have learned

6.2. Reconsidering expectations and goals

6.2.1. Unexpected new information

6.2.2. Adjusting goals

6.2.3. Consider more exploratory work

6.3. Planning

6.3.1. Examples

6.4. Communicating new goals

6.5. Exercises

6.6. Summary

7. Statistics and modeling: concepts and foundations

7.1. How I think about statistics

7.2. Statistics: the field as it relates to data science

7.2.1. What statistics is

7.2.2. What statistics is not

7.3. Mathematics

7.3.1. Example: long division

7.3.2. Mathematical models

7.3.3. Mathematics vs. Statistics

7.4. Statistical modeling and inference

7.4.1. Defining a statistical model

7.4.2. Latent variables

7.4.3. Quantifying uncertainty: randomness, variance, and error terms

7.4.4. Fitting a model

7.4.5. Bayesian vs. frequentist statistics

7.4.6. Drawing conclusions from models

7.5. Miscellaneous statistical methods

7.5.1. Clustering

7.5.2. Component analysis

7.5.3. Machine learning and "black box" methods

7.6. Exercises

7.7. Summary

8. Software: statistics in action

8.1. Spreadsheets and GUI-based applications

8.1.1. Spreadsheets

8.1.2. Other GUI-based statistical applications

8.1.3. Data science for the masses

8.2. Programming

8.2.1. Getting started with programming

8.2.2. Languages

8.3. Choosing statistical software tools

8.3.1. Does the tool have an implementation of the methods?

8.3.2. Flexibility is good

8.3.3. Informative is good

8.3.4. Common is good

8.3.5. Well-documented is good

8.3.6. Purpose-built is good

8.3.7. Interoperability is good

8.3.8. Permissive licenses are good

8.3.9. Knowledge and familiarity are good

8.4. Translating statistics into software

8.4.1. Using built-in methods

8.4.2. Writing your own methods

8.5. Exercises

8.6. Summary

9. Supplementary software: bigger, faster, more efficient

9.1. Databases

9.1.1. Types of databases

9.1.2. Benefits of databases

9.1.3. How to use databases

9.1.4. When to use databases

9.2. High-performance computing

9.2.1. Types of HPC

9.2.2. Benefits of HPC

9.2.3. How to use HPC

9.2.4. When to use HPC

9.3. Cloud services

9.3.1. Types of cloud services

9.3.2. Benefits of cloud services

9.3.3. How to use cloud services

9.3.4. When to use cloud services

9.4. Big data technologies

9.4.1. Types of big data technologies

9.4.2. Benefits of big data technologies

9.4.3. How to use big data technologies

9.4.4. When to use big data technologies

9.5. Anything as a Service (AaaS)

9.6. Exercises

9.7. Summary

10. Plan execution: putting it all together

10.1. Tips for executing the plan

10.1.1. If you’re a statistician

10.1.2. If you’re a software engineer

10.1.3. If you’re a beginner

10.1.4. If you’re a member of a team

10.1.5. If you’re leading a team

10.2. Modifying the plan in progress

10.2.1. Sometimes the goals change

10.2.2. Something might be more difficult than you thought

10.2.3. Sometimes you realize you made a bad choice

10.3. Results: knowing when they’re good enough

10.3.1. Statistical significance

10.3.2. Practical usefulness

10.3.3. Re-evaluating your original accuracy and significance goals

10.4. Case study: protocols for measurement of gene activity

10.4.1. The project

10.4.2. What I knew

10.4.3. What I needed to learn

10.4.4. The resources

10.4.5. The statistical model

10.4.6. The software

10.4.7. The plan

10.4.8. The results

10.4.9. Submitting for publication and feedback

10.4.10. How it ended

10.5. Exercises

10.6. Summary

Part 3: Finishing off the product and wrapping up

11. Delivering a product

11.1. Understanding your customer

11.1.1. Who is the entire audience for the results?

11.1.2. What will be done with the results?

11.2. Delivery media

11.2.1. Report or white paper

11.2.2. Analytical tool

11.2.3. Interactive graphical application

11.2.4. Instructions for how to re-do the analysis

11.2.5. Other types of products

11.3. Content

11.3.1. Make important, conclusive results prominent

11.3.2. Don’t include results that are virtually non-conclusive

11.3.3. Include obvious disclaimers for less significant results

11.3.4. User experience

11.4. Example: analyzing video game play

11.5. Exercises

11.6. Summary

12. After product delivery: problems and revisions

12.1. Problems with the product and its use

12.1.1. Customers not using the product correctly

12.1.2. UX problems

12.1.3. Software bugs

12.1.4. The product doesn’t solve real problems

12.2. Feedback

12.2.1. Feedback means someone is using your product

12.2.2. Feedback is not disapproval

12.2.3. Read between the lines

12.2.4. Ask for feedback if you must

12.3. Product revisions

12.3.1. Uncertainty can make revisions necessary

12.3.2. Designing revisions

12.3.3. Engineering revisions

12.3.4. Deciding which revisions to make

12.4. Exercises

12.5. Summary

13. Wrapping up: putting the project away

13.1. Putting the project away neatly

13.1.1. Documentation

13.1.2. Storage

13.1.3. Thinking ahead to future scenarios

13.1.4. Best practices

13.2. Learning from the project

13.2.1. Project post mortem

13.3. Looking towards the future

13.4. Exercises

13.5. Summary

About the Technology

Data collected from customers, scientific measurements, IoT sensors, and so on is valuable only if you understand it. Data scientists revel in the interesting and rewarding challenge of observing, exploring, analyzing, and interpreting this data. Getting started with data science means more than mastering analytic tools and techniques, however; the real magic happens when you begin to think like a data scientist. This book will get you there.

About the book

Think Like a Data Scientist teaches you a step-by-step approach to solving real-world data-centric problems. By breaking down carefully crafted examples, you'll learn to combine analytic, programming, and business perspectives into a repeatable process for extracting real knowledge from data. As you read, you'll discover (or remember) valuable statistical techniques and explore powerful data science software. More importantly, you'll put this knowledge together using a structured process for data science. When you've finished, you'll have a strong foundation for a lifetime of data science learning and practice.

What's inside

  • The data science process, step-by-step
  • How to anticipate problems
  • Dealing with uncertainty
  • Best practices in software and scientific thinking

About the reader

Readers need beginner programming skills and knowledge of basic statistics.

About the author

Brian Godsey has worked in software, academia, finance, and defense and has launched several data-centric start-ups.


Goes beyond simple tools and techniques and helps you to conceptualize and solve challenging, real-world data science problems.

Casimir Saternos, Synchronoss Technologies

A successful attempt to put the mind of a data scientist on paper.

David Krief, Altansia

The book that changed my career path!

Nicolas Boulet-Lavoie, DL Innov