Think Like a Data Scientist
Tackle the data science process step-by-step
Brian Godsey
  • MEAP began August 2016
  • Publication in March 2017 (estimated)
  • ISBN 9781633430273
  • 340 pages (estimated)
  • printed in black & white

Think Like a Data Scientist presents a step-by-step approach to data science, combining analytic, programming, and business perspectives into easy-to-digest techniques and thought processes for solving real-world data-centric problems. This book helps you fill in conceptual knowledge gaps in the daunting fields of statistics and software development, and relates those skills to the real concerns of data science in the business world. As you work through the many practical examples, you'll use your existing knowledge of statistics and programming to solve real problems in data science.

Table of Contents

Part 1: Preparing and gathering data and knowledge

1. Philosophies of data science

1.1. Data science and this book

1.2. Awareness is valuable

1.3. Developer vs. Data Scientist

1.4. Do I need to be a software developer?

1.5. Do I need to know statistics?

1.6. Priorities: knowledge 1st, technology 2nd, opinions 3rd

1.7. Best practices

1.7.1. Documentation

1.7.2. Code repositories and versioning

1.7.3. Code organization

1.7.4. Ask questions

1.7.5. Stay close to the data

1.8. Reading this book: how I discuss concepts

1.9. Summary

2. Setting goals by asking good questions

2.1. Listening to the customer

2.1.1. Resolving wishes and pragmatism

2.1.2. The customer is probably not a data scientist

2.1.3. Asking very specific questions to uncover fact, not opinions

2.1.4. Suggesting deliverables: guess and check

2.1.5. Iterate your ideas based on knowledge, not wishes

2.2. Ask good questions of the data

2.2.1. Good questions are concrete in their assumptions

2.2.2. Good answers: measurable success without too much cost

2.3. Answering the question using data

2.3.1. Are the data relevant and sufficient?

2.3.2. Has someone done this before?

2.3.3. Figuring out what data and software you could use

2.3.4. Anticipate obstacles to getting everything you want

2.4. Setting goals

2.5. Planning: be flexible

2.6. Exercises

2.7. Summary

3. Data all around us: the virtual wilderness

3.1. Data as the object of study

3.1.1. The users of computers and the Internet became data generators

3.1.2. Data for its own sake

3.1.3. Data scientist as explorer

3.2. Where data might live, and how to interact with it

3.2.1. "Flat" files

3.2.2. HTML

3.2.3. XML

3.2.4. JSON

3.2.5. Relational databases

3.2.6. Non-relational databases

3.2.7. APIs

3.2.8. Common bad formats

3.2.9. Unusual formats

3.2.10. Which format to use

3.3. Scouting for data

3.3.3. The data you have: is it enough?

3.3.4. Combining data sources

3.3.5. Web scraping

3.3.6. Measuring or collecting things yourself

3.4. Example: microRNA and gene expression

3.5. Exercises

3.6. Summary

4. Data wrangling: from capture to domestication

4.1. Case study: best all-time performances in track and field

4.1.1. Common heuristic comparisons

4.1.2. IAAF Scoring Tables

4.1.3. Comparing performances using all data available

4.2. Getting ready to wrangle

4.2.1. Some types of messy data

4.2.2. Pretend you are an algorithm

4.2.3. Keep imagining; what are the possible obstacles and uncertainties?

4.2.4. Look at the end of the data and the file

4.2.5. Make a plan

4.3. Techniques and tools

4.3.1. File format converters

4.3.2. Proprietary data wranglers

4.3.3. Scripting: use the plan, but then guess-and-check

4.4. Common pitfalls

4.4.1. Watch out for Windows/Mac/Linux problems

4.4.2. Escape characters

4.4.3. The Outliers

4.4.4. Horror stories around the Wranglers' Campfire

4.5. Exercises

4.6. Summary

5. Data assessment: poking and prodding

5.1. Example: the Enron email data set

5.2. Descriptive statistics

5.2.1. Stay close to the data

5.2.2. Common descriptive statistics

5.2.3. Choosing specific statistics to calculate

5.2.4. Make tables or graphs where appropriate

5.3. Check assumptions about the data

5.3.1. Assumptions about the contents of the data

5.3.2. Assumptions about the distribution of the data

5.3.3. A handy trick for uncovering your assumptions

5.4. Looking for something specific

5.4.1. Find a few examples

5.4.2. Characterize the examples: what makes them different?

5.4.3. Data snooping (or not)

5.5. Rough statistical analysis

5.5.1. Dumb it down

5.5.2. Take a subset of the data

5.5.3. Increasing sophistication: does it improve results?

5.6. Exercises

5.7. Summary

Part 2: Building a product with software and statistics

6. Developing a plan

6.1. What have you learned?

6.1.1. Examples

6.1.2. Evaluating what you have learned

6.2. Reconsidering expectations and goals

6.2.1. Unexpected new information

6.2.2. Adjusting goals

6.2.3. Consider more exploratory work

6.3. Planning

6.3.1. Examples

6.4. Communicating new goals

6.5. Exercises

6.6. Summary

7. Statistics and modeling: concepts and foundations

7.1. How I think about statistics

7.2. Statistics: the field as it relates to data science

7.2.1. What statistics is

7.2.2. What statistics is not

7.3. Mathematics

7.3.1. Example: long division

7.3.2. Mathematical models

7.3.3. Mathematics vs. Statistics

7.4. Statistical modeling and inference

7.4.1. Defining a statistical model

7.4.2. Latent variables

7.4.3. Quantifying uncertainty: randomness, variance, and error terms

7.4.4. Fitting a model

7.4.5. Bayesian vs. frequentist statistics

7.4.6. Drawing conclusions from models

7.5. Miscellaneous statistical methods

7.5.1. Clustering

7.5.2. Component analysis

7.5.3. Machine learning and "black box" methods

7.6. Exercises

7.7. Summary

8. Software: statistics in action

8.1. Spreadsheets and GUI-based applications

8.1.1. Spreadsheets

8.1.2. Other GUI-based statistical applications

8.1.3. Data science for the masses

8.2. Programming

8.2.1. Getting started with programming

8.2.2. Languages

8.3. Choosing statistical software tools

8.3.1. Does the tool have an implementation of the methods?

8.3.2. Flexibility is good

8.3.3. Informative is good

8.3.4. Common is good

8.3.5. Well-documented is good

8.3.6. Purpose-built is good

8.3.7. Interoperability is good

8.3.8. Permissive licenses are good

8.3.9. Knowledge and familiarity are good

8.4. Translating statistics into software

8.4.1. Using built-in methods

8.4.2. Writing your own methods

8.5. Exercises

8.6. Summary

9. Supplementary software: bigger, faster, more efficient

9.1. Databases

9.1.1. Types of databases

9.1.2. Benefits of databases

9.1.3. How to use databases

9.1.4. When to use databases

9.2. High-performance computing

9.2.1. Types of HPC

9.2.2. Benefits of HPC

9.2.3. How to use HPC

9.2.4. When to use HPC

9.3. Cloud services

9.3.1. Types of cloud services

9.3.2. Benefits of cloud services

9.3.3. How to use cloud services

9.3.4. When to use cloud services

9.4. Big data technologies

9.4.1. Types of big data technologies

9.4.2. Benefits of big data technologies

9.4.3. How to use big data technologies

9.4.4. When to use big data technologies

9.5. Anything as a Service (AaaS)

9.6. Exercises

9.7. Summary

10. Plan execution: putting it all together

10.1. Tips for executing the plan

10.1.1. If you’re a statistician

10.1.2. If you’re a software engineer

10.1.3. If you’re a beginner

10.1.4. If you’re a member of a team

10.1.5. If you’re leading a team

10.2. Modifying the plan in progress

10.2.1. Sometimes the goals change

10.2.2. Something might be more difficult than you thought

10.2.3. Sometimes you realize you made a bad choice

10.3. Results: knowing when they’re good enough

10.3.1. Statistical significance

10.3.2. Practical usefulness

10.3.3. Re-evaluating your original accuracy and significance goals

10.4. Case study: protocols for measurement of gene activity

10.4.1. The project

10.4.2. What I knew

10.4.3. What I needed to learn

10.4.4. The resources

10.4.5. The statistical model

10.4.6. The software

10.4.7. The plan

10.4.8. The results

10.4.9. Submitting for publication and feedback

10.4.10. How it ended

10.5. Exercises

10.6. Summary

Part 3: Finishing off the product and wrapping up

11. Delivering a product

11.1. Understanding your customer

11.1.1. Who is the entire audience for the results?

11.1.2. What will be done with the results?

11.2. Delivery media

11.2.1. Report or white paper

11.2.2. Analytical tool

11.2.3. Interactive graphical application

11.2.4. Instructions for how to re-do the analysis

11.2.5. Other types of products

11.3. Content

11.3.1. Make important, conclusive results prominent

11.3.2. Don’t include results that are virtually non-conclusive

11.3.3. Include obvious disclaimers for less significant results

11.3.4. User experience

11.4. Example: analyzing video game play

11.5. Exercises

11.6. Summary

12. After product delivery: problems and revisions

12.1. Problems with the product and its use

12.1.1. Customers not using the product correctly

12.1.2. UX problems

12.1.3. Software bugs

12.1.4. The product doesn’t solve real problems

12.2. Feedback

12.2.1. Feedback means someone is using your product

12.2.2. Feedback is not disapproval

12.2.3. Read between the lines

12.2.4. Ask for feedback if you must

12.3. Product revisions

12.3.1. Uncertainty can make revisions necessary

12.3.2. Designing revisions

12.3.3. Engineering revisions

12.3.4. Deciding which revisions to make

12.4. Exercises

12.5. Summary

13. Wrapping up: putting the project away

13.1. Putting the project away neatly

13.1.1. Documentation

13.1.2. Storage

13.1.3. Thinking ahead to future scenarios

13.1.4. Best practices

13.2. Learning from the project

13.2.1. Project post mortem

13.3. Looking towards the future

13.4. Exercises

13.5. Summary

About the Technology

Data science is more than just a set of tools and techniques for extracting knowledge from data sets and data streams. Data science is also a process of getting from goals and questions to real, valuable outcomes by exploring, observing, and manipulating a world of data. Traversing this world can be difficult and confusing. Software developers and non-technical folks may struggle with the uncertainty and fuzzy answers that data invariably provide, and statisticians may have trouble working with any of the multitude of relevant software tools that lie outside of their expertise. Others may not even know where to begin.

What's inside

  • Following the data science process, step by step
  • Discovering the world of data as a wilderness to be explored, wrangled, and studied
  • Learning to foresee problems at each stage of a data science project
  • Dealing with the uncertainty inherent in working with data
  • Understanding concepts of data, software, and statistics in ways accessible even for beginners
  • Engaging in some of the most relevant best practices in software, statistics, and scientific thinking

About the reader

Readers should have some familiarity with a programming language and basic statistics, but need not be experts. Important foundational concepts will be reviewed briefly in the book.

About the author

Brian Godsey, Ph.D., is a mathematician, entrepreneur, investor, and data scientist. He has worked in the analytic software industry, in academia, in finance, at the U.S. Department of Defense, and most recently as co-founder or early employee of data-centric start-ups, including Unoceros, Panopticon Labs, and RedOwl.
