Introducing Data Science
Big data, machine learning, and more, using Python tools
Davy Cielen, Arno D. B. Meysman, and Mohamed Ali
  • May 2016
  • ISBN 9781633430037
  • 320 pages
  • printed in black & white

Read this book if you want to get a quick overview of data science, with lots of examples to get you started!

Alvin Raj, Oracle

Introducing Data Science teaches you how to accomplish the fundamental tasks that occupy data scientists. Using the Python language and common Python libraries, you'll experience firsthand the challenges of dealing with data at scale and gain a solid foundation in data science.

Table of Contents

1. Data science in a Big Data world

1.1. Benefits and uses of data science and Big Data

1.2. Facets of data

1.2.1. Structured data

1.2.2. Unstructured data

1.2.3. Natural language

1.2.4. Machine-generated data

1.2.5. Graph-based or network data

1.2.6. Audio, image, and video

1.2.7. Streaming data

1.3. The data science process

1.3.1. Setting the research goal

1.3.2. Retrieving data

1.3.3. Data cleansing

1.3.4. Data exploration

1.3.5. Data modeling or model building

1.3.6. Presentation and automation

1.4. The Big Data ecosystem and data science

1.4.1. Distributed file systems

1.4.2. Distributed programming framework

1.4.3. Data integration framework

1.4.4. Machine learning frameworks

1.4.5. NoSQL databases

1.4.6. Scheduling tools

1.4.7. Benchmarking tools

1.4.8. System deployment

1.4.9. Service programming

1.4.10. Security

1.5. An introductory working example of Hadoop

1.6. Summary

2. The data science process

2.1. Overview of the data science process

2.1.1. Don't be a slave to the process

2.2. Step 1: defining research goals and creating a project charter

2.2.1. Spend time understanding the goals and context of your research

2.2.2. Create a project charter

2.3. Step 2: retrieving data

2.3.1. Start with data stored within the company

2.3.2. Don't be afraid to shop around

2.3.3. Do data quality checks now to prevent problems later

2.4. Step 3: cleansing, integrating, and transforming data

2.4.1. Cleansing data

2.4.2. Correct errors as early as possible

2.4.3. Combining data from different data sources

2.4.4. Transforming data

2.5. Step 4: exploratory data analysis

2.6. Step 5: build the models

2.6.1. Model and variable selection

2.6.2. Model execution

2.6.3. Model diagnostics and model comparison

2.7. Step 6: presenting findings and building applications on top of them

2.8. Summary

3. Machine learning

3.1. What is machine learning and why should you care about it?

3.1.1. Applications for machine learning in data science

3.1.2. Where machine learning is used in the data science process

3.1.3. Python tools used in machine learning

3.2. The modeling process

3.2.1. Engineering features and selecting a model

3.2.2. Training your model

3.2.3. Validating a model

3.2.4. Predicting new observations

3.3. Types of machine learning

3.3.1. Supervised learning

3.3.2. Unsupervised learning

3.4. Semi-supervised learning

3.5. Summary

4. Handling large data on a single computer

4.1. The problems you face when handling large data

4.2. General techniques for handling large volumes of data

4.2.1. Choosing the right algorithm

4.2.2. Choosing the right data structure

4.2.3. Selecting the right tools

4.3. General programming tips for dealing with large datasets

4.3.1. Don't reinvent the wheel

4.3.2. Get the most out of your hardware

4.3.3. Reduce your computing needs

4.4. Case study 1: predicting malicious URLs

4.4.1. Step 1: defining the research goal

4.4.2. Step 2: acquiring the URL data

4.4.3. Step 4 of the data science process: data exploration

4.4.4. Step 5 of the data science process: model building

4.5. Case study 2: building a recommender system inside a database

4.5.1. Tools and techniques needed

4.5.2. Step 1 of the data science process: research question

4.5.3. Step 3 of the data science process: data preparation

4.5.4. Step 5 of the data science process: model building

4.5.5. Step 6 of the data science process: presentation and automation

4.6. Summary

5. First steps in Big Data

5.1. Distributing data storage and processing with frameworks

5.1.1. Hadoop: a framework for storing and processing large datasets

5.1.2. Spark: replacing MapReduce for better performance

5.2. Case study: assessing risk when loaning money

5.2.1. Part 1 of the data science process: the research goal

5.2.2. Part 2 of the data science process: data retrieval

5.2.3. Part 3 of the data science process: data preparation

5.2.4. Steps 4 and 6: data exploration and report building

5.3. Summary

6. Join the NoSQL movement

6.1. Introduction to NoSQL

6.1.1. ACID: the core principle of relational databases

6.1.2. CAP Theorem: the problem with DBs on many nodes

6.1.3. The BASE principles of NoSQL databases

6.1.4. NoSQL database types

6.2. Case study: what disease is that?

6.2.1. Step 1: setting the research goal

6.2.2. Steps 2 and 3: data retrieval and preparation

6.2.3. Step 4: data exploration

6.2.4. Step 3 revisited: data preparation for disease profiling

6.2.5. Step 4 revisited: data exploration for disease profiling

6.2.6. Step 6: presentation and automation

6.3. Summary

7. The rise of graph databases

7.1. Introducing connected data and graph databases

7.1.1. Why and when should I use a graph database?

7.2. Introducing Neo4j: a graph database

7.2.1. Cypher: a graph query language

7.3. Connected data example: a recipe recommendation engine

7.3.1. Step 1: setting the research goal

7.3.2. Step 2: data retrieval

7.3.3. Step 3: data preparation

7.3.4. Step 4: data exploration

7.3.5. Step 5: data modeling

7.3.6. Step 6: presentation

7.4. Summary

8. Text mining and text analytics

8.1. Text mining in the real world

8.2. Text mining techniques

8.2.1. Bag of words

8.2.2. Stemming and lemmatization

8.2.3. Decision tree classifier

8.3. Case study: classifying Reddit posts

8.3.1. Meet the Natural Language Toolkit

8.3.2. Data science process overview and step 1: the research goal

8.3.3. Step 2: data retrieval

8.3.4. Step 3: data preparation

8.3.5. Step 4: data exploration

8.3.6. Step 3 revisited: data preparation adapted

8.3.7. Step 5: data analysis

8.3.8. Step 6: presentation and automation

8.4. Summary

9. Data visualization to the end user

9.1. Data visualization options

9.2. Crossfilter, the JavaScript MapReduce library

9.2.1. Setting everything up

9.2.2. Unleashing Crossfilter to filter the medicine dataset

9.3. Create an interactive dashboard with dc.js

9.4. Dashboard development tools

9.5. Summary


Appendix A: Setting up Elasticsearch

A.1. Linux Installation

A.2. Windows Installation

Appendix B: Setting up Neo4j

B.1. Linux Installation

B.2. Windows Installation

Appendix C: Installing MySQL server

C.1. Windows Installation

C.2. Linux Installation

Appendix D: Setting up Anaconda with a virtual environment

D.1. Linux Installation

D.2. Windows Installation

D.3. Setting up the Environment

About the Technology

Many companies need developers with data science skills to work on projects ranging from social media marketing to machine learning. Discovering what you need to learn to begin a career as a data scientist can seem bewildering. This book is designed to help you get started.

About the book

Introducing Data Science explains vital data science concepts and teaches you how to accomplish the fundamental tasks that occupy data scientists. You’ll explore data visualization, graph databases, the use of NoSQL, and the data science process. You’ll use the Python language and common Python libraries as you experience firsthand the challenges of dealing with data at scale. Discover how Python allows you to gain insights from datasets so big that they need to be stored on multiple machines, or from data moving so quickly that no single machine can handle it. This book gives you hands-on experience with the most popular Python data science libraries, scikit-learn and statsmodels. After reading this book, you’ll have the solid foundation you need to start a career in data science.

What's inside

  • Handling large data
  • Introduction to machine learning
  • Using Python to work with data
  • Writing data science algorithms

About the reader

This book assumes you're comfortable reading code in Python or a similar language, such as C, Ruby, or JavaScript. No prior experience with data science is required.

About the authors

Davy Cielen, Arno D. B. Meysman, and Mohamed Ali are the founders and managing partners of Optimately and Maiton, where they focus on developing data science projects and solutions in various sectors.

Combo: $44.99 (pBook + eBook)
eBook: $35.99 (PDF + ePub + Kindle)

The map that will help you navigate the data science oceans.

Marius Butuc, Shopify

Covers the processes involved in data science from end to end… A complete overview.

Heather Campbell, Kainos

A must-read for anyone who wants to get into the data science world.

Hector Cuesta, Big Data Bootcamp