Advanced Analytics with Spark
Patterns for Learning from Data at Scale
Second Edition
Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Beijing, Boston, Farnham, Sebastopol, Tokyo
Advanced Analytics with Spark
by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Copyright © 2017 Sanford Ryza, Uri Laserson, Sean Owen, Joshua Wills. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com/safari). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Marie Beaugureau
Production Editor: Melanie Yarbrough
Copyeditor: Gillian McGarvey
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2017: Second Edition
Revision History for the Second Edition
2017-06-09: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Advanced Analytics with Spark, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-97295-3
[LSI]
Table of Contents
Foreword
Preface
1. Analyzing Big Data
    The Challenges of Data Science
    Introducing Apache Spark
    About This Book
    The Second Edition
2. Introduction to Data Analysis with Scala and Spark
    Scala for Data Scientists
    The Spark Programming Model
    Record Linkage
    Getting Started: The Spark Shell and SparkContext
    Bringing Data from the Cluster to the Client
    Shipping Code from the Client to the Cluster
    From RDDs to Data Frames
    Analyzing Data with the DataFrame API
    Fast Summary Statistics for DataFrames
    Pivoting and Reshaping DataFrames
    Joining DataFrames and Selecting Features
    Preparing Models for Production Environments
    Model Evaluation
    Where to Go from Here
3. Recommending Music and the Audioscrobbler Data Set
    Data Set