Web Scraping with Python
COLLECTING MORE DATA FROM THE MODERN WEB
Ryan Mitchell
2nd Edition
Beijing, Boston, Farnham, Sebastopol, Tokyo
Web Scraping with Python
by Ryan Mitchell
Copyright © 2018 Ryan Mitchell. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com/safari). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Allyson MacDonald
Production Editor: Justin Billing
Copyeditor: Sharon Wilkey
Proofreader: Christina Edwards
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2018: Second Edition
Revision History for the Second Edition
2018-03-20: First Release
2018-11-21: Second Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491985571 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Web Scraping with Python, the cover
image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-98557-1
[LSI]
Table of Contents

Preface

Part I. Building Scrapers

1. Your First Web Scraper
   Connecting
   An Introduction to BeautifulSoup
   Installing BeautifulSoup
   Running BeautifulSoup
   Connecting Reliably and Handling Exceptions

2. Advanced HTML Parsing
   You Don’t Always Need a Hammer
   Another Serving of BeautifulSoup
   find() and find_all() with BeautifulSoup
   Other BeautifulSoup Objects
   Navigating Trees
   Regular Expressions
   Regular Expressions and BeautifulSoup
   Accessing Attributes
   Lambda Expressions

3. Writing Web Crawlers
   Traversing a Single Domain
   Crawling an Entire Site
   Collecting Data Across an Entire Site
   Crawling Across the Internet

4. Web Crawling Models
   Planning and Defining Objects
   Dealing with Different Website Layouts