Web Scraping with Python

COLLECTING MORE DATA FROM THE MODERN WEB

Ryan Mitchell

SECOND EDITION

Beijing • Boston • Farnham • Sebastopol • Tokyo

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Allyson MacDonald
Production Editor: Justin Billing
Copyeditor: Sharon Wilkey
Proofreader: Christina Edwards
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2018: Second Edition

Revision History for the Second Edition

2018-03-20: First Release
2018-11-21: Second Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491985571 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Web Scraping with Python, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-98557-1

[LSI]

Table of Contents

Preface

Part I. Building Scrapers

Your First Web Scraper
    Connecting
    An Introduction to BeautifulSoup
    Installing BeautifulSoup
    Running BeautifulSoup
    Connecting Reliably and Handling Exceptions

Advanced HTML Parsing
    You Don’t Always Need a Hammer
    Another Serving of BeautifulSoup
    find() and find_all() with BeautifulSoup
    Other BeautifulSoup Objects
    Navigating Trees
    Regular Expressions
    Regular Expressions and BeautifulSoup
    Accessing Attributes
    Lambda Expressions

Writing Web Crawlers
    Traversing a Single Domain
    Crawling an Entire Site
    Collecting Data Across an Entire Site
    Crawling Across the Internet

Web Crawling Models
    Planning and Defining Objects
    Dealing with Different Website Layouts
