Applied Text Analysis with Python
Enabling Language-Aware Data Products with Machine Learning
Benjamin Bengfort, Rebecca Bilbro, and Tony Ojeda
Beijing, Boston, Farnham, Sebastopol, Tokyo
Applied Text Analysis with Python
by Benjamin Bengfort, Rebecca Bilbro, and Tony Ojeda
Copyright © 2018 Benjamin Bengfort, Rebecca Bilbro, Tony Ojeda. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com/safari). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Jasmine Kwityn
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2018: First Edition
Revision History for the First Edition
2018-06-08: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491963043 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Applied Text Analysis with Python, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-96304-3
[LSI]
Table of Contents
Preface

1. Language and Computation
The Data Science Paradigm
Language-Aware Data Products
The Data Product Pipeline
Language as Data
A Computational Model of Language
Language Features
Contextual Features
Structural Features
Conclusion

2. Building a Custom Corpus
What Is a Corpus?
Domain-Specific Corpora
The Baleen Ingestion Engine
Corpus Data Management
Corpus Disk Structure
Corpus Readers
Streaming Data Access with NLTK
Reading an HTML Corpus
Reading a Corpus from a Database
Conclusion

3. Corpus Preprocessing and Wrangling
Breaking Down Documents
Identifying and Extracting Core Content