Getting Structured Data from the Internet

Running Web Crawlers/Scrapers on a Big Data Production Scale




Price: $44.99
Author: Jay M. Patel
Publisher: Apress
Published: 2020
Pages: 397
Language: English
Format: Paper book / ebook (PDF)
ISBN-10: 1484265750
ISBN-13: 9781484265758

Utilize web scraping at scale to quickly get unlimited amounts of free data available on the web into a structured format. This book teaches you to use Python scripts to crawl through websites at scale, scrape data from HTML and JavaScript-enabled pages, and convert it into structured data formats such as CSV, Excel, or JSON, or load it into a SQL database of your choice.

This book goes beyond the basics of web scraping and covers advanced topics such as natural language processing (NLP) and text analytics to extract names of people, places, email addresses, contact details, etc., from a page at production scale using distributed big data techniques on an Amazon Web Services (AWS)-based cloud infrastructure. The book also covers developing a robust data processing and ingestion pipeline on the Common Crawl corpus, a web crawl data set containing petabytes of publicly available data and listed on AWS's registry of open data.

Getting Structured Data from the Internet also includes a step-by-step tutorial on deploying your own crawlers using a production web scraping framework (such as Scrapy) and dealing with real-world issues (such as breaking Captcha, proxy IP rotation, and more). Code used in the book is provided to help you understand the concepts in practice and write your own web crawler to power your business ideas.




Similar Books


Web Scraping with Python, 2nd Edition


by Ryan Mitchell

If programming is magic then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. The expanded edition of this practical book not only introduces you to web scraping, but also serves as a comprehensive guide to scraping alm...

Price:  $35.87  |  Publisher:  O'Reilly Media  |  Release:  2018

Program the Internet of Things with Swift for iOS


by Ahmed Bakir, Manny de la Torriente, Gheorghe Chesler

Program the Internet of Things with Swift and iOS is a detailed tutorial that will teach you how to build apps using Apple's native APIs for the Internet of Things, including the Apple Watch, HomeKit, and Apple Pay. This is the second book by Ahmed Bakir (author of Beginning iOS Media App Development) and his team at devAtelier LLC, ...

Price:  $45.10  |  Publisher:  Apress  |  Release:  2015

Program the Internet of Things with Swift for iOS, 2nd Edition


by Ahmed Bakir

Learn how to build apps using Apple's native APIs for the Internet of Things, including the Apple Watch, HomeKit, and Apple Pay. You'll also see how to interface with popular third-party hardware such as the Raspberry Pi, Arduino, and the FitBit family of devices. Program the Internet of Things with Swift and iOS is an update to ...

Price:  $28.47  |  Publisher:  Apress  |  Release:  2018

Modeling the Internet and the Web


by Pierre Baldi, Paolo Frasconi, Padhraic Smyth

Modeling the Internet and the Web covers the most important aspects of modeling the Web using a modern mathematical and probabilistic treatment. It focuses on the information and application layers, as well as some of the emerging properties of the Internet. Interdisciplinary in nature, Modeling the Internet and the Web will be of interest...

Price:  $6.98  |  Publisher:  Wiley  |  Release:  2003

MySQL for the Internet of Things


by Charles Bell

This book introduces the problems facing Internet of Things developers and explores current technologies and techniques to help you manage, mine, and make sense of the data being collected through the use of the world's most popular database on the Internet - MySQL. The IoT is poised to change how we interact with and perceive the wor...

Price:  $25.44  |  Publisher:  Apress  |  Release:  2016

Web Scraping with Python


by Ryan Mitchell

Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you'll learn how to use Python scripts and web APIs to gather and process data from thousands - or even millions - of web pages at once. Ideal for programmers, security professionals, and web administrators...

Price:  $14.00  |  Publisher:  O'Reilly Media  |  Release:  2015

Linked Data


by David Wood, Marsha Zaidman, Luke Ruth, Michael Hausenblas

The current Web is mostly a collection of linked documents useful for human consumption. The evolving Web includes data collections that may be identified and linked so that they can be consumed by automated processes. The W3C approach to this is Linked Data and it is already used by Google, Facebook, IBM, Oracle, and government agencies ...

Price:  $15.45  |  Publisher:  Manning  |  Release:  2014

Building a Virtual Assistant for Raspberry Pi


by Tanay Pant

Build a voice-controlled virtual assistant using speech-to-text engines, text-to-speech engines, and conversation modules. This book shows you how to program the virtual assistant to gather data from the internet (weather data, data from Wikipedia, data mining); play music; and take notes. Each chapter covers building a mini project/modul...

Price:  $29.68  |  Publisher:  Apress  |  Release:  2016