Getting Structured Data from the Internet
Running Web Crawlers/Scrapers on a Big Data Production Scale
Price | $44.99 |
Author | Jay M. Patel |
Publisher | Apress |
Published | 2020 |
Pages | 397 |
Language | English |
Format | Paper book / ebook (PDF) |
ISBN-10 | 1484265750 |
ISBN-13 | 9781484265758 |
Use web scraping at scale to quickly get large amounts of freely available web data into a structured format. This book teaches you to use Python scripts to crawl websites at scale, scrape data from HTML and JavaScript-enabled pages, and convert it into structured formats such as CSV, Excel, or JSON, or load it into a SQL database of your choice.
This book goes beyond the basics of web scraping and covers advanced topics such as natural language processing (NLP) and text analytics to extract names of people, places, email addresses, contact details, etc., from a page at production scale using distributed big data techniques on an Amazon Web Services (AWS)-based cloud infrastructure. The book also covers developing a robust data processing and ingestion pipeline on the Common Crawl corpus, a publicly available web crawl data set that contains petabytes of data and is hosted on AWS's registry of open data.
Getting Structured Data from the Internet also includes a step-by-step tutorial on deploying your own crawlers using a production web scraping framework (such as Scrapy) and dealing with real-world issues (such as breaking Captcha, proxy IP rotation, and more). Code used in the book is provided to help you understand the concepts in practice and write your own web crawler to power your business ideas.
- Jay M. Patel
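To give a rough idea of the workflow the description refers to (scrape HTML, then convert it to CSV or JSON), here is a minimal sketch using only the Python standard library. It is not taken from the book: the `LinkExtractor` class, the `scrape_links` and `to_csv` helpers, and the `SAMPLE_HTML` string are all hypothetical names made up for illustration, and the input is an inline string rather than a live page.

```python
import csv
import io
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the text and href of every <a href=...> tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(
                {"text": "".join(self._text).strip(), "href": self._href}
            )
            self._href = None


def scrape_links(html):
    """Return a list of {"text", "href"} dicts for the links in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Serialize the scraped rows to a CSV string with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["text", "href"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample input; a real crawler would download this from a site.
SAMPLE_HTML = (
    '<ul><li><a href="/a">First</a></li>'
    '<li><a href="/b">Second</a></li></ul>'
)

rows = scrape_links(SAMPLE_HTML)
print(to_csv(rows))                 # structured output as CSV
print(json.dumps(rows, indent=2))   # the same rows as JSON
```

A production crawler of the kind the book describes would add downloading, politeness rules, error handling, and storage on top of this. The sketch covers only the step from HTML to structured rows.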
Similar Books
Web Scraping with Python, 2nd Edition
by Ryan Mitchell
If programming is magic then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. The expanded edition of this practical book not only introduces you to web scraping, but also serves as a comprehensive guide to scraping alm...
Price: $35.87 | Publisher: O'Reilly Media | Release: 2018
Program the Internet of Things with Swift for iOS
by Ahmed Bakir, Manny de la Torriente, Gheorghe Chesler
Program the Internet of Things with Swift and iOS is a detailed tutorial that will teach you how to build apps using Apple's native APIs for the Internet of Things, including the Apple Watch, HomeKit, and Apple Pay. This is the second book by Ahmed Bakir (author of Beginning iOS Media App Development) and his team at devAtelier LLC, ...
Price: $45.10 | Publisher: Apress | Release: 2015
Program the Internet of Things with Swift for iOS, 2nd Edition
by Ahmed Bakir
Learn how to build apps using Apple's native APIs for the Internet of Things, including the Apple Watch, HomeKit, and Apple Pay. You'll also see how to interface with popular third-party hardware such as the Raspberry Pi, Arduino, and the FitBit family of devices. Program the Internet of Things with Swift and iOS is an update to ...
Price: $28.47 | Publisher: Apress | Release: 2018
Modeling the Internet and the Web
by Pierre Baldi, Paolo Frasconi, Padhraic Smyth
Modeling the Internet and the Web covers the most important aspects of modeling the Web using a modern mathematical and probabilistic treatment. It focuses on the information and application layers, as well as some of the emerging properties of the Internet. Interdisciplinary in nature, Modeling the Internet and the Web will be of interest...
Price: $6.98 | Publisher: Wiley | Release: 2003
MySQL for the Internet of Things
by Charles Bell
This book introduces the problems facing Internet of Things developers and explores current technologies and techniques to help you manage, mine, and make sense of the data being collected through the use of the world's most popular database on the Internet - MySQL. The IoT is poised to change how we interact with and perceive the wor...
Price: $25.44 | Publisher: Apress | Release: 2016
by Ryan Mitchell
Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you'll learn how to use Python scripts and web APIs to gather and process data from thousands - or even millions - of web pages at once. Ideal for programmers, security professionals, and web administrators...
Price: $14.00 | Publisher: O'Reilly Media | Release: 2015
by David Wood, Marsha Zaidman, Luke Ruth, Michael Hausenblas
The current Web is mostly a collection of linked documents useful for human consumption. The evolving Web includes data collections that may be identified and linked so that they can be consumed by automated processes. The W3C approach to this is Linked Data and it is already used by Google, Facebook, IBM, Oracle, and government agencies ...
Price: $15.45 | Publisher: Manning | Release: 2014
Building a Virtual Assistant for Raspberry Pi
by Tanay Pant
Build a voice-controlled virtual assistant using speech-to-text engines, text-to-speech engines, and conversation modules. This book shows you how to program the virtual assistant to gather data from the internet (weather data, data from Wikipedia, data mining); play music; and take notes. Each chapter covers building a mini project/modul...
Price: $29.68 | Publisher: Apress | Release: 2016