{"id":298737,"date":"2019-11-12T05:00:59","date_gmt":"2019-11-12T12:00:59","guid":{"rendered":"https:\/\/css-tricks.com\/?p=298737"},"modified":"2019-11-12T15:16:50","modified_gmt":"2019-11-12T22:16:50","slug":"scrapestack-an-api-for-scraping-sites","status":"publish","type":"post","link":"https:\/\/css-tricks.com\/scrapestack-an-api-for-scraping-sites\/","title":{"rendered":"scrapestack: An API for Scraping Sites"},"content":{"rendered":"

Not every site has an API for accessing its data. Most don’t, in fact. If you need to pull that data, one approach is to “scrape” it. That is, load the page in a web browser (that you automate), find what you are looking for in the DOM, and take it.<\/p>\n

You can do this yourself if you want to deal with the cost, maintenance, and technical debt. For example, this is one of the big use cases for “headless” browsers, like how Puppeteer<\/a> can spin up and control headless Chrome.<\/p>\n

Or, you can use a tool like scrapestack<\/a>, a ready-to-use API that not only does the scraping for you, but likely does it better, faster, and with more options than you would get doing it yourself.<\/p>\n

<\/p>\n

Say my goal is to pull the latest completed meetup from a Meetup.com page. Meetup.com has an API, but it’s pricy and requires OAuth and stuff. All we need is the name and link of a past meetup here, so let’s just yank it off the page.<\/p>\n

We can see what we need in the DOM:<\/p>\n

\"\"<\/figure>\n

To have a play, let’s scrape it with the scrapestack API client-side with jQuery:<\/p>\n

$.get('https:\/\/api.scrapestack.com\/scrape',\r\n  {\r\n    access_key: 'MY_API_KEY',\r\n    url: 'https:\/\/www.meetup.com\/BendJS\/'\r\n  },\r\n  function(websiteContent) {\r\n     \/\/ we have the entire site's HTML here!\r\n  }\r\n);<\/code><\/pre>\n

Within that callback, I can now also use jQuery to traverse the DOM, snagging the pieces I want, and constructing what I need on our site:<\/p>\n

\/\/ Get what we want\r\nlet event = $(websiteContent)\r\n  .find(\".groupHome-eventsList-pastEvents .eventCard\")\r\n  .first();\r\nlet eventTitle = event\r\n  .find(\".eventCard--link\")[0].innerText;\r\nlet eventLink = \r\n  `https:\/\/www.meetup.com\/` + \r\n  event.find(\".eventCard--link\").attr(\"href\");\r\n\r\n\/\/ Use it on page\r\n$(\"#event\").append(`\r\n  <a href=\"${eventLink}\">${eventTitle}<\/a>\r\n`);<\/code><\/pre>\n

In real usage, if we were doing it client-side like this, we’d make use of some rudimentary storage so we wouldn’t have to hit the API on every page load, like sticking the result in localStorage<\/code> and invalidating after a few days or something.<\/p>\n
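<p>Here is a minimal sketch of that kind of rudimentary storage. The helper names (readCache, writeCache), the three-day window, and the stored shape are all assumptions for illustration, not anything scrapestack provides. Both helpers take the storage object as an argument, so they work with localStorage in the browser or with a stand-in elsewhere.<\/p>\n

```javascript
// Hypothetical helpers: cache a value in a Storage-like object with a max age.
const THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000;

// Return the cached value, or null if it is missing, expired, or unreadable.
function readCache(storage, key, maxAgeMs, now = Date.now()) {
  const raw = storage.getItem(key);
  if (raw === null) return null;
  try {
    const { savedAt, value } = JSON.parse(raw);
    // Anything saved maxAgeMs or longer ago counts as expired.
    return now - savedAt < maxAgeMs ? value : null;
  } catch (err) {
    // A corrupt entry behaves like a cache miss.
    return null;
  }
}

// Store the value together with the time it was saved.
function writeCache(storage, key, value, now = Date.now()) {
  storage.setItem(key, JSON.stringify({ savedAt: now, value }));
}
```

<p>In the browser, that could mean calling readCache(localStorage, 'latest-meetup', THREE_DAYS_MS) before hitting the API, and calling writeCache with the title and link inside the callback. The 'latest-meetup' key is made up for this example.<\/p>\n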

It works!<\/p>\n

\"\"<\/figure>\n

It’s actually much more likely that we’d do our scraping server-side. For one thing, that’s the way to protect your API key, which is your responsibility, and which isn’t really possible on a public-facing site if you’re calling the API directly client-side.<\/p>\n

Myself, I’d probably make a cloud function<\/a> to do it, so I can stay in JavaScript (Node.js), and have the opportunity to tuck the data in storage<\/a> somewhere. <\/p>\n
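<p>As a rough sketch of the server-side version, the core of such a function might look like the following. It reuses the endpoint and the two parameters from the jQuery example. The function names, the SCRAPESTACK_KEY environment variable, and the reliance on a global fetch (available in Node 18 and later) are assumptions, and the wrapper your cloud provider expects around it will differ.<\/p>\n

```javascript
// Sketch only: function names and the SCRAPESTACK_KEY variable are assumptions.
const SCRAPE_ENDPOINT = 'https://api.scrapestack.com/scrape';

// Build the request URL from the same two parameters the jQuery example sends.
function buildScrapeUrl(accessKey, targetUrl) {
  const url = new URL(SCRAPE_ENDPOINT);
  url.searchParams.set('access_key', accessKey);
  url.searchParams.set('url', targetUrl);
  return url.toString();
}

// Fetch a page's HTML through scrapestack, keeping the key on the server.
// Assumes a runtime with a global fetch, such as Node 18+.
async function scrapePage(targetUrl, accessKey = process.env.SCRAPESTACK_KEY) {
  if (!accessKey) throw new Error('Missing scrapestack access key');
  const response = await fetch(buildScrapeUrl(accessKey, targetUrl));
  if (!response.ok) {
    throw new Error(`Scrape failed with status ${response.status}`);
  }
  return response.text();
}
```

<p>The browser would then call your own function’s URL rather than scrapestack’s, so the key never ships to the client.<\/p>\n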

I’d say go check out the documentation<\/a> and see if this isn’t the right answer next time you need to do some scraping. You get 10,000 requests on the free plan to try it out anyway, and it jumps up a ton on any of the paid plans with more features.<\/p>\n","protected":false},"excerpt":{"rendered":"

Not every site has an API to access data from it. Most don’t, in fact. If you need to pull that data, one approach is to “scrape” it. That is, load the page in web browser (that you automate), find what you are looking for in the DOM, and take it. You can do this […]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"sig_custom_text":"","sig_image_type":"featured-image","sig_custom_image":0,"sig_is_disabled":false,"inline_featured_image":false,"c2c_always_allow_admin_comments":false,"footnotes":"","jetpack_publicize_message":"scrapestack: An API for Scraping Sites","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[4,17,508],"tags":[],"jetpack_publicize_connections":[],"acf":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":238898,"url":"https:\/\/css-tricks.com\/what-is-a-headless-cms\/","url_meta":{"origin":298737,"position":0},"title":"What is a Headless CMS?","date":"March 11, 2016","format":false,"excerpt":"Have you heard this term going around? It's quite in vogue. It's very related to The Big Conversation\u2122 on the web the last many years. How are we going to handle bringing Our Stuff\u2122 all these different devices\/screens\/inputs. 
Responsive design says \"let's let our design and media accommodate as much\u2026","rel":"","context":"In "Article"","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":299541,"url":"https:\/\/css-tricks.com\/headless-mode\/","url_meta":{"origin":298737,"position":1},"title":"“Headless Mode”","date":"December 2, 2019","format":false,"excerpt":"A couple of months ago, we invited Marc Anton Dahmen to show off his database-less content management system (CMS) Automad. His post is an interesting inside look at templating engines, including how they work, how CMSs use them, and how they impact the way we write things, such as loops.\u2026","rel":"","context":"In "Article"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/css-tricks.com\/wp-content\/uploads\/2019\/11\/cubes-pattern.png?fit=1200%2C600&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":299410,"url":"https:\/\/css-tricks.com\/web-scraping-made-simple-with-zenscrape\/","url_meta":{"origin":298737,"position":2},"title":"Web Scraping Made Simple With Zenscrape","date":"November 28, 2019","format":false,"excerpt":"Web scraping has always been taken care of by actual developers, since a lot of coding, proxy management and CAPTCHA-solving is involved. However, the scraped data is very often needed by people that are non-coders: Marketers, Analysts, Business Developers etc. Zenscrape is an easy-to-use web scraping tool that allows people\u2026","rel":"","context":"In "Sponsored"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/css-tricks.com\/wp-content\/uploads\/2019\/11\/web-scraper-visual.png?fit=1200%2C600&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":272028,"url":"https:\/\/css-tricks.com\/headless-cms-fresh-air-for-developers\/","url_meta":{"origin":298737,"position":3},"title":"Headless CMS: The Developers\u2019 Best Friend","date":"June 7, 2018","format":false,"excerpt":"Your current CMS sucks! 
You know that for some time already but have not decided yet what your next solution should be. You've noticed all the buzz around headless CMS but you're still not sure what is in it for you and how it can solve all your woes. What\u2026","rel":"","context":"In "Link"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/css-tricks.com\/wp-content\/uploads\/2018\/06\/display-content-1200x600.ai_.png?fit=1200%2C600&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":342248,"url":"https:\/\/css-tricks.com\/just-how-niche-is-headless-wordpress\/","url_meta":{"origin":298737,"position":4},"title":"Just How Niche is Headless WordPress?","date":"June 15, 2021","format":false,"excerpt":"I wonder where headless WordPress will land. And by \"headless\" I mean only using the WordPress admin and building out the user-facing site through the WordPress REST API rather than the traditional WordPress theme structure. Is it... big? The future of WordPress? Or relatively niche? Where's the demand? Certainly, there\u2026","rel":"","context":"In "Article"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/css-tricks.com\/wp-content\/uploads\/2021\/06\/wordpress-logo-sketch.png?fit=1200%2C600&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":243580,"url":"https:\/\/css-tricks.com\/learning-cope-microservices\/","url_meta":{"origin":298737,"position":5},"title":"Learning to COPE with Microservices","date":"July 22, 2016","format":false,"excerpt":"I vividly remember my first encounter with a content management system: It was 2002 with a platform called PHP-Nuke. 
It offered a control panel where site administrators could publish new content that would be immediately available to readers, without the need to create\/edit HTML files and upload them via FTP\u2026","rel":"","context":"In "Article"","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"featured_media_src_url":null,"_links":{"self":[{"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/posts\/298737"}],"collection":[{"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/comments?post=298737"}],"version-history":[{"count":5,"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/posts\/298737\/revisions"}],"predecessor-version":[{"id":298850,"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/posts\/298737\/revisions\/298850"}],"wp:attachment":[{"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/media?parent=298737"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/categories?post=298737"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/css-tricks.com\/wp-json\/wp\/v2\/tags?post=298737"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}