What will be scraped
Prerequisites
Basic knowledge of scraping with CSS selectors
CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.
If you haven't scraped with CSS selectors before, I have a dedicated blog post about how to use CSS selectors when web-scraping. It covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
Separate virtual environment
In short, a virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist on the same system, thus preventing library or Python version conflicts.
If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get familiar.
📌Note: this is not a strict requirement for this blog post.
Install libraries:
pip install playwright parsel
You also need to install chromium for playwright to work and operate the browser:
playwright install chromium
After that, if you're on Linux, you might need to install additional dependencies (playwright will prompt you in the terminal if something is missing):
sudo apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libatspi2.0-0 libwayland-client0
Reduce the chance of being blocked
There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping: there are eleven methods to bypass blocks from most websites, and some of them will be covered in this blog post.
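As a rough sketch of two commonly used techniques (not necessarily the ones from the linked post), the helpers below pick a user-agent for the browser context and randomize pauses between actions. The helper names and user-agent strings are assumptions for illustration; `user_agent` and `locale` are keyword arguments accepted by Playwright's `browser.new_context()`.

```python
import random

# Example user-agent strings; assumptions for illustration, swap in current ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def build_context_options(locale="en-GB"):
    """Return keyword arguments for Playwright's browser.new_context()."""
    return {"user_agent": random.choice(USER_AGENTS), "locale": locale}


def random_pause(low=2.0, high=5.0):
    """Pick a randomized delay in seconds so actions are not perfectly periodic."""
    return random.uniform(low, high)


options = build_context_options()
print(options)

# With a launched browser, the options would be used roughly like this:
#   context = browser.new_context(**options)
#   page = context.new_page()
#   time.sleep(random_pause())
```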
Full Code
import time, json, re
from parsel import Selector
from playwright.sync_api import sync_playwright


def run(playwright):
    page = playwright.chromium.launch(headless=True).new_page()
    page.goto("https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en_GB&gl=US")

    user_comments = []

    # if "See all reviews" button present
    if page.query_selector('.Jwxk6d .u4ICaf button'):
        print("the button is present.")

        print("clicking on the button.")
        page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)

        print("waiting a few sec to load comments.")
        time.sleep(4)

        last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

        while True:
            print("scrolling..")
            page.keyboard.press("End")
            time.sleep(3)

            new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')

            if new_height == last_height:
                break
            else:
                last_height = new_height

    selector = Selector(text=page.content())
    page.close()

    print("done scrolling. Extracting comments...")
    for index, comment in enumerate(selector.css(".RHo1pe"), start=1):
        comment_likes = comment.css(".AJTPZc::text").get()

        user_comments.append({
            "position": index,
            "user_name": comment.css(".X5PpBb::text").get(),
            "user_avatar": comment.css(".gSGphe img::attr(srcset)").get().replace(" 2x", ""),
            "user_comment": comment.css(".h3YV2d::text").get(),
            "comment_likes": comment_likes.split("people")[0].strip() if comment_likes else None,
            "app_rating": re.search(r"\d+", comment.css(".iXRFPc::attr(aria-label)").get()).group(),
            "comment_date": comment.css(".bp9Aid::text").get(),
            "developer_comment": {
                "dev_title": comment.css(".I6j64d::text").get(),
                "dev_comment": comment.css(".ras4vb div::text").get(),
                "dev_comment_date": comment.css(".I9Jtec::text").get()
            }
        })

    print(json.dumps(user_comments, indent=2, ensure_ascii=False))


with sync_playwright() as playwright:
    run(playwright)
Code Explanation
Import libraries:
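import time, json, re
from parsel import Selector
from playwright.sync_api import sync_playwright
- time to set sleep() intervals between each scroll.
- json just for pretty printing.
- re to extract the rating number with a regular expression.
- Selector from parsel to parse the HTML with CSS selectors.
- sync_playwright for the synchronous API. playwright has an asynchronous API as well, using the asyncio module.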
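For comparison, here is a minimal sketch of what the asynchronous API mentioned above looks like. The function name `fetch_title` is hypothetical and not part of this post's scraper, and playwright is imported inside the function only so the sketch can be loaded without launching anything.

```python
async def fetch_title(url):
    """Open a page with Playwright's async API and return its title."""
    # Imported lazily; requires `pip install playwright`.
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title


# Running it requires an event loop, for example:
#   import asyncio
#   print(asyncio.run(fetch_title("https://example.com")))
```

Every call that touches the browser is awaited, which is the main visible difference from the synchronous code used in the rest of this post.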
Declare a function:
def run(playwright):
# further code..
Initialize playwright, connect to chromium, launch() a browser, open a new_page() and goto() a given URL:
page = playwright.chromium.launch(headless=False).new_page()
page.goto("https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en_GB&gl=US")
user_comments = [] # temporary list for all extracted data
- playwright.chromium is a connection to the Chromium browser instance.
- launch() will launch the browser, and the headless argument controls whether it runs in headless mode. Default is True (the snippet above passes False, so the browser window is visible).
- new_page() creates a new page in a new browser context.
- page.goto("URL") will make a request to the provided website.
Next, we need to check if the button responsible for showing all reviews is present and click on it if present:
if page.query_selector('.Jwxk6d .u4ICaf button'):
    print("the button is present.")

    print("clicking on the button.")
    page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)

    print("waiting a few sec to load comments.")
    time.sleep(4)
- query_selector is a function that accepts a CSS selector to search for.
- click is to click on the button, and force=True will bypass Playwright's actionability checks and click immediately.
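The snippet above pauses for a fixed four seconds. A possible alternative, sketched below, waits for the reviews container instead. It assumes that the `.fysCi` element only appears after the button is clicked; the function name `open_all_reviews` and its parameters are hypothetical, while `page.query_selector()` and `page.wait_for_selector()` are Playwright page methods.

```python
def open_all_reviews(page, button_selector=".Jwxk6d .u4ICaf button",
                     dialog_selector=".fysCi", timeout_ms=10_000):
    """Click the "See all reviews" button and wait for the reviews container.

    Returns True if the button was found and clicked, False otherwise.
    """
    button = page.query_selector(button_selector)
    if button is None:
        return False

    button.click()
    # Blocks until the element appears; Playwright raises an error on timeout.
    page.wait_for_selector(dialog_selector, timeout=timeout_ms)
    return True
```

Waiting on a selector usually returns as soon as the element exists, so it tends to be both faster and less fragile than a fixed sleep on slow connections.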
Scroll to the bottom of the comments window:
last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

while True:
    print("scrolling..")
    page.keyboard.press("End")
    time.sleep(3)

    new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')

    if new_height == last_height:
        break
    else:
        last_height = new_height
- page.evaluate() will run JavaScript code in the browser context that reads the scroll position of the element matching the .fysCi selector. scrollTop gets the number of pixels that the element's content has been scrolled vertically.
- time.sleep(3) will stop code execution for 3 seconds to load more comments.
- Then it will measure a new_height after the scroll by running the same measurement JavaScript code.
- Finally, it will check if new_height == last_height, and if so, exit the while loop by using break.
- else set last_height to new_height and run the iteration (scroll) again.
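The same scroll-until-nothing-changes logic can be pulled out into a small helper with an upper bound on iterations, so the loop cannot run forever if the page keeps loading. This is only a sketch: the name `scroll_until_stable` and the `max_scrolls` limit are assumptions, not part of the original script.

```python
import time


def scroll_until_stable(get_position, scroll_once, pause=3.0, max_scrolls=200):
    """Scroll until the position stops changing or max_scrolls is reached.

    get_position: callable returning the current scroll offset.
    scroll_once:  callable performing one scroll step.
    Returns the number of scroll steps performed.
    """
    last_position = get_position()

    for scroll_count in range(1, max_scrolls + 1):
        scroll_once()
        time.sleep(pause)  # give new comments time to load

        new_position = get_position()
        if new_position == last_position:
            return scroll_count  # nothing moved, so the bottom was reached
        last_position = new_position

    return max_scrolls


# With a Playwright page it could be called like this:
#   scroll_until_stable(
#       get_position=lambda: page.evaluate('() => document.querySelector(".fysCi").scrollTop'),
#       scroll_once=lambda: page.keyboard.press("End"),
#   )
```

Passing the two actions in as callables keeps the loop independent of the browser, which also makes it easy to try out with fake data.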
After that, pass the scrolled HTML content to parsel and close the page:
selector = Selector(text=page.content())
page.close()
Iterate over all results after the while loop is done:
for index, comment in enumerate(selector.css(".RHo1pe"), start=1):
    comment_likes = comment.css(".AJTPZc::text").get()

    user_comments.append({
        "position": index,
        "user_name": comment.css(".X5PpBb::text").get(),
        "user_avatar": comment.css(".gSGphe img::attr(srcset)").get().replace(" 2x", ""),
        "user_comment": comment.css(".h3YV2d::text").get(),
        "comment_likes": comment_likes.split("people")[0].strip() if comment_likes else None,
        "app_rating": re.search(r"\d+", comment.css(".iXRFPc::attr(aria-label)").get()).group(),
        "comment_date": comment.css(".bp9Aid::text").get(),
        "developer_comment": {
            "dev_title": comment.css(".I6j64d::text").get(),
            "dev_comment": comment.css(".ras4vb div::text").get(),
            "dev_comment_date": comment.css(".I9Jtec::text").get()
        }
    })
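In the loop above, `.get()` returns `None` when a selector matches nothing, so chained calls such as `.replace()` or `re.search()` would raise an error on a review that lacks an avatar or rating. The helpers below are a hedged sketch of None-safe versions; their names are hypothetical, and the sample strings in the comments are assumed formats rather than captured page data.

```python
import re


def clean_avatar(srcset):
    """Drop the ' 2x' density descriptor from a srcset value, if present."""
    return srcset.replace(" 2x", "") if srcset else None


def parse_rating(aria_label):
    """Return the first number in the aria-label as an int, or None."""
    match = re.search(r"\d+", aria_label or "")
    return int(match.group()) if match else None


def parse_likes(text):
    """Return the like count as an int, or None if absent or not numeric."""
    if not text:
        return None
    digits = re.sub(r"[^\d]", "", text.split("people")[0])
    return int(digits) if digits else None


# Assumed input formats, for illustration:
print(clean_avatar("https://example.com/a=s64-rw 2x"))
print(parse_rating("Rated 4 stars out of five stars"))
print(parse_likes("1,234 people found this review helpful"))
print(parse_rating(None))
```

Unlike the original loop, these helpers return integers instead of strings for the rating and likes, which is a design choice that makes later sorting or averaging simpler.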
Print the data:
print(json.dumps(user_comments, indent=2, ensure_ascii=False))
Run your code using a context manager:
with sync_playwright() as playwright:
    run(playwright)
Output
[
{
"position": 1,
"user_name": "Selby Warren",
"user_avatar": "https://play-lh.googleusercontent.com/a-/ACNPEu9_6h31fmuFO-BQOYPjA2oVz9sJXxaI6sL3ZuPdrw=s64-rw",
"user_comment": "Tried logging in on multiple different devices, reset the password, uninstalled then reinstalled, all to no avail. The old app was fine, just update that one instead of creating a new one full of errors. @BN the issue has NOT been resolved. The issue is with the app, not the account, so there is nothing customer service can do.",
"comment_likes": "9",
"app_rating": "1",
"comment_date": "2 September 2022",
"developer_comment": {
"dev_title": "Barnes & Noble",
"dev_comment": "Sorry for the difficulties you had signing in. This issue has been addressed, Please try it again now. If the issue persists, contact us at service@bn.com with the account details.",
"dev_comment_date": "2 September 2022"
}
}, ... other results
{
"position": 875,
"user_name": "Originalbigguy",
"user_avatar": "https://play-lh.googleusercontent.com/a/ALm5wu3dYTOHvlG8SUqgyTbRnjv9I49JtxgySY-RwTJU=s64-rw-mo",
"user_comment": "Not free",
"comment_likes": null,
"app_rating": "1",
"comment_date": "9 April 2021",
"developer_comment": {
"dev_title": "Collectorz.com",
"dev_comment": "The app is never advertised as free anywhere. The app information clearly states this is a paid subscription app.\n",
"dev_comment_date": "10 April 2021"
}
}
]
Using Google Play Product Reviews API
As we support extracting review data from Google Play apps, this section shows a comparison between the DIY solution and our solution.
The biggest difference is that you don't need to use browser automation to scrape results, or create a parser from scratch and maintain it.
Keep in mind that there's also a chance that the request might be blocked by Google at some point (or hit a CAPTCHA). We handle that on our backend.
Install google-search-results from PyPI:
pip install google-search-results
from serpapi import GoogleSearch
from urllib.parse import (parse_qsl, urlsplit)
import json

params = {
    "api_key": "...",                                        # your serpapi api key
    "engine": "google_play_product",                         # serpapi parsing engine
    "store": "apps",                                         # app results
    "gl": "us",                                              # country of the search
    "hl": "en",                                              # language of the search
    "product_id": "com.collectorz.javamobile.android.books"  # app id
}

search = GoogleSearch(params)  # where data extraction happens on the backend

reviews = []

while True:
    results = search.get_dict()  # JSON -> Python dict

    for review in results["reviews"]:
        reviews.append({
            "title": review.get("title"),
            "avatar": review.get("avatar"),
            "rating": review.get("rating"),
            "likes": review.get("likes"),
            "date": review.get("date"),
            "snippet": review.get("snippet"),
            "response": review.get("response")
        })

    # pagination: copy the query parameters of the "next" URL into the search
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination", {}).get("next")).query)))
    else:
        break

print(json.dumps(reviews, indent=2, ensure_ascii=False))
Output:
[
{
"title": "JazzTripp",
"avatar": "https://play-lh.googleusercontent.com/a-/ACNPEu8THUUDL3yzcd0bHSDRR4OegOWLmfbFi70On0HbRg",
"rating": 5.0,
"likes": 20,
"date": "May 06, 2022",
"snippet": "This app takes a bit if getting used to at first, but the catalogue is extensive, and most bar codes and isbn numbers can be used to autofill a good chuck of a collection. I personally use this app for manga, and while its only correct about 70% of the time, its still easy to update and change as you see fit. The 'add to core' option makes me feel like im actually helping out the app, so i add data whenever i can. Keep up the good work guys!",
"response": null
}, ... other reviews
{
"title": "Originalbigguy",
"avatar": "https://play-lh.googleusercontent.com/a/ALm5wu3dYTOHvlG8SUqgyTbRnjv9I49JtxgySY-RwTJU=mo",
"rating": 1.0,
"likes": 0,
"date": "April 09, 2021",
"snippet": "Not free",
"response": {
"title": "Collectorz.com",
"snippet": "The app is never advertised as free anywhere. The app information clearly states this is a paid subscription app.",
"date": "April 10, 2021"
}
}
]
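The pagination step in the API loop above packs several calls into one line. The sketch below unpacks it using only the standard library. The `next_url` value, the `com.example.app` product ID and the `next_page_token` parameter name are placeholders chosen for illustration; the real URL comes from the `serpapi_pagination` part of the response.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical "next" URL for illustration; not a captured API response.
next_url = (
    "https://serpapi.com/search.json"
    "?engine=google_play_product&product_id=com.example.app&next_page_token=abc123"
)

# Parameters of the current search (the API key is deliberately elided).
params = {
    "engine": "google_play_product",
    "product_id": "com.example.app",
    "api_key": "...",
}

query_string = urlsplit(next_url).query        # everything after the "?"
next_params = dict(parse_qsl(query_string))    # list of pairs -> dict

params.update(next_params)  # existing keys are kept, the page token is added
print(params)
```

Because `update()` only overwrites keys that appear in the next URL, values such as the API key stay in place between pages.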