Pagination Techniques to Scrape Data from any Website in Python
Intro
In this blog post, we'll go over the most frequent pagination techniques that can be applied to perform dynamic pagination on any website. This blog post is ongoing and will be updated as new techniques are discovered.
Dynamic vs Hardcoded Pagination
What the heck is dynamic pagination?
Well, it's simply a way to paginate through all available pages without knowing in advance how many there are; the loop just goes through them all:
while True:
    requests.get('<website_url>')
    # data extraction code
    # condition to paginate to the next page or to exit pagination
The hardcoded approach differs in that we explicitly write the number of pages N we want to paginate over:
# hardcoded way to paginate from the 1st to the 25th page
for page_num in range(1, 26):
    requests.get('<website_url>')
    # data extraction code
It's an easy approach if we need to extract data from a known number of pages. But what if we need to extract all pages from several categories on the same website, and each category contains a different number of pages?
The thing is, with the hardcoded approach we'll reach the point where we need to update the page numbers for every category, which is not particularly satisfying :-)
Dynamic pagination exit condition
Let's also stop for a second and look at another difference between the dynamic `while True` pagination and the `for page_num in range(...)` approach.
Have you noticed the comment in the dynamic pagination: "condition to paginate to the next page or to exit pagination"?
This means that whenever we use dynamic pagination, we always need some condition to exit the infinite loop. It could be: an element disappeared, the previous page number differs from the current one, the heights of the elements are the same after a scroll, etc.
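To make that concrete, here's a minimal sketch of such an exit condition. The `.next-page` selector is a made-up placeholder; the real selector depends on the website:

import requests
from bs4 import BeautifulSoup

params = {"page": 1}

while True:
    html = requests.get("<website_url>", params=params, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    # data extraction code

    # exit condition: paginate while the hypothetical "next page" element exists
    if soup.select_one(".next-page"):
        params["page"] += 1
    else:
        break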
Prerequisites
If you want to try along, let's create a separate environment first.
If you're on Linux:
python -m venv env && source env/bin/activate
If you're on Windows and using Git Bash:
python -m venv env && source env/Scripts/activate
Next, install the needed libraries:
$ pip install requests bs4 parsel playwright
- `requests`: make a request to a website.
- `bs4`: HTML parser.
- `parsel`: another HTML parser, faster than `bs4`, used in a few examples.
- `playwright`: modern browser automation.
For `playwright`, if you're on Linux, you also need to install additional dependencies:
$ sudo apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libatspi2.0-0 libwayland-client0
After that we need to install chromium (or other browsers):
$ playwright install chromium
Types of pagination
These are the four most frequent types of pagination:
- Token pagination, using a unique token. For example: `SAOxcijdaf#Ad21`.
- Non-token pagination, using digits. For example: 1, 2, 3, 4, 5.
- Click pagination. For example: clicking on the next page button until the button disappears.
- Scroll or JavaScript evaluation pagination. For example: scrolling the page until there are no more reviews left. The same could be done by evaluating JS code.
📌These types of pagination can be combined with one another.
For example, non-token and token pagination values may need to be updated at the same time to paginate to the next page. This is how Google Scholar Profiles pagination works without using browser automation.
Another example of combined pagination is combining scrolls with clicks as they become needed (when a certain button appears), as sketched below.
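A rough Playwright sketch of that scroll-plus-click combination. The `.show-more` selector is a hypothetical "Show more" button, not taken from a real website:

# page = playwright.chromium.launch(headless=True).new_page()
# page.goto('<website_url>')
while True:
    page.keyboard.press("End")    # scroll towards the bottom
    page.wait_for_timeout(2000)   # give new content time to load

    # click the hypothetical "Show more" button whenever it appears
    if page.query_selector(".show-more"):
        page.query_selector(".show-more").click()

    # an exit condition is still needed here, e.g. the height check shown later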
Token Pagination
Token pagination is when a website generates a token responsible for retrieving the next page's data. It could look something like this: `27wzAGn-__8J`.
This token will most likely be passed as a URL parameter. For example, on the Google Scholar Profiles page it looks like this:
# ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
https://scholar.google.com/citations?mauthors=biology&after_author=27wzAGn-__8J
In some cases this token needs to be combined with other parameters. For example, the Google Scholar Profiles page has:
- an `astart` parameter, which is responsible for the page number.
- an `after_author` parameter, which holds the token responsible for the next page.
# ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
https://scholar.google.com/citations?mauthors=biology&after_author=27wzAGn-__8J&astart=10
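Rather than formatting the URL by hand, these parameters can be passed to `requests` as a dict, and the library will encode them into the URL. A small sketch; the token value here is the example one from above, while in a real run it would be parsed from the previous page:

import requests

params = {
    "mauthors": "biology",           # search query
    "after_author": "27wzAGn-__8J",  # next page token from the previous page
    "astart": 10                     # page number
}

html = requests.get("https://scholar.google.com/citations", params=params, timeout=30)
print(html.url)
# https://scholar.google.com/citations?mauthors=biology&after_author=27wzAGn-__8J&astart=10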
Dynamic Pagination with Token based Websites
Dynamic pagination on token-based websites happens by parsing the next page token, which can be located in the page source (`CTRL`+`U`), somewhere in the inline JSON:
from bs4 import BeautifulSoup
import requests, lxml, re

params = {
    "view_op": "search_authors",  # profiles tab
    "mauthors": "blizzard",       # search query
    "hl": "en",                   # language of the search
    "astart": 0                   # page number
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

authors_is_present = True
while authors_is_present:
    html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for author in soup.select(".gs_ai_chpr"):
        name = author.select_one(".gs_ai_name a").text
        link = f'https://scholar.google.com{author.select_one(".gs_ai_name a")["href"]}'
        affiliations = author.select_one(".gs_ai_aff").text
        email = author.select_one(".gs_ai_eml").text
        try:
            cited_by = re.search(r"\d+", author.select_one(".gs_ai_cby").text).group()  # Cited by 17143 -> 17143
        except AttributeError:
            cited_by = None

        print(f"extracting authors at page #{params['astart']}.",
              name,
              link,
              affiliations,
              email,
              cited_by, sep="\n")

    # if the next page token exists, extract it from the HTML node attribute
    # and increment the `astart` parameter by 10
    if soup.select_one("button.gs_btnPR").get("onclick"):
        params["after_author"] = re.search(r"after_author\\x3d(.*)\\x26", str(soup.select_one("button.gs_btnPR")["onclick"])).group(1)  # -> XB0HAMS9__8J
        params["astart"] += 10
    else:
        authors_is_present = False
Non Token Pagination
Non-token pagination is simply when you increment the page number by some number N. It could be incremented by 1, 10 (Google Search), 11 (Bing Search), 100 (Google Scholar Author Articles), or some other number, depending on how pagination on a certain website functions.
As mentioned above, non-token pagination could be combined with token in order to perform pagination.
Dynamic Pagination with Non Token based Websites
Identifying whether a website uses non-token based pagination is simple. Keep an eye on the URL parameters, see if any digits are associated with them, and see if they change as you move between pages.
For example, Google Search has a `start` parameter:
# first page (no start parameter, or it can be set to 0 explicitly)
https://www.google.com/search?q=sushi
# second page ▼▼▼▼▼▼▼▼
https://www.google.com/search?q=sushi&start=10
# third page ▼▼▼▼▼▼▼▼
https://www.google.com/search?q=sushi&start=20
A code example of non-token pagination is Google Search. The following code scrapes all results from all pages.
from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "sushi",  # search query
    "hl": "en",    # language
    "gl": "us",    # country of the search, US -> USA
    "start": 0,    # page number, defaults to 0
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_num = 0
while True:
    page_num += 1
    print(f"{page_num} page:")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = f'Title: {result.select_one("h3").text}'
        link = f'Link: {result.select_one("a")["href"]}'
        try:
            description = f'Description: {result.select_one(".VwiC3b").text}'
        except AttributeError:
            description = None

        print(title, link, description, sep="\n", end="\n\n")

    # if the arrow button with the 'pnnext' id is present -> paginate
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
Click Pagination
As the name suggests, this type performs a click on the next page element, and it can only be used with browser automation such as `playwright` or `selenium`, because these libraries provide a method to click on a given element.
Dynamic Pagination with Clicks
All we need to do is find the button (or whatever element is responsible for the next page) via a CSS selector or XPath, and then call the `click()` method:
# page = playwright.chromium.launch(headless=True).new_page()
# page.goto('<website_url>')
page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)
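To actually paginate, the click is repeated until the element disappears. A minimal sketch of such a loop; the loop structure is an assumption, and the `<next_page_button_selector>` placeholder would be replaced with the target website's real selector:

# page = playwright.chromium.launch(headless=True).new_page()
# page.goto('<website_url>')
while True:
    button = page.query_selector('<next_page_button_selector>')
    if not button:
        break  # the button disappeared -> no more pages

    button.click(force=True)
    page.wait_for_timeout(2000)  # give the next page time to load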
Scroll or JavaScript Evaluation
The scroll pagination technique requires scrolling in order to paginate. Scrolls could be either top-to-bottom or side-to-side, depending on how a website works.
Dynamic Pagination with Scrolls
There are three frequent methods that I use to perform scrolls with either `playwright` or `selenium`:
- `page.keyboard.press(<key>)`
- `page.evaluate(<JS_code>)`
- `page.mouse.wheel(<scrollX>, <scrollY>)`
Pressing a keyboard button to perform a scroll down:
# page = playwright.chromium.launch(headless=True).new_page()
# page.goto('<website_url>')
page.keyboard.press('End')  # scrolls to the possible end of the page
Evaluating JavaScript code to perform a scroll down:
# page = playwright.chromium.launch(headless=True).new_page()
# page.goto('<website_url>')
page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
                 scrollingElement.scrollTop = scrollingElement.scrollHeight;""")
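And scrolling with the mouse wheel. Playwright's `page.mouse.wheel(delta_x, delta_y)` scrolls by pixel deltas; the 1000 here is an arbitrary example value:

# page = playwright.chromium.launch(headless=True).new_page()
# page.goto('<website_url>')
page.mouse.wheel(0, 1000)  # scroll down by 1000 pixels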
📌We have to keep in mind that whenever we use scroll pagination, we always need a condition check that compares the height of a certain element before and after a scroll.
If the height before and after the scroll is the same, it's a signal that there's no more content left to load, so we can exit the loop:
last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

while True:
    print("scrolling..")
    page.keyboard.press("End")
    time.sleep(3)

    new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2800

    if new_height == last_height:
        break
    else:
        last_height = new_height
Here's a complete example from one of my blog posts with a step-by-step explanation of scraping all Google Play app reviews in Python, which uses click, keyboard press, and evaluate to check the current height:
import time, json, re
from parsel import Selector
from playwright.sync_api import sync_playwright


def run(playwright):
    page = playwright.chromium.launch(headless=True).new_page()
    page.goto("https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en_GB&gl=US")

    user_comments = []

    # if the "See all reviews" button is present
    if page.query_selector('.Jwxk6d .u4ICaf button'):
        print("the button is present.")
        print("clicking on the button.")
        page.query_selector('.Jwxk6d .u4ICaf button').click(force=True)
        print("waiting a few sec to load comments.")
        time.sleep(4)

    last_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')  # 2200

    while True:
        print("scrolling..")
        page.keyboard.press("End")
        time.sleep(3)

        new_height = page.evaluate('() => document.querySelector(".fysCi").scrollTop')

        if new_height == last_height:
            break
        else:
            last_height = new_height

    selector = Selector(text=page.content())
    page.close()

    print("done scrolling. Extracting comments...")

    for index, comment in enumerate(selector.css(".RHo1pe"), start=1):
        comment_likes = comment.css(".AJTPZc::text").get()

        user_comments.append({
            "position": index,
            "user_name": comment.css(".X5PpBb::text").get(),
            "user_avatar": comment.css(".gSGphe img::attr(srcset)").get().replace(" 2x", ""),
            "user_comment": comment.css(".h3YV2d::text").get(),
            "comment_likes": comment_likes.split("people")[0].strip() if comment_likes else None,
            "app_rating": re.search(r"\d+", comment.css(".iXRFPc::attr(aria-label)").get()).group(),
            "comment_date": comment.css(".bp9Aid::text").get(),
            "developer_comment": {
                "dev_title": comment.css(".I6j64d::text").get(),
                "dev_comment": comment.css(".ras4vb div::text").get(),
                "dev_comment_date": comment.css(".I9Jtec::text").get()
            }
        })

    print(json.dumps(user_comments, indent=2, ensure_ascii=False))


with sync_playwright() as playwright:
    run(playwright)
An example from another blog post of mine that shows how to scrape all Naver video results:
from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://search.naver.com/search.naver?where=video&query=minecraft")

    video_results = []

    not_reached_end = True
    while not_reached_end:
        page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
                         scrollingElement.scrollTop = scrollingElement.scrollHeight;""")

        if page.locator("#video_max_display").is_visible():
            not_reached_end = False

    for index, video in enumerate(page.query_selector_all(".video_bx"), start=1):
        title = video.query_selector(".text").inner_text()
        link = video.query_selector(".info_title").get_attribute("href")
        thumbnail = video.query_selector(".thumb_area img").get_attribute("src")
        channel = None if video.query_selector(".channel") is None else video.query_selector(".channel").inner_text()
        origin = video.query_selector(".origin").inner_text()
        video_duration = video.query_selector(".time").inner_text()
        views = video.query_selector(".desc_group .desc:nth-child(1)").inner_text()
        date_published = None if video.query_selector(".desc_group .desc:nth-child(2)") is None else \
            video.query_selector(".desc_group .desc:nth-child(2)").inner_text()

        video_results.append({
            "position": index,
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "channel": channel,
            "origin": origin,
            "video_duration": video_duration,
            "views": views,
            "date_published": date_published
        })

    print(json.dumps(video_results, indent=2, ensure_ascii=False))

    browser.close()
In the Naver pagination example, have a look at the `if` condition that exits the infinite loop:
if page.locator("#video_max_display").is_visible():
    not_reached_end = False
Conclusion
Keep an eye on the URL parameters. If something changes when pagination is performed, it could be a sign that those parameters can be used to paginate programmatically.
Try to find the next page tokens in the page source.
If nothing can be found from the points above, use either click or scroll pagination, or both.
Hope you found it useful. Let me know if something is still confusing.