Amazon Search Result Scraper using Python
Many of you have heard the term web scraping. In this post we'll see what web scraping is and how to perform it using Python. Before we proceed, note that we'll write a program that scrapes the contents of Amazon.com's search results, and we'll also see how to store the results in JSON (JavaScript Object Notation) format.
What is Web Scraping?
- Web scraping is a way of extracting data from websites. The data can be anything on the site, such as images, videos, or other useful content.
- Python has many libraries for web scraping, such as requests, BeautifulSoup, Selenium, and RPA.
Requirements and Installations:-
For this Amazon Search Result Scraper project we require the following things:
- Selenium library: open your terminal or command prompt and run "pip install selenium" to install Selenium.
- The second thing we require is the web driver for the browser you are using. I am using Google Chrome, so I'll use the Chrome web driver.
- If you are using another browser, you need the web driver for that browser; a quick Google search will find it.
You have to select the web driver that matches your browser version.
For example, my Chrome browser version is 84.0.4147, so I will select web driver version 84.0.4147.30.
Download it and rename it to "chromedriver.exe".
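The version check described above can be sketched in a few lines of Python. This is only an illustration: the function name driver_matches_browser is made up for this post, and it assumes that comparing the first three parts of the version number (as in the 84.0.4147 example) is enough to pick a compatible driver.

```python
def driver_matches_browser(browser_version: str, driver_version: str) -> bool:
    """Return True when the first three version components are the same."""
    browser_parts = browser_version.split(".")[:3]
    driver_parts = driver_version.split(".")[:3]
    # Require all three components so that a bare "84" never counts as a match
    return len(browser_parts) == 3 and browser_parts == driver_parts


# Versions taken from the example above
print(driver_matches_browser("84.0.4147", "84.0.4147.30"))
# A driver built for a different Chrome release
print(driver_matches_browser("84.0.4147", "83.0.4103.39"))
```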
Now make a folder named "Amazon Scraper". Inside that folder, create a Web_Scraper.py file and a folder named "chromedriver", and put your chromedriver.exe in the chromedriver folder.
Project Folder Structure
Coding:-
Now that everything is set, we can start coding. Open the Web_Scraper.py file in your favorite code editor; I am using Jupyter Notebook.
Step 1:- Import the required libraries and modules
import selenium
from selenium import webdriver
import time
import os
from selenium.webdriver.common import keys
import json
Step 2:- Initialize the web driver with the site Amazon.com
# Initializing the Chrome driver
chrome_driver = os.path.join(os.getcwd(), "chromedriver", "chromedriver.exe")
# Note: passing the path directly is the Selenium 3 style;
# Selenium 4 expects the path to be wrapped in a Service object
driver = webdriver.Chrome(chrome_driver)
driver.get("https://www.Amazon.com")
The above code will open the Amazon.com website in the Chrome driver.
Step 3:- Whenever we want to search for a product, we enter its name in the search box and press the search button, and the results are then displayed in an area of the web page. So we'll grab these elements in the program using the code below.
# Grabbing the search box
# Note: find_element_by_id is the Selenium 3 style; newer Selenium 4
# releases use driver.find_element(By.ID, ...) instead
search_box = driver.find_element_by_id("twotabsearchtextbox")
# Type the product name and press Enter to run the search
search_box.send_keys(input("Enter the Product to search") + "\n")
When you run the above code, it will ask you to enter the product to search. Let's say we want to search for smartphones: the moment you press Enter after giving the product name, it will show the results for smartphones.
Search results
Step 4:- Now do an Inspect Element again and locate the area in which all the search results are displayed. The results are displayed in elements with the "s-latency-cf-section" class, so we'll grab these elements.
results = driver.find_elements_by_class_name("s-latency-cf-section")
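If you installed a recent Selenium 4 release, the calls used in Steps 2 to 4 may no longer be available. The sketch below shows roughly how the same steps could look with the Selenium 4 API. The function name search_amazon and its parameters are assumptions made for this illustration, and the element id and class name are simply the ones from the steps above, so they may differ on the live site today.

```python
import os


def search_amazon(product_name, driver_path=None):
    """Open Amazon.com, run a search and return the driver and result elements."""
    # Imported inside the function so this file can be loaded even when
    # Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    if driver_path is None:
        # Same folder layout as in the requirements section
        driver_path = os.path.join(os.getcwd(), "chromedriver", "chromedriver.exe")

    # Selenium 4 takes the driver location through a Service object
    driver = webdriver.Chrome(service=Service(driver_path))
    driver.get("https://www.amazon.com")

    # Same element id and class name as in the steps above
    search_box = driver.find_element(By.ID, "twotabsearchtextbox")
    search_box.send_keys(product_name, Keys.ENTER)
    results = driver.find_elements(By.CLASS_NAME, "s-latency-cf-section")
    return driver, results
```

You would call it as driver, results = search_amazon("smartphones") and finish with driver.quit() once you are done with the results.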
Step 5:- First we grab the links of all the images present in the results. Use the code below to achieve that.
# Finding the images of the products
images = driver.find_elements_by_tag_name('img')
image_list = []
for each_image in images:
    # Keep only images with non-empty alt text; alt may be missing (None)
    alt_text = each_image.get_attribute('alt')
    if alt_text and alt_text.strip():
        image_list.append(each_image.get_attribute('src'))
Products in the results will be of 2 types: either a normal product or a Sponsored product.
Attribute sequence for Sponsored product:-
- Sponsored
- name
- ratings
- price
Attribute sequence for Normal Product:-
- name
- ratings
- price
Therefore we'll use the code below to store the information of each product in a dictionary.
# The results will be stored in data
data = {'products': []}
i = 0  # index into image_list
for each_result in results[1:]:
    each_detail = {}
    try:
        lines = each_result.text.split('\n')
        if lines != ['']:
            # Sponsored products have an extra "Sponsored" line before the name
            offset = 1 if lines[0] == "Sponsored" else 0
            each_detail['image_url'] = image_list[i]
            each_detail['name'] = lines[offset]
            each_detail['ratings'] = lines[offset + 1]
            each_detail['price'] = lines[offset + 2] + lines[offset + 3]
            data['products'].append(each_detail)
            i += 1
    except Exception:
        # Skip results that do not have the expected fields
        pass
print(data)
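To see how the two attribute sequences map to dictionary keys without opening a browser, here is a small standalone sketch. The function name parse_result_text and the sample text are made up for illustration; the real text returned by Amazon may have more or different lines.

```python
def parse_result_text(text, image_url):
    """Turn the text of one search result into a product dictionary."""
    lines = text.split("\n")
    # Sponsored results carry an extra "Sponsored" line before the name
    offset = 1 if lines[0] == "Sponsored" else 0
    if len(lines) < offset + 4:
        return None  # not enough lines to hold name, ratings and price
    return {
        "image_url": image_url,
        "name": lines[offset],
        "ratings": lines[offset + 1],
        # The price is spread over two lines, so both are joined
        "price": lines[offset + 2] + lines[offset + 3],
    }


# Made-up sample text, not real Amazon data
normal = "Example Phone\n4.5 out of 5 stars\n$199\n99"
sponsored = "Sponsored\n" + normal
print(parse_result_text(normal, "https://example.com/phone.jpg"))
print(parse_result_text(sponsored, "https://example.com/phone.jpg"))
```

Both calls should give the same dictionary, because the only difference between the two inputs is the leading "Sponsored" line.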
Step 6:- Now we'll store the result in the "search_result.json" file and use the code below to stop the web driver.
driver.quit()
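For the saving part of Step 6, a minimal sketch using the json module imported in Step 1 could look like this. The function name save_results and the sample data are assumptions for illustration; in the real script you would pass the data dictionary built in Step 5.

```python
import json


def save_results(data, path="search_result.json"):
    """Write the scraped data to a JSON file."""
    with open(path, "w", encoding="utf-8") as json_file:
        # indent makes the file easy to read; ensure_ascii=False keeps
        # non-English product names as they are
        json.dump(data, json_file, indent=4, ensure_ascii=False)


# Made-up sample with the same shape as the data dictionary from Step 5
sample = {
    "products": [
        {
            "image_url": "https://example.com/phone.jpg",
            "name": "Example Phone",
            "ratings": "4.5 out of 5 stars",
            "price": "$19999",
        }
    ]
}
save_results(sample)
```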
Demonstration:- Watch this video to see how it works.
Project Link:-
Thank You For Reading