Python selenium multiprocessing

I've written a script in Python, in combination with Selenium, to scrape the links of different posts from a site's landing page and then get the title of each post by following the URL to its inner page. Although the content I parse here is static, I used Selenium to see how it performs when run with multiprocessing.

However, my intention is to do the scraping using multiprocessing. Until now I thought Selenium didn't support multiprocessing, but it seems I was wrong.



My question: how can I reduce the execution time when Selenium is made to run using multiprocessing?



This is my attempt (it works):



import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

def get_links(link):
    # Collect the absolute URLs of the posts listed on the landing page
    # (static HTML, so plain requests + BeautifulSoup is enough here)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    titles = [urljoin(link, item.get("href")) for item in soup.select(".summary .question-hyperlink")]
    return titles

def get_title(url):
    # One headless Chrome instance per URL (this is the expensive part)
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    driver.get(url)
    sauce = BeautifulSoup(driver.page_source, "lxml")
    item = sauce.select_one("h1 a").text
    print(item)
    driver.quit()  # quit the browser so headless Chrome processes don't pile up

if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    ThreadPool(5).map(get_title, get_links(url))









python python-3.x selenium web-scraping multiprocessing

asked Nov 26 '18 at 6:10 by robots.txt, edited Nov 28 '18 at 10:10
  • Anytime multiprocessing comes into play, it becomes a good opportunity to consider switching from Selenium to headless Chrome. – pguardiario Nov 26 '18 at 7:48

  • @QHarr - there are Node libraries like Puppeteer and NightmareJS that are better suited to things like this than Selenium. Selenium is more popular because it's been around forever, but it's a bit of a dinosaur and mostly suited to simpler scripts. IMHO at least. – pguardiario Nov 26 '18 at 8:47

  • No, actually I was suggesting switching from Python to Node. – pguardiario Nov 26 '18 at 9:39

  • @pguardiario Thanks. Also learning JS at the moment, so that is handy. – QHarr Nov 26 '18 at 9:41

  • Selenium is the wrong tool for web scraping. Use the open-source Scrapy Python package instead. It does multiprocessing out of the box, it is easy to write new scripts, and it can store the data in files or a database. – miraculixx Nov 28 '18 at 8:47

2 Answers

7 (+50 bounty) – answered Nov 28 '18 at 11:14 by miraculixx, edited Nov 28 '18 at 11:52

"how can I reduce the execution time using selenium when it is made to run using multiprocessing?"




A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:



# (... skipped for brevity ...)
import threading  # needed for threading.local()

threadLocal = threading.local()

def get_driver():
    # Create the webdriver only once per thread and cache it in thread-local
    # storage; subsequent calls from the same thread reuse the same browser
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(chrome_options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
    return driver

def get_title(url):
    driver = get_driver()
    driver.get(url)
    # (...)

# (...)


On my system this reduces the time from 1m7s to just 24.895s, i.e. to roughly a third of the original run time. To test it yourself, download the full script.
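
One caveat worth adding (my note, not part of the original answer): the thread-local drivers are never explicitly quit, so the headless Chrome processes stay alive until the worker threads die. A minimal sketch of one way to clean them up, assuming the threadLocal and webdriver setup shown above, is to record every driver created and quit them all at exit:

import atexit

created_drivers = []  # registry of every driver created, for cleanup at exit

def get_driver():
    # Same as above, except each newly created driver is also recorded
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(chrome_options=chromeOptions)
        setattr(threadLocal, 'driver', driver)
        created_drivers.append(driver)  # list.append is thread-safe in CPython
    return driver

@atexit.register
def quit_all_drivers():
    # Quit every browser the worker threads created once the script exits
    for driver in created_drivers:
        driver.quit()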



Note: ThreadPool uses threads, which are constrained by the Python GIL. That's OK as long as the task is mostly I/O bound, as it is here. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead. This launches parallel processes which, as a group, are not constrained by the GIL. The rest of the code stays the same.
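
As a minimal sketch of that swap (assuming get_links and get_title are defined at module level as above, so the worker processes can import them; the pool size of 5 just mirrors the original):

from multiprocessing import Pool

if __name__ == '__main__':
    url = "https://stackoverflow.com/questions/tagged/web-scraping"
    # Worker processes instead of threads: each process runs its own
    # interpreter (and creates its own thread-local driver), so the
    # workers are not constrained by a shared GIL
    with Pool(5) as pool:
        pool.map(get_title, get_links(url))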






  • Now your solution looks very promising, @miraculixx. If you care to paste the full script, I'd be very glad to accept your answer, because I highly doubt I can implement it myself. This answer definitely deserves upvotes. – robots.txt Nov 28 '18 at 11:46

  • @robots.txt Glad you like it :-) A link to the full script has been added as a gist (this way the answer can stay brief, i.e. point out only the differences from your script). – miraculixx Nov 28 '18 at 11:53

  • Let's wait as long as the bounty is on. Accepted your solution already, @miraculixx. – robots.txt Nov 28 '18 at 12:11

  • Clean code and a beautiful explanation! – DebanjanB Nov 28 '18 at 14:20

4 – answered Nov 28 '18 at 9:01 by miraculixx, edited Nov 28 '18 at 11:28

"My question: how can I reduce the execution time?"




Selenium seems the wrong tool for web scraping, though I appreciate YMMV, in particular if you need to simulate user interaction with the website or there is some JavaScript limitation/requirement.



For scraping tasks without much interaction, I have had good results using the open-source Scrapy Python package for large-scale scraping. It does multiprocessing out of the box, it is easy to write new scripts and store the data in files or a database, and it is really fast.



Your script would look something like this when implemented as a fully parallel Scrapy spider (note: I did not test this; see the documentation on selectors).



import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        # Yield the absolute link of every post on the landing page
        for href in response.css('.summary .question-hyperlink::attr(href)').getall():
            yield {'link': response.urljoin(href)}


To run, put this into blogspider.py and run:



$ scrapy runspider blogspider.py


See the Scrapy website for a complete tutorial.
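
The original script also follows each link to read the title on the inner page. A hypothetical extension of the spider along those lines (equally untested; the TitleSpider name and parse_title callback are my own illustration, not from the tutorial):

import scrapy

class TitleSpider(scrapy.Spider):
    # Follow each post link and scrape the inner page's title,
    # mirroring what get_title() does in the original script
    name = 'titlespider'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        for href in response.css('.summary .question-hyperlink::attr(href)').getall():
            # response.follow resolves relative URLs and schedules the request
            yield response.follow(href, callback=self.parse_title)

    def parse_title(self, response):
        # Same selector idea as the original soup.select_one("h1 a")
        yield {'title': response.css('h1 a::text').get()}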



Note that Scrapy also supports JavaScript through scrapy-splash, thanks to the pointer by @SIM. I haven't had any exposure to it so far, so I can't speak to it other than that it looks well integrated with how Scrapy works.






  • I'm overly startled to see that I've got a solution on Scrapy. Isn't the title of my post explicit enough about what I wish to accomplish? – robots.txt Nov 28 '18 at 9:18

  • "Selenium is the wrong tool for web scraping. Use the open-source Scrapy" - is it though? – Nutle Nov 28 '18 at 9:32

  • @robots.txt No, your question is how can I reduce execution time. As I asked in a previous comment, please specify the results of your attempt versus your expectations; you may get better answers. – miraculixx Nov 28 '18 at 10:13

  • Why is Selenium the wrong tool? How do you handle JavaScript with Scrapy? – Miles Davis Nov 28 '18 at 10:13

  • Selenium definitely is not the wrong tool when it comes to scraping content from websites, irrespective of whether they are dynamic or not. However, in the case of Scrapy there is a lightweight tool, Splash, available to do the trick as well. – SIM Nov 28 '18 at 10:22