How can I make this PHP script run faster/asynchronously?
I have a pastebin scraper script, which is designed to find leaked emails and passwords, to make a website like HaveIBeenPwned.
Here is what my script is doing:
- Scraping Pastebin links from https://psbdmp.ws/dumps
- Getting a random proxy using this Random Proxy API (because Pastebin bans your IP if you hammer too many requests): https://api.getproxylist.com/proxy
- Doing a CURL request to the Pastebin links, then doing a preg_match_all to find all the email addresses and passwords in the format email:password.
The actual script seems to be working alright, but it isn't optimized enough, and is giving me a 524 timeout error after some time, which I suspect is because of all those CURL requests.
Here is my code:
api.php
    function comboScrape_CURL($url) {
        // Get random proxy
        $proxies->json = file_get_contents("https://api.getproxylist.com/proxy");
        $proxies->decoded = json_decode($proxies->json);
        $proxy = $proxies->decoded->ip.':'.$proxies->decoded->port;
        list($ip,$port) = explode(':', $proxy);
        // Crawl with proxy
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL,$url);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_HEADER, 1);
        $curl_scraped_page = curl_exec($ch);
        curl_close($ch);
        comboScrape('email:pass',$curl_scraped_page);
    }
index.php
    require('api.php');
    $expression = '/(?:https:\/\/pastebin\.com\/\w+)/';
    $extension = ['','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20'];
    foreach($extension as $pge_number) {
        $dumps = file_get_contents("https://psbdmp.ws/dumps/".$pge_number);
        preg_match_all($expression,$dumps,$urls);
        $codes = str_replace('https://pastebin.com/','',$urls[0]);
        foreach ($codes as $code) {
            comboScrape_CURL("https://pastebin.com/raw/".$code);
        }
    }
php regex curl php-curl pastebin
"because Pastebin bans your IP if you hammer too many requests" PS, you can buy a Pastebin.com Pro account, and you get access to a scraping API that isn't limited.
– hanshenrik
Nov 24 '18 at 11:30
I've already tried opening Pastebin with a proxy, and it works, but as soon as I try using the proxy API it stops.
– Mark Adewale
Nov 24 '18 at 11:31
if you buy a pastebin.com pro account, you get access to the scraping api ( pastebin.com/doc_scraping_api ) which can fetch about 86400 pastes per day without getting ip banned.
– hanshenrik
Nov 24 '18 at 11:34
Thanks, @hanshenrik, ended up buying PRO in the end :)
– Mark Adewale
Nov 24 '18 at 11:59
asked Nov 24 '18 at 11:20
Mark Adewale
1 Answer
"524 timeout error" - it looks like you're running PHP behind a web server (Apache? nginx? lighttpd? IIS?). Don't do that. Run your code from php-cli instead: under php-cli, max_execution_time defaults to 0, so the script can run indefinitely without timing out.
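As a rough sketch of the php-cli advice, a guard like the following could sit at the top of index.php so the scraper refuses to run through a web server. The helper name `isCli` is made up for illustration; `PHP_SAPI` and `set_time_limit` are standard PHP.

```php
<?php
// Hypothetical helper: true when the given SAPI name is the command-line one.
function isCli(string $sapi = PHP_SAPI): bool
{
    return $sapi === 'cli';
}

if (!isCli()) {
    // Served through a web server: refuse instead of risking a gateway timeout.
    http_response_code(403);
    exit("Run this scraper from the command line: php index.php\n");
}

// Already the default under php-cli; set explicitly in case php.ini overrides it.
set_time_limit(0);
```

You would then start the scraper with `php index.php` from a shell (or from cron) rather than by opening it in a browser.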
"because Pastebin bans your IP if you hammer too many requests" - buy a pastebin.com Pro account instead (https://pastebin.com/pro). It costs about $50 (or $20 around Christmas and Black Friday), is a lifetime account with a one-time payment, and gives you access to the scraping API (https://pastebin.com/doc_scraping_api). With the scraping API you can fetch about 1 paste per second, or 86400 pastes per day, without getting IP banned.
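To stay at roughly one request per second, a simple sequential loop with a pause is enough. The sketch below is only an illustration: `fetchThrottled` and its parameters are hypothetical names, and the fetcher is passed in as a callable so the exact scraping API URLs (see the documentation page linked above) are left to the caller.

```php
<?php
/**
 * Fetch each URL in turn, pausing so that consecutive requests start
 * at least $intervalSeconds apart.
 *
 * @param string[]      $urls
 * @param callable      $fetch function (string $url): string
 * @param float         $intervalSeconds minimum gap between request starts
 * @param callable|null $sleep function (float $seconds): void, replaceable in tests
 * @return array<string,string> response bodies keyed by URL
 */
function fetchThrottled(array $urls, callable $fetch, float $intervalSeconds = 1.0, ?callable $sleep = null): array
{
    // Default sleeper: block for the requested number of seconds.
    $sleep = $sleep ?? static function (float $seconds): void {
        usleep((int) round($seconds * 1000000));
    };

    $results = [];
    $lastStart = null;
    foreach ($urls as $url) {
        if ($lastStart !== null) {
            // Wait out whatever is left of the interval since the last request began.
            $elapsed = microtime(true) - $lastStart;
            if ($elapsed < $intervalSeconds) {
                $sleep($intervalSeconds - $elapsed);
            }
        }
        $lastStart = microtime(true);
        $results[$url] = $fetch($url);
    }
    return $results;
}
```

For a quick trial, `fetchThrottled($pasteUrls, 'file_get_contents')` would work, where `$pasteUrls` is a placeholder for whatever list of paste URLs you have collected; a cURL-based fetcher can be swapped in without changing the loop.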
And because of pastebin.com's rate limits, there is no need to do this asynchronously with multiple connections. It's possible, but not worth the hassle; if you actually needed to, you'd have to use the curl_multi API.
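For completeness, this is roughly what the curl_multi route looks like. It is a minimal sketch, not something tuned for Pastebin: `fetchAll` is a hypothetical name, there is no proxy or retry handling, and it requires the cURL extension.

```php
<?php
/**
 * Download several URLs concurrently with the curl_multi API.
 *
 * @param array<int|string,string> $urls
 * @return array<int|string,string|null> bodies keyed like $urls
 */
function fetchAll(array $urls, int $timeoutSeconds = 30): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            // select() had nothing to wait on; back off briefly to avoid a busy loop.
            usleep(100000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $key => $ch) {
        // A failed transfer yields null or an empty string here; real code
        // should inspect curl_multi_info_read() for per-handle errors.
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $bodies;
}
```

Note that this fires every request at once, which is exactly the pattern that gets an IP banned on a rate-limited site, so it only makes sense against hosts that tolerate parallel connections.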
Thanks for the help, ended up buying PRO in the end :)
– Mark Adewale
Nov 24 '18 at 11:58
answered Nov 24 '18 at 11:43
hanshenrik