How can I make this PHP script run faster/asynchronously?












I have a pastebin scraper script, which is designed to find leaked emails and passwords, to make a website like HaveIBeenPwned.



Here is what my script is doing:

- Scraping Pastebin links from https://psbdmp.ws/dumps

- Getting a random proxy using this Random Proxy API (because Pastebin bans your IP if you hammer too many requests): https://api.getproxylist.com/proxy

- Doing a cURL request to each Pastebin link, then running preg_match_all to find all the email addresses and passwords in the format email:password.


The script itself seems to work, but it isn't optimized enough: after some time it gives me a 524 timeout error, which I suspect is caused by all those cURL requests.

Here is my code:
api.php



    function comboScrape_CURL($url) {
        // Get random proxy (note: this is one extra HTTP request per paste)
        $json = file_get_contents("https://api.getproxylist.com/proxy");
        $decoded = json_decode($json);
        $proxy = $decoded->ip . ':' . $decoded->port;

        // Crawl with proxy
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_HEADER, 1);
        $curl_scraped_page = curl_exec($ch);
        curl_close($ch);
        comboScrape('email:pass', $curl_scraped_page);
    }


index.php



    require('api.php');
    // Match Pastebin links; the slashes, the dot and \w need their backslashes
    $expression = "/(?:https:\/\/pastebin\.com\/\w+)/";

    $extension = ['','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20'];
    foreach ($extension as $pge_number) {
        $dumps = file_get_contents("https://psbdmp.ws/dumps/" . $pge_number);
        preg_match_all($expression, $dumps, $urls);
        $codes = str_replace('https://pastebin.com/', '', $urls[0]);
        foreach ($codes as $code) {
            comboScrape_CURL("https://pastebin.com/raw/" . $code);
        }
    }









  • "because Pastebin bans your IP if you hammer too many requests": PS, you can buy a Pastebin.com Pro account, which gives you access to a scraping API that isn't limited.

    – hanshenrik
    Nov 24 '18 at 11:30













  • I've already tried opening Pastebin with a proxy, and it works, but as soon as I try using the proxy API it stops.

    – Mark Adewale
    Nov 24 '18 at 11:31











  • If you buy a pastebin.com Pro account, you get access to the scraping API ( pastebin.com/doc_scraping_api ), which can fetch about 86400 pastes per day without getting IP banned.

    – hanshenrik
    Nov 24 '18 at 11:34













  • Thanks, @hanshenrik, ended up buying PRO in the end :)

    – Mark Adewale
    Nov 24 '18 at 11:59
















php regex curl php-curl pastebin






asked Nov 24 '18 at 11:20









Mark Adewale

1 Answer
"524 timeout error": it seems you're running PHP behind a web server (Apache? nginx? lighttpd? IIS?). Don't do that. Run your code from php-cli instead: php-cli has no execution time limit by default, so the script can run for as long as it needs without timing out.
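As a rough sketch of that advice, a script can refuse to run under a web SAPI so that it is only ever started from the command line. The helper name `isCli` is made up for illustration, and the `php index.php` invocation assumes the entry script is the `index.php` from the question.

```php
<?php
// Hypothetical helper: true only for the command-line SAPI.
function isCli(string $sapi): bool
{
    return $sapi === 'cli';
}

// Under a web server, bail out early instead of running into proxy
// or server timeouts half-way through the job.
if (!isCli(PHP_SAPI)) {
    http_response_code(403);
    exit("Run this script from the command line: php index.php\n");
}

// The CLI default for max_execution_time is already 0 (no limit);
// setting it explicitly guards against a customised php.ini.
set_time_limit(0);
```

Started with `php index.php`, the script then runs outside the web server's request timeout entirely.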



"because Pastebin bans your IP if you hammer too many requests": buy a pastebin.com Pro account instead ( https://pastebin.com/pro ). It costs about $50 (or $20 around Christmas and Black Friday), is a lifetime account with a one-time payment, and gives you access to the scraping API ( https://pastebin.com/doc_scraping_api ). With the scraping API you can fetch about 1 paste per second, or 86400 pastes per day, without getting IP banned.
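To stay at roughly one request per second (60 × 60 × 24 = 86400 per day), a plain sequential loop with a small pause is enough. This is only a sketch: `secondsToWait` and `fetchPaced` are hypothetical helper names, and the `$fetch` callable stands in for whatever function actually downloads a URL; none of it comes from Pastebin's documentation.

```php
<?php
// Hypothetical helper: how long to wait so that calls are at least
// $minInterval seconds apart. Returns 0.0 when enough time has passed.
function secondsToWait(float $lastCall, float $now, float $minInterval = 1.0): float
{
    return max(0.0, $minInterval - ($now - $lastCall));
}

// Call $fetch for each URL in turn, never faster than one call per
// $minInterval seconds. Returns an array of url => result.
function fetchPaced(array $urls, callable $fetch, float $minInterval = 1.0): array
{
    $results = [];
    $lastCall = 0.0;
    foreach ($urls as $url) {
        $wait = secondsToWait($lastCall, microtime(true), $minInterval);
        if ($wait > 0) {
            usleep((int) round($wait * 1000000));
        }
        $lastCall = microtime(true);
        $results[$url] = $fetch($url);
    }
    return $results;
}
```

Because the pause is computed from the time of the previous call, a slow download simply shortens the next wait instead of adding to it.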



And because of pastebin.com's rate limits, there is no need to do this asynchronously with multiple connections. It's possible, but not worth the hassle; if you actually needed to do it, you'd have to use the curl_multi API.
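For completeness, here is a minimal sketch of what the curl_multi route looks like, in case a different site without such rate limits ever calls for it. The function name `fetchAll`, the 10-second timeout and the "HTTP 200 means success" rule are assumptions chosen for illustration.

```php
<?php
// Fetch several URLs concurrently. Returns url => body, or url => false
// when the transfer did not end with HTTP 200.
function fetchAll(array $urls, int $timeout = 10): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none of them is still running.
    $running = 0;
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            // Block until there is socket activity instead of busy-looping.
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $ok = curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
        $results[$url] = $ok ? curl_multi_getcontent($ch) : false;
        curl_multi_remove_handle($mh, $ch);
    }
    // The handles are freed automatically once they go out of scope.
    return $results;
}
```

A call such as `fetchAll(['https://example.com/a', 'https://example.com/b'])` (placeholder URLs) returns both bodies after the slower of the two transfers finishes, rather than after the sum of both.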






  • Thanks for the help, ended up buying PRO in the end :)

    – Mark Adewale
    Nov 24 '18 at 11:58











answered Nov 24 '18 at 11:43









hanshenrik
