Check if MD5 value exists in an index file
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
I am trying to figure out a way to verifying if my code can cross-verify the existence of a url string's md5 conversion value in an index file and if yes skip the scan.
Below is my code
The url formed is converted to md5 string and then stored in a idx file once scan completes, the goal is future scans should not pickup the same url. The issue I see is if str(md5url) in line
is not getting executed, probably because am not using 'n' as a suffix while adding the hash to the file. But I tried that its still not working.
Any ideas?
def computeMD5hash(string_for_hash):
m = hashlib.md5()
m.update(string_for_hash.encode('utf-8'))
return m.hexdigest()
def writefilehash(formation_URL):
fn="urlindex.idx"
try:
afile = open(fn, 'a')
afile.write(computeMD5hash(formation_URL))
afile.close()
except IOError:
print("Error writing to the index file")
fn="urlindex.idx"
try:
afile = open(fn, 'r')
except IOError:
afile = open(fn, 'w')
for f in files:
formation=repouri + "/" + f
#print(computeMD5hash(formation))
md5url=computeMD5hash(formation)
hashlist = afile.readlines()
for line in hashlist:
if str(md5url) in line:
print ("Skipping " + formation + " because its already scanned and indexed as " + line)
else:
if downloadengine(formation):
print ("Download completed " + formation)
print ("Starting to write to database..")
#writetodatabase()
print ("Writing hash value ..")
writefilehash(formation)
print("Closing..")
afile.close()
python md5 scanning
add a comment |
I am trying to figure out a way to verifying if my code can cross-verify the existence of a url string's md5 conversion value in an index file and if yes skip the scan.
Below is my code
The url formed is converted to md5 string and then stored in a idx file once scan completes, the goal is future scans should not pickup the same url. The issue I see is if str(md5url) in line
is not getting executed, probably because am not using 'n' as a suffix while adding the hash to the file. But I tried that its still not working.
Any ideas?
def computeMD5hash(string_for_hash):
m = hashlib.md5()
m.update(string_for_hash.encode('utf-8'))
return m.hexdigest()
def writefilehash(formation_URL):
fn="urlindex.idx"
try:
afile = open(fn, 'a')
afile.write(computeMD5hash(formation_URL))
afile.close()
except IOError:
print("Error writing to the index file")
fn="urlindex.idx"
try:
afile = open(fn, 'r')
except IOError:
afile = open(fn, 'w')
for f in files:
formation=repouri + "/" + f
#print(computeMD5hash(formation))
md5url=computeMD5hash(formation)
hashlist = afile.readlines()
for line in hashlist:
if str(md5url) in line:
print ("Skipping " + formation + " because its already scanned and indexed as " + line)
else:
if downloadengine(formation):
print ("Download completed " + formation)
print ("Starting to write to database..")
#writetodatabase()
print ("Writing hash value ..")
writefilehash(formation)
print("Closing..")
afile.close()
python md5 scanning
You are writing all hashes on one long line into the file. Themd5url in line
check will still succeed however, newlines do not play a role there,in
tests for containment, not equality.
– Martijn Pieters♦
Nov 26 '18 at 19:43
add a comment |
I am trying to figure out a way to verifying if my code can cross-verify the existence of a url string's md5 conversion value in an index file and if yes skip the scan.
Below is my code
The url formed is converted to md5 string and then stored in a idx file once scan completes, the goal is future scans should not pickup the same url. The issue I see is if str(md5url) in line
is not getting executed, probably because am not using 'n' as a suffix while adding the hash to the file. But I tried that its still not working.
Any ideas?
def computeMD5hash(string_for_hash):
m = hashlib.md5()
m.update(string_for_hash.encode('utf-8'))
return m.hexdigest()
def writefilehash(formation_URL):
fn="urlindex.idx"
try:
afile = open(fn, 'a')
afile.write(computeMD5hash(formation_URL))
afile.close()
except IOError:
print("Error writing to the index file")
fn="urlindex.idx"
try:
afile = open(fn, 'r')
except IOError:
afile = open(fn, 'w')
for f in files:
formation=repouri + "/" + f
#print(computeMD5hash(formation))
md5url=computeMD5hash(formation)
hashlist = afile.readlines()
for line in hashlist:
if str(md5url) in line:
print ("Skipping " + formation + " because its already scanned and indexed as " + line)
else:
if downloadengine(formation):
print ("Download completed " + formation)
print ("Starting to write to database..")
#writetodatabase()
print ("Writing hash value ..")
writefilehash(formation)
print("Closing..")
afile.close()
python md5 scanning
I am trying to figure out a way to verifying if my code can cross-verify the existence of a url string's md5 conversion value in an index file and if yes skip the scan.
Below is my code
The url formed is converted to md5 string and then stored in a idx file once scan completes, the goal is future scans should not pickup the same url. The issue I see is if str(md5url) in line
is not getting executed, probably because am not using 'n' as a suffix while adding the hash to the file. But I tried that its still not working.
Any ideas?
def computeMD5hash(string_for_hash):
m = hashlib.md5()
m.update(string_for_hash.encode('utf-8'))
return m.hexdigest()
def writefilehash(formation_URL):
fn="urlindex.idx"
try:
afile = open(fn, 'a')
afile.write(computeMD5hash(formation_URL))
afile.close()
except IOError:
print("Error writing to the index file")
fn="urlindex.idx"
try:
afile = open(fn, 'r')
except IOError:
afile = open(fn, 'w')
for f in files:
formation=repouri + "/" + f
#print(computeMD5hash(formation))
md5url=computeMD5hash(formation)
hashlist = afile.readlines()
for line in hashlist:
if str(md5url) in line:
print ("Skipping " + formation + " because its already scanned and indexed as " + line)
else:
if downloadengine(formation):
print ("Download completed " + formation)
print ("Starting to write to database..")
#writetodatabase()
print ("Writing hash value ..")
writefilehash(formation)
print("Closing..")
afile.close()
python md5 scanning
python md5 scanning
edited Nov 26 '18 at 19:41
Martijn Pieters♦
726k14325492349
726k14325492349
asked Nov 26 '18 at 19:37
SudheejSudheej
73531334
73531334
You are writing all hashes on one long line into the file. Themd5url in line
check will still succeed however, newlines do not play a role there,in
tests for containment, not equality.
– Martijn Pieters♦
Nov 26 '18 at 19:43
add a comment |
You are writing all hashes on one long line into the file. Themd5url in line
check will still succeed however, newlines do not play a role there,in
tests for containment, not equality.
– Martijn Pieters♦
Nov 26 '18 at 19:43
You are writing all hashes on one long line into the file. The
md5url in line
check will still succeed however, newlines do not play a role there, in
tests for containment, not equality.– Martijn Pieters♦
Nov 26 '18 at 19:43
You are writing all hashes on one long line into the file. The
md5url in line
check will still succeed however, newlines do not play a role there, in
tests for containment, not equality.– Martijn Pieters♦
Nov 26 '18 at 19:43
add a comment |
1 Answer
1
active
oldest
votes
You are testing in a loop. For every line that doesn't match, you download:
line1
if hash in line:
print something
else
download
line2
if hash in line:
print something
else
download
line3
if hash in line:
print something
else
download
If the hash is in line 1, then you still download, because the hash is not in line 2 or line 3. You should not decide to download until you tested all lines.
The best way to do this is to read all the hashes in one go, into a set object (because testing for containment against a set is faster). Remove the line separators:
try:
with open(fn) as hashfile:
hashes = {line.strip() for line in hashfile}
except IOError:
# no file yet, just use an empty set
hashes = set()
then when testing new hashes use:
urlhash = computeMD5hash(formation)
if urlhash not in hashes:
# not seen before, download
# record the hash
hashes.add(urlhash)
with open(fn, 'a') as hashfile:
hashfile.write(urlhash + 'n')
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53487884%2fcheck-if-md5-value-exists-in-an-index-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
You are testing in a loop. For every line that doesn't match, you download:
line1
if hash in line:
print something
else
download
line2
if hash in line:
print something
else
download
line3
if hash in line:
print something
else
download
If the hash is in line 1, then you still download, because the hash is not in line 2 or line 3. You should not decide to download until you tested all lines.
The best way to do this is to read all the hashes in one go, into a set object (because testing for containment against a set is faster). Remove the line separators:
try:
with open(fn) as hashfile:
hashes = {line.strip() for line in hashfile}
except IOError:
# no file yet, just use an empty set
hashes = set()
then when testing new hashes use:
urlhash = computeMD5hash(formation)
if urlhash not in hashes:
# not seen before, download
# record the hash
hashes.add(urlhash)
with open(fn, 'a') as hashfile:
hashfile.write(urlhash + 'n')
add a comment |
You are testing in a loop. For every line that doesn't match, you download:
line1
if hash in line:
print something
else
download
line2
if hash in line:
print something
else
download
line3
if hash in line:
print something
else
download
If the hash is in line 1, then you still download, because the hash is not in line 2 or line 3. You should not decide to download until you tested all lines.
The best way to do this is to read all the hashes in one go, into a set object (because testing for containment against a set is faster). Remove the line separators:
try:
with open(fn) as hashfile:
hashes = {line.strip() for line in hashfile}
except IOError:
# no file yet, just use an empty set
hashes = set()
then when testing new hashes use:
urlhash = computeMD5hash(formation)
if urlhash not in hashes:
# not seen before, download
# record the hash
hashes.add(urlhash)
with open(fn, 'a') as hashfile:
hashfile.write(urlhash + 'n')
add a comment |
You are testing in a loop. For every line that doesn't match, you download:
line1
if hash in line:
print something
else
download
line2
if hash in line:
print something
else
download
line3
if hash in line:
print something
else
download
If the hash is in line 1, then you still download, because the hash is not in line 2 or line 3. You should not decide to download until you tested all lines.
The best way to do this is to read all the hashes in one go, into a set object (because testing for containment against a set is faster). Remove the line separators:
try:
with open(fn) as hashfile:
hashes = {line.strip() for line in hashfile}
except IOError:
# no file yet, just use an empty set
hashes = set()
then when testing new hashes use:
urlhash = computeMD5hash(formation)
if urlhash not in hashes:
# not seen before, download
# record the hash
hashes.add(urlhash)
with open(fn, 'a') as hashfile:
hashfile.write(urlhash + 'n')
You are testing in a loop. For every line that doesn't match, you download:
line1
if hash in line:
print something
else
download
line2
if hash in line:
print something
else
download
line3
if hash in line:
print something
else
download
If the hash is in line 1, then you still download, because the hash is not in line 2 or line 3. You should not decide to download until you tested all lines.
The best way to do this is to read all the hashes in one go, into a set object (because testing for containment against a set is faster). Remove the line separators:
try:
with open(fn) as hashfile:
hashes = {line.strip() for line in hashfile}
except IOError:
# no file yet, just use an empty set
hashes = set()
then when testing new hashes use:
urlhash = computeMD5hash(formation)
if urlhash not in hashes:
# not seen before, download
# record the hash
hashes.add(urlhash)
with open(fn, 'a') as hashfile:
hashfile.write(urlhash + 'n')
edited Nov 26 '18 at 20:11
answered Nov 26 '18 at 19:50
Martijn Pieters♦Martijn Pieters
726k14325492349
726k14325492349
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53487884%2fcheck-if-md5-value-exists-in-an-index-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
You are writing all hashes on one long line into the file. The
md5url in line
check will still succeed however, newlines do not play a role there,in
tests for containment, not equality.– Martijn Pieters♦
Nov 26 '18 at 19:43