Check if MD5 value exists in an index file




















I am trying to check whether the MD5 hash of a URL string already exists in an index file and, if it does, skip the scan.

Below is my code.

Each URL is converted to an MD5 string and stored in an .idx file once its scan completes; the goal is that future scans should not pick up the same URL. The issue I see is that the branch under if str(md5url) in line is not getting executed, probably because I am not using '\n' as a suffix when adding the hash to the file. But I tried that and it's still not working.

Any ideas?



def computeMD5hash(string_for_hash):
    m = hashlib.md5()
    m.update(string_for_hash.encode('utf-8'))
    return m.hexdigest()


def writefilehash(formation_URL):
    fn="urlindex.idx"
    try:
        afile = open(fn, 'a')
        afile.write(computeMD5hash(formation_URL))
        afile.close()
    except IOError:
        print("Error writing to the index file")

fn="urlindex.idx"
try:
    afile = open(fn, 'r')
except IOError:
    afile = open(fn, 'w')

for f in files:
    formation=repouri + "/" + f
    #print(computeMD5hash(formation))
    md5url=computeMD5hash(formation)
    hashlist = afile.readlines()
    for line in hashlist:
        if str(md5url) in line:
            print ("Skipping " + formation + " because its already scanned and indexed as " + line)
        else:
            if downloadengine(formation):
                print ("Download completed " + formation)
                print ("Starting to write to database..")
                #writetodatabase()
                print ("Writing hash value ..")
                writefilehash(formation)

print("Closing..")
afile.close()









  • You are writing all hashes on one long line into the file. The md5url in line check will still succeed, however; newlines play no role there, because it tests for containment, not equality.

    – Martijn Pieters
    Nov 26 '18 at 19:43
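
A quick way to check the point made in this comment is the small sketch below. The two URLs are made up for illustration; the only thing it relies on is that `in` on strings is a substring test.

```python
import hashlib

# Two made-up URLs, hashed the same way as in the question.
hash_a = hashlib.md5(b"http://example.com/a").hexdigest()
hash_b = hashlib.md5(b"http://example.com/b").hexdigest()

# Without a newline suffix, both hashes end up on one long line.
one_long_line = hash_a + hash_b

contains = hash_b in one_long_line  # substring (containment) test
equals = hash_b == one_long_line    # equality test
print(contains, equals)  # True False
```

So a missing newline alone would not stop the containment check from matching.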




















python md5 scanning














edited Nov 26 '18 at 19:41
Martijn Pieters

asked Nov 26 '18 at 19:37
Sudheej



























1 Answer














You are testing in a loop. For every line that doesn't match, you download:



line1
    if hash in line:
        print something
    else
        download
line2
    if hash in line:
        print something
    else
        download
line3
    if hash in line:
        print something
    else
        download


If the hash is in line 1, then you still download, because the hash is not in line 2 or line 3. You should not decide to download until you have tested all lines.
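
To make the difference concrete, here is a minimal sketch. The three-line index and the function names `downloads_with_per_line_decision` and `should_download` are invented for illustration and are not part of the original code.

```python
# Hypothetical index contents: three stored hashes, one per line.
lines = ["aaa\n", "bbb\n", "ccc\n"]
urlhash = "aaa"  # already indexed on the first line


def downloads_with_per_line_decision(urlhash, lines):
    """Mimic the question's loop: decide separately for every line."""
    downloads = 0
    for line in lines:
        if urlhash in line:
            pass  # the "Skipping ..." message would be printed here
        else:
            downloads += 1  # a download is triggered for this non-matching line
    return downloads


def should_download(urlhash, lines):
    """Decide once, only after every line has been checked."""
    return not any(urlhash in line for line in lines)


print(downloads_with_per_line_decision(urlhash, lines))  # 2
print(should_download(urlhash, lines))  # False
```

The per-line version triggers two downloads for a URL that is already indexed, while the single decision after checking all lines correctly skips it.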



The best way to do this is to read all the hashes in one go, into a set object (because testing for containment against a set is faster). Remove the line separators:



try:
    with open(fn) as hashfile:
        hashes = {line.strip() for line in hashfile}
except IOError:
    # no file yet, just use an empty set
    hashes = set()


Then, when testing new hashes, use:



urlhash = computeMD5hash(formation)
if urlhash not in hashes:
    # not seen before, download
    # record the hash
    hashes.add(urlhash)
    with open(fn, 'a') as hashfile:
        hashfile.write(urlhash + '\n')
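
Putting the two pieces together, a self-contained sketch might look like the following. The names `compute_md5_hash`, `load_hashes` and `scan`, the `download` callback, the example URLs and the temporary index path are all assumptions made for illustration; the real `downloadengine` from the question is replaced by a stand-in that always succeeds.

```python
import hashlib
import os
import tempfile


def compute_md5_hash(text):
    """Return the hex MD5 digest of a string (same idea as computeMD5hash)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def load_hashes(index_path):
    """Read all stored hashes into a set; a missing file means an empty set."""
    try:
        with open(index_path) as hashfile:
            return {line.strip() for line in hashfile if line.strip()}
    except IOError:
        return set()


def scan(urls, index_path, download):
    """Download each URL not yet indexed, record its hash, return the downloaded URLs."""
    hashes = load_hashes(index_path)
    downloaded = []
    for url in urls:
        urlhash = compute_md5_hash(url)
        if urlhash in hashes:
            continue  # already scanned and indexed
        if download(url):
            downloaded.append(url)
            hashes.add(urlhash)
            # one hash per line, so the index can be read back line by line
            with open(index_path, "a") as hashfile:
                hashfile.write(urlhash + "\n")
    return downloaded


# Usage with a stand-in downloader that always succeeds (hypothetical).
index_path = os.path.join(tempfile.mkdtemp(), "urlindex.idx")
urls = ["http://example.com/a", "http://example.com/b"]
first_run = scan(urls, index_path, download=lambda url: True)
second_run = scan(urls, index_path, download=lambda url: True)
print(first_run)   # both URLs are downloaded on the first run
print(second_run)  # [] because both hashes are now in the index
```

Passing the downloader in as a parameter is only a convenience for trying the sketch without network access; in the original code that role is played by downloadengine.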





edited Nov 26 '18 at 20:11

answered Nov 26 '18 at 19:50
Martijn Pieters































