String manipulation: calculate the “similarity of a string with its suffixes”












5















For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.



The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.










share|improve this question




















  • 3





    So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

    – bobbymcr
    Dec 15 '11 at 19:49











  • Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

    – bicepjai
    Aug 15 '12 at 9:16











  • Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

    – tripleee
    Aug 15 '12 at 9:23
















5















For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.



The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.










share|improve this question




















  • 3





    So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

    – bobbymcr
    Dec 15 '11 at 19:49











  • Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

    – bicepjai
    Aug 15 '12 at 9:16











  • Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

    – tripleee
    Aug 15 '12 at 9:23














5












5








5


3






For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.



The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.










share|improve this question
















For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.



The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.







string algorithm






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 15 '11 at 19:46









Dan Fego

10.4k23653




10.4k23653










asked Dec 15 '11 at 19:44









Zhang FengZhang Feng

6216




6216








  • 3





    So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

    – bobbymcr
    Dec 15 '11 at 19:49











  • Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

    – bicepjai
    Aug 15 '12 at 9:16











  • Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

    – tripleee
    Aug 15 '12 at 9:23














  • 3





    So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

    – bobbymcr
    Dec 15 '11 at 19:49











  • Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

    – bicepjai
    Aug 15 '12 at 9:16











  • Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

    – tripleee
    Aug 15 '12 at 9:23








3




3





So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

– bobbymcr
Dec 15 '11 at 19:49





So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

– bobbymcr
Dec 15 '11 at 19:49













Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

– bicepjai
Aug 15 '12 at 9:16





Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

– bicepjai
Aug 15 '12 at 9:16













Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

– tripleee
Aug 15 '12 at 9:23





Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

– tripleee
Aug 15 '12 at 9:23












2 Answers
2






active

oldest

votes


















9














You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.



EXAMPLE: Your string is ababaa, so the suffix array looks like this:



5 | a
4 | aa
2 | abaa
0 | ababaa
3 | baa
1 | babaa


where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.



As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.






share|improve this answer


























  • :thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

    – Zhang Feng
    Dec 15 '11 at 20:04











  • @PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

    – Avinash
    Jan 3 '12 at 4:00






  • 2





    @PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

    – coding_pleasures
    May 19 '13 at 8:56













  • @PengOne What to do after making suffix array or even LCP?

    – shiva
    Oct 5 '16 at 20:39





















0














Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:



def z_func(s):
z = [0]*len(s)
l, r = 0, 0
for i in range(1,len(s)):
if i<=r:
z[i] = min(r-i+1, z[i-l])
while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
z[i] += 1
if z[i]+i-1 > r:
l, r = i, z[i]+i-1
return sum(z)+len(s)





share|improve this answer

























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f8525692%2fstring-manipulation-calculate-the-similarity-of-a-string-with-its-suffixes%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    9














    You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.



    EXAMPLE: Your string is ababaa, so the suffix array looks like this:



    5 | a
    4 | aa
    2 | abaa
    0 | ababaa
    3 | baa
    1 | babaa


    where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.



    As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.






    share|improve this answer


























    • :thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

      – Zhang Feng
      Dec 15 '11 at 20:04











    • @PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

      – Avinash
      Jan 3 '12 at 4:00






    • 2





      @PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

      – coding_pleasures
      May 19 '13 at 8:56













    • @PengOne What to do after making suffix array or even LCP?

      – shiva
      Oct 5 '16 at 20:39


















    9














    You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.



    EXAMPLE: Your string is ababaa, so the suffix array looks like this:



    5 | a
    4 | aa
    2 | abaa
    0 | ababaa
    3 | baa
    1 | babaa


    where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.



    As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.






    share|improve this answer


























    • :thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

      – Zhang Feng
      Dec 15 '11 at 20:04











    • @PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

      – Avinash
      Jan 3 '12 at 4:00






    • 2





      @PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

      – coding_pleasures
      May 19 '13 at 8:56













    • @PengOne What to do after making suffix array or even LCP?

      – shiva
      Oct 5 '16 at 20:39
















    9












    9








    9







    You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.



    EXAMPLE: Your string is ababaa, so the suffix array looks like this:



    5 | a
    4 | aa
    2 | abaa
    0 | ababaa
    3 | baa
    1 | babaa


    where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.



    As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.






    share|improve this answer















    You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.



    EXAMPLE: Your string is ababaa, so the suffix array looks like this:



    5 | a
    4 | aa
    2 | abaa
    0 | ababaa
    3 | baa
    1 | babaa


    where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.



    As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Dec 15 '11 at 20:03

























    answered Dec 15 '11 at 19:49









    PengOnePengOne

    44.1k16115145




    44.1k16115145













    • :thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

      – Zhang Feng
      Dec 15 '11 at 20:04











    • @PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

      – Avinash
      Jan 3 '12 at 4:00






    • 2





      @PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

      – coding_pleasures
      May 19 '13 at 8:56













    • @PengOne What to do after making suffix array or even LCP?

      – shiva
      Oct 5 '16 at 20:39





















    • :thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

      – Zhang Feng
      Dec 15 '11 at 20:04











    • @PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

      – Avinash
      Jan 3 '12 at 4:00






    • 2





      @PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

      – coding_pleasures
      May 19 '13 at 8:56













    • @PengOne What to do after making suffix array or even LCP?

      – shiva
      Oct 5 '16 at 20:39



















    :thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

    – Zhang Feng
    Dec 15 '11 at 20:04





    :thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

    – Zhang Feng
    Dec 15 '11 at 20:04













    @PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

    – Avinash
    Jan 3 '12 at 4:00





    @PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

    – Avinash
    Jan 3 '12 at 4:00




    2




    2





    @PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

    – coding_pleasures
    May 19 '13 at 8:56







    @PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

    – coding_pleasures
    May 19 '13 at 8:56















    @PengOne What to do after making suffix array or even LCP?

    – shiva
    Oct 5 '16 at 20:39







    @PengOne What to do after making suffix array or even LCP?

    – shiva
    Oct 5 '16 at 20:39















    0














    Read this link about z-algorithm first.
    An O(n) solution based on the algorithm from the link implemented on python:



    def z_func(s):
    z = [0]*len(s)
    l, r = 0, 0
    for i in range(1,len(s)):
    if i<=r:
    z[i] = min(r-i+1, z[i-l])
    while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
    z[i] += 1
    if z[i]+i-1 > r:
    l, r = i, z[i]+i-1
    return sum(z)+len(s)





    share|improve this answer






























      0














      Read this link about z-algorithm first.
      An O(n) solution based on the algorithm from the link implemented on python:



      def z_func(s):
      z = [0]*len(s)
      l, r = 0, 0
      for i in range(1,len(s)):
      if i<=r:
      z[i] = min(r-i+1, z[i-l])
      while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
      z[i] += 1
      if z[i]+i-1 > r:
      l, r = i, z[i]+i-1
      return sum(z)+len(s)





      share|improve this answer




























        0












        0








        0







        Read this link about z-algorithm first.
        An O(n) solution based on the algorithm from the link implemented on python:



        def z_func(s):
        z = [0]*len(s)
        l, r = 0, 0
        for i in range(1,len(s)):
        if i<=r:
        z[i] = min(r-i+1, z[i-l])
        while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
        z[i] += 1
        if z[i]+i-1 > r:
        l, r = i, z[i]+i-1
        return sum(z)+len(s)





        share|improve this answer















        Read this link about z-algorithm first.
        An O(n) solution based on the algorithm from the link implemented on python:



        def z_func(s):
        z = [0]*len(s)
        l, r = 0, 0
        for i in range(1,len(s)):
        if i<=r:
        z[i] = min(r-i+1, z[i-l])
        while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
        z[i] += 1
        if z[i]+i-1 > r:
        l, r = i, z[i]+i-1
        return sum(z)+len(s)






        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Nov 25 '18 at 23:45

























        answered Nov 25 '18 at 5:05









        naimettinaimetti

        364




        364






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f8525692%2fstring-manipulation-calculate-the-similarity-of-a-string-with-its-suffixes%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Tonle Sap (See)

            I get strange results when I access the Sqlitedatabase with Unity C# via XAMPP

            Guatemaltekische Davis-Cup-Mannschaft