String manipulation: calculate the “similarity of a string with its suffixes”

For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.

The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.

edited Dec 15 '11 at 19:46

Dan Fego

10.4k23653

asked Dec 15 '11 at 19:44

Zhang Feng

6216

3

So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

– bobbymcr
Dec 15 '11 at 19:49

Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

– bicepjai
Aug 15 '12 at 9:16

Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

– tripleee
Aug 15 '12 at 9:23

add a comment |

edited Dec 15 '11 at 19:46

Dan Fego

10.4k23653

asked Dec 15 '11 at 19:44

Zhang Feng

6216

3

So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

– bobbymcr
Dec 15 '11 at 19:49

Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

– bicepjai
Aug 15 '12 at 9:16

Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

– tripleee
Aug 15 '12 at 9:23

add a comment |

edited Dec 15 '11 at 19:46

Dan Fego

10.4k23653

asked Dec 15 '11 at 19:44

Zhang Feng

6216

string algorithm

edited Dec 15 '11 at 19:46

Dan Fego

10.4k23653

asked Dec 15 '11 at 19:44

Zhang Feng

6216

edited Dec 15 '11 at 19:46

Dan Fego

10.4k23653

asked Dec 15 '11 at 19:44

Zhang Feng

6216

edited Dec 15 '11 at 19:46

Dan Fego

10.4k23653

edited Dec 15 '11 at 19:46

Dan Fego

10.4k23653

edited Dec 15 '11 at 19:46

Dan Fego

10.4k23653

asked Dec 15 '11 at 19:44

Zhang Feng

6216

asked Dec 15 '11 at 19:44

Zhang Feng

6216

asked Dec 15 '11 at 19:44

Zhang Feng

6216

3

So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

– bobbymcr
Dec 15 '11 at 19:49

Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

– bicepjai
Aug 15 '12 at 9:16

Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

– tripleee
Aug 15 '12 at 9:23

add a comment |

3

So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

– bobbymcr
Dec 15 '11 at 19:49

Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

– bicepjai
Aug 15 '12 at 9:16

Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

– tripleee
Aug 15 '12 at 9:23

So did you try to solve it? Where did you get stuck? What kind of help are you looking for?

– bobbymcr
Dec 15 '11 at 19:49

Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993

– bicepjai
Aug 15 '12 at 9:16

Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.

– tripleee
Aug 15 '12 at 9:23

add a comment |

2 Answers
2

active

oldest

votes

You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.

EXAMPLE: Your string is ababaa, so the suffix array looks like this:

5 | a

4 | aa

2 | abaa

0 | ababaa

3 | baa

1 | babaa

where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.

As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.

edited Dec 15 '11 at 20:03

answered Dec 15 '11 at 19:49

PengOne

44.1k16115145

:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

– Zhang Feng
Dec 15 '11 at 20:04

@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

– Avinash
Jan 3 '12 at 4:00

2

@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

– coding_pleasures
May 19 '13 at 8:56

@PengOne What to do after making suffix array or even LCP?

– shiva
Oct 5 '16 at 20:39

add a comment |

Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:

def z_func(s):

    z = [0]*len(s)

    l, r = 0, 0

    for i in range(1,len(s)):

        if i<=r:

            z[i] = min(r-i+1, z[i-l])

        while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):

            z[i] += 1

        if z[i]+i-1 > r:

            l, r = i, z[i]+i-1

    return sum(z)+len(s)

edited Nov 25 '18 at 23:45

answered Nov 25 '18 at 5:05

naimetti

364

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f8525692%2fstring-manipulation-calculate-the-similarity-of-a-string-with-its-suffixes%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

EXAMPLE: Your string is ababaa, so the suffix array looks like this:

5 | a

4 | aa

2 | abaa

0 | ababaa

3 | baa

1 | babaa

where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.

As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.

edited Dec 15 '11 at 20:03

answered Dec 15 '11 at 19:49

PengOne

44.1k16115145

:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

– Zhang Feng
Dec 15 '11 at 20:04

@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

– Avinash
Jan 3 '12 at 4:00

2

@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

– coding_pleasures
May 19 '13 at 8:56

@PengOne What to do after making suffix array or even LCP?

– shiva
Oct 5 '16 at 20:39

add a comment |

EXAMPLE: Your string is ababaa, so the suffix array looks like this:

5 | a

4 | aa

2 | abaa

0 | ababaa

3 | baa

1 | babaa

where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.

As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.

edited Dec 15 '11 at 20:03

answered Dec 15 '11 at 19:49

PengOne

44.1k16115145

:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

– Zhang Feng
Dec 15 '11 at 20:04

@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

– Avinash
Jan 3 '12 at 4:00

2

@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

– coding_pleasures
May 19 '13 at 8:56

@PengOne What to do after making suffix array or even LCP?

– shiva
Oct 5 '16 at 20:39

add a comment |

EXAMPLE: Your string is ababaa, so the suffix array looks like this:

5 | a

4 | aa

2 | abaa

0 | ababaa

3 | baa

1 | babaa

where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.

As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.

edited Dec 15 '11 at 20:03

answered Dec 15 '11 at 19:49

PengOne

44.1k16115145

EXAMPLE: Your string is ababaa, so the suffix array looks like this:

5 | a

4 | aa

2 | abaa

0 | ababaa

3 | baa

1 | babaa

where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.

As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.

edited Dec 15 '11 at 20:03

answered Dec 15 '11 at 19:49

PengOne

44.1k16115145

edited Dec 15 '11 at 20:03

answered Dec 15 '11 at 19:49

PengOne

44.1k16115145

answered Dec 15 '11 at 19:49

PengOne

44.1k16115145

answered Dec 15 '11 at 19:49

PengOne

44.1k16115145

:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

– Zhang Feng
Dec 15 '11 at 20:04

@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

– Avinash
Jan 3 '12 at 4:00

2

@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

– coding_pleasures
May 19 '13 at 8:56

@PengOne What to do after making suffix array or even LCP?

– shiva
Oct 5 '16 at 20:39

add a comment |

:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

– Zhang Feng
Dec 15 '11 at 20:04

@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

– Avinash
Jan 3 '12 at 4:00

2

@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

– coding_pleasures
May 19 '13 at 8:56

@PengOne What to do after making suffix array or even LCP?

– shiva
Oct 5 '16 at 20:39

:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help

– Zhang Feng
Dec 15 '11 at 20:04

@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.

– Avinash
Jan 3 '12 at 4:00

@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.

– coding_pleasures
May 19 '13 at 8:56

@PengOne What to do after making suffix array or even LCP?

– shiva
Oct 5 '16 at 20:39

add a comment |

Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:

def z_func(s):

    z = [0]*len(s)

    l, r = 0, 0

    for i in range(1,len(s)):

        if i<=r:

            z[i] = min(r-i+1, z[i-l])

        while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):

            z[i] += 1

        if z[i]+i-1 > r:

            l, r = i, z[i]+i-1

    return sum(z)+len(s)

edited Nov 25 '18 at 23:45

answered Nov 25 '18 at 5:05

naimetti

364

add a comment |

Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:

def z_func(s):

    z = [0]*len(s)

    l, r = 0, 0

    for i in range(1,len(s)):

        if i<=r:

            z[i] = min(r-i+1, z[i-l])

        while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):

            z[i] += 1

        if z[i]+i-1 > r:

            l, r = i, z[i]+i-1

    return sum(z)+len(s)

edited Nov 25 '18 at 23:45

answered Nov 25 '18 at 5:05

naimetti

364

add a comment |

Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:

def z_func(s):

    z = [0]*len(s)

    l, r = 0, 0

    for i in range(1,len(s)):

        if i<=r:

            z[i] = min(r-i+1, z[i-l])

        while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):

            z[i] += 1

        if z[i]+i-1 > r:

            l, r = i, z[i]+i-1

    return sum(z)+len(s)

edited Nov 25 '18 at 23:45

answered Nov 25 '18 at 5:05

naimetti

364

Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:

def z_func(s):

    z = [0]*len(s)

    l, r = 0, 0

    for i in range(1,len(s)):

        if i<=r:

            z[i] = min(r-i+1, z[i-l])

        while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):

            z[i] += 1

        if z[i]+i-1 > r:

            l, r = i, z[i]+i-1

    return sum(z)+len(s)

edited Nov 25 '18 at 23:45

answered Nov 25 '18 at 5:05

naimetti

364

edited Nov 25 '18 at 23:45

answered Nov 25 '18 at 5:05

naimetti

364

answered Nov 25 '18 at 5:05

naimetti

364

answered Nov 25 '18 at 5:05

naimetti

364

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Ytukyg