String manipulation: calculate the “similarity of a string with its suffixes”
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.
string algorithm
add a comment |
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.
string algorithm
3
So did you try to solve it? Where did you get stuck? What kind of help are you looking for?
– bobbymcr
Dec 15 '11 at 19:49
Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993
– bicepjai
Aug 15 '12 at 9:16
Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.
– tripleee
Aug 15 '12 at 9:23
add a comment |
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.
string algorithm
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
The problem is to give an algorithm to calculate the sum of similarities of a string S with each of it's suffixes. For example, let the string be : ababaa. Then, the suffixes of the string are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string ababaa are 6,0,3,0,1,1, respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.
string algorithm
string algorithm
edited Dec 15 '11 at 19:46
Dan Fego
10.4k23653
10.4k23653
asked Dec 15 '11 at 19:44
Zhang FengZhang Feng
6216
6216
3
So did you try to solve it? Where did you get stuck? What kind of help are you looking for?
– bobbymcr
Dec 15 '11 at 19:49
Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993
– bicepjai
Aug 15 '12 at 9:16
Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.
– tripleee
Aug 15 '12 at 9:23
add a comment |
3
So did you try to solve it? Where did you get stuck? What kind of help are you looking for?
– bobbymcr
Dec 15 '11 at 19:49
Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993
– bicepjai
Aug 15 '12 at 9:16
Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.
– tripleee
Aug 15 '12 at 9:23
3
3
So did you try to solve it? Where did you get stuck? What kind of help are you looking for?
– bobbymcr
Dec 15 '11 at 19:49
So did you try to solve it? Where did you get stuck? What kind of help are you looking for?
– bobbymcr
Dec 15 '11 at 19:49
Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993
– bicepjai
Aug 15 '12 at 9:16
Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993
– bicepjai
Aug 15 '12 at 9:16
Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.
– tripleee
Aug 15 '12 at 9:23
Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.
– tripleee
Aug 15 '12 at 9:23
add a comment |
2 Answers
2
active
oldest
votes
You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.
EXAMPLE: Your string is ababaa, so the suffix array looks like this:
5 | a
4 | aa
2 | abaa
0 | ababaa
3 | baa
1 | babaa
where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.
As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.
:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help
– Zhang Feng
Dec 15 '11 at 20:04
@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.
– Avinash
Jan 3 '12 at 4:00
2
@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.
– coding_pleasures
May 19 '13 at 8:56
@PengOne What to do after making suffix array or even LCP?
– shiva
Oct 5 '16 at 20:39
add a comment |
Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:
def z_func(s):
z = [0]*len(s)
l, r = 0, 0
for i in range(1,len(s)):
if i<=r:
z[i] = min(r-i+1, z[i-l])
while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
z[i] += 1
if z[i]+i-1 > r:
l, r = i, z[i]+i-1
return sum(z)+len(s)
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f8525692%2fstring-manipulation-calculate-the-similarity-of-a-string-with-its-suffixes%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.
EXAMPLE: Your string is ababaa, so the suffix array looks like this:
5 | a
4 | aa
2 | abaa
0 | ababaa
3 | baa
1 | babaa
where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.
As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.
:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help
– Zhang Feng
Dec 15 '11 at 20:04
@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.
– Avinash
Jan 3 '12 at 4:00
2
@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.
– coding_pleasures
May 19 '13 at 8:56
@PengOne What to do after making suffix array or even LCP?
– shiva
Oct 5 '16 at 20:39
add a comment |
You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.
EXAMPLE: Your string is ababaa, so the suffix array looks like this:
5 | a
4 | aa
2 | abaa
0 | ababaa
3 | baa
1 | babaa
where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.
As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.
:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help
– Zhang Feng
Dec 15 '11 at 20:04
@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.
– Avinash
Jan 3 '12 at 4:00
2
@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.
– coding_pleasures
May 19 '13 at 8:56
@PengOne What to do after making suffix array or even LCP?
– shiva
Oct 5 '16 at 20:39
add a comment |
You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.
EXAMPLE: Your string is ababaa, so the suffix array looks like this:
5 | a
4 | aa
2 | abaa
0 | ababaa
3 | baa
1 | babaa
where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.
As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.
You want to consider suffix arrays. The suffix array of a word is the array of the indices of suffixes sorted in lexicographical order. In the linked wikipedia article, the algorithms compute the LCP (longest common prefix) as they compute the suffix array. You can compute this in O(n) using similarities with suffix trees, as shown in this paper.
EXAMPLE: Your string is ababaa, so the suffix array looks like this:
5 | a
4 | aa
2 | abaa
0 | ababaa
3 | baa
1 | babaa
where the number on the left is the index at which the suffix begins. Now it's pretty each to compute prefixes since everything is stored lexicographically.
As a side note, this is closely related to the longest common substring problem. To practice for your next interview, think about ways to solve that efficiently.
edited Dec 15 '11 at 20:03
answered Dec 15 '11 at 19:49
PengOnePengOne
44.1k16115145
44.1k16115145
:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help
– Zhang Feng
Dec 15 '11 at 20:04
@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.
– Avinash
Jan 3 '12 at 4:00
2
@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.
– coding_pleasures
May 19 '13 at 8:56
@PengOne What to do after making suffix array or even LCP?
– shiva
Oct 5 '16 at 20:39
add a comment |
:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help
– Zhang Feng
Dec 15 '11 at 20:04
@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.
– Avinash
Jan 3 '12 at 4:00
2
@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.
– coding_pleasures
May 19 '13 at 8:56
@PengOne What to do after making suffix array or even LCP?
– shiva
Oct 5 '16 at 20:39
:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help
– Zhang Feng
Dec 15 '11 at 20:04
:thanks a lot for this! I do understand what you are saying, but I am not familiar with suffix trees at all, and given the limited amount of time, I wanted to ask you, if you can looking at Knuth morris prat algotihm will help
– Zhang Feng
Dec 15 '11 at 20:04
@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.
– Avinash
Jan 3 '12 at 4:00
@PengOne : I tried building Suffix tree, but the number of the left i.e. LCP is between str[i - 1] and str[i], what they are looking for is between "ababaa" and "a" , "ababa" and "aa" , "ababaa" and "abaa" and so on.. where my solution fails becuase here I am doing linear comparasion, is there any more clue.
– Avinash
Jan 3 '12 at 4:00
2
2
@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.
– coding_pleasures
May 19 '13 at 8:56
@PengOne Could you please explain how to solve this particular problem using suffix arrays? Or what benefit does a lexicographic ordering give to the solution as compared to manually comparing each suffix to the original string.
– coding_pleasures
May 19 '13 at 8:56
@PengOne What to do after making suffix array or even LCP?
– shiva
Oct 5 '16 at 20:39
@PengOne What to do after making suffix array or even LCP?
– shiva
Oct 5 '16 at 20:39
add a comment |
Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:
def z_func(s):
z = [0]*len(s)
l, r = 0, 0
for i in range(1,len(s)):
if i<=r:
z[i] = min(r-i+1, z[i-l])
while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
z[i] += 1
if z[i]+i-1 > r:
l, r = i, z[i]+i-1
return sum(z)+len(s)
add a comment |
Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:
def z_func(s):
z = [0]*len(s)
l, r = 0, 0
for i in range(1,len(s)):
if i<=r:
z[i] = min(r-i+1, z[i-l])
while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
z[i] += 1
if z[i]+i-1 > r:
l, r = i, z[i]+i-1
return sum(z)+len(s)
add a comment |
Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:
def z_func(s):
z = [0]*len(s)
l, r = 0, 0
for i in range(1,len(s)):
if i<=r:
z[i] = min(r-i+1, z[i-l])
while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
z[i] += 1
if z[i]+i-1 > r:
l, r = i, z[i]+i-1
return sum(z)+len(s)
Read this link about z-algorithm first.
An O(n) solution based on the algorithm from the link implemented on python:
def z_func(s):
z = [0]*len(s)
l, r = 0, 0
for i in range(1,len(s)):
if i<=r:
z[i] = min(r-i+1, z[i-l])
while (i + z[i] < len(s) and s[z[i]] == s[i + z[i]]):
z[i] += 1
if z[i]+i-1 > r:
l, r = i, z[i]+i-1
return sum(z)+len(s)
edited Nov 25 '18 at 23:45
answered Nov 25 '18 at 5:05
naimettinaimetti
364
364
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f8525692%2fstring-manipulation-calculate-the-similarity-of-a-string-with-its-suffixes%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
3
So did you try to solve it? Where did you get stuck? What kind of help are you looking for?
– bobbymcr
Dec 15 '11 at 19:49
Implementation of Ukkonen's algorithm to build a prefix tree in O(n) gist.github.com/3355993
– bicepjai
Aug 15 '12 at 9:16
Ukkonen's algorithm is for building a suffix tree, which of course is precisely what this is about. A suffix tree is not exactly the same thing as a suffix array, although they are obviously related.
– tripleee
Aug 15 '12 at 9:23