JavaScript map a regex with multiple matches at the right occurrence
I have an array
of tokens to map, and a regex that gets the begin and end positions of each token within an input sentence. This works ok when the token has one occurrence. When the token has multiple occurrences, the greedy Regex
will get all the matched positions of the token in the text, so the resulting position for the i-th token occurrence will be mapped by the last found position.
By example, given the text
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
the first occurrence of the token down
is mapped to the last position in the text matched by the RegExp
, hence I have:
{
"index": 2,
"word": "down",
"characterOffsetBegin": 70,
"characterOffsetEnd": 73
}
This becomes clear running this example:
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
}
return item;
});
console.log(annotations)
where the first occurrence of the token down
should be the first matching position:
{
"index": 2,
"word": "down",
"characterOffsetBegin": 6,
"characterOffsetEnd": 9
}
So given that I have mapped the tokens position for each occurrence of the token in the text i.e. first occurrence of down
with the first match, the 2nd with the second match etc. I can reconstruct the text accordingly with the charOffsetBegin
and charOffsetEnd
hence doing like:
var newtext = '';
results.sentences.forEach(sentence => {
sentence.tokens.forEach(token => {
newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + ' ';
});
newtext += 'n';
});
javascript regex text-processing
|
show 3 more comments
I have an array
of tokens to map, and a regex that gets the begin and end positions of each token within an input sentence. This works ok when the token has one occurrence. When the token has multiple occurrences, the greedy Regex
will get all the matched positions of the token in the text, so the resulting position for the i-th token occurrence will be mapped by the last found position.
By example, given the text
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
the first occurrence of the token down
is mapped to the last position in the text matched by the RegExp
, hence I have:
{
"index": 2,
"word": "down",
"characterOffsetBegin": 70,
"characterOffsetEnd": 73
}
This becomes clear running this example:
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
}
return item;
});
console.log(annotations)
where the first occurrence of the token down
should be the first matching position:
{
"index": 2,
"word": "down",
"characterOffsetBegin": 6,
"characterOffsetEnd": 9
}
So given that I have mapped the tokens position for each occurrence of the token in the text i.e. first occurrence of down
with the first match, the 2nd with the second match etc. I can reconstruct the text accordingly with the charOffsetBegin
and charOffsetEnd
hence doing like:
var newtext = '';
results.sentences.forEach(sentence => {
sentence.tokens.forEach(token => {
newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + ' ';
});
newtext += 'n';
});
javascript regex text-processing
1
Looks like it's not straight-forward
– radarbob
Nov 21 at 0:33
The problem is not the expression is greedy, the problem is that you are looking for every match via thewhile
loop, so the last one wins. Do you only want to search for the first match? One solution would be to keep track of the previous match position for a token, and ditch thewhile
loop.
– Felix Kling
Nov 21 at 0:33
@FelixKling yes this make senses, but I did not find a way to do it. The previous match also should be by token, i.e. it should be like a token map of matches liketokenMap[ matchIndex ]
, but I'm not sure of it.
– loretoparisi
Nov 21 at 0:42
I had posted an answer, but it's actually unclear to me what the desired result really is. How I understand it: You want to find the position of the occurrence of a token. Your example is a bit confusing though because not only does a token occur multiple times in the input text, you also have duplicate tokens in your token list! So, if tokens are unique, what should the result be if it occurs multiple times in the input text? If tokens are not unique, what should the result be if duplicate tokens only occur once in the input text? And finally, what if duplicate tokens occur multiple times?
– Felix Kling
Nov 21 at 0:49
1
The way you describe it it seems like the result should be: Find the position for first occurrence in the input text. If there are duplicate tokens, the second (,third, forth,...) "copy" of the token should find the second, (,third, forth, ...) occurrence in the input text. Is that correct?
– Felix Kling
Nov 21 at 0:51
|
show 3 more comments
I have an array
of tokens to map, and a regex that gets the begin and end positions of each token within an input sentence. This works ok when the token has one occurrence. When the token has multiple occurrences, the greedy Regex
will get all the matched positions of the token in the text, so the resulting position for the i-th token occurrence will be mapped by the last found position.
By example, given the text
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
the first occurrence of the token down
is mapped to the last position in the text matched by the RegExp
, hence I have:
{
"index": 2,
"word": "down",
"characterOffsetBegin": 70,
"characterOffsetEnd": 73
}
This becomes clear running this example:
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
}
return item;
});
console.log(annotations)
where the first occurrence of the token down
should be the first matching position:
{
"index": 2,
"word": "down",
"characterOffsetBegin": 6,
"characterOffsetEnd": 9
}
So given that I have mapped the tokens position for each occurrence of the token in the text i.e. first occurrence of down
with the first match, the 2nd with the second match etc. I can reconstruct the text accordingly with the charOffsetBegin
and charOffsetEnd
hence doing like:
var newtext = '';
results.sentences.forEach(sentence => {
sentence.tokens.forEach(token => {
newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + ' ';
});
newtext += 'n';
});
javascript regex text-processing
I have an array
of tokens to map, and a regex that gets the begin and end positions of each token within an input sentence. This works ok when the token has one occurrence. When the token has multiple occurrences, the greedy Regex
will get all the matched positions of the token in the text, so the resulting position for the i-th token occurrence will be mapped by the last found position.
By example, given the text
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
the first occurrence of the token down
is mapped to the last position in the text matched by the RegExp
, hence I have:
{
"index": 2,
"word": "down",
"characterOffsetBegin": 70,
"characterOffsetEnd": 73
}
This becomes clear running this example:
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
}
return item;
});
console.log(annotations)
where the first occurrence of the token down
should be the first matching position:
{
"index": 2,
"word": "down",
"characterOffsetBegin": 6,
"characterOffsetEnd": 9
}
So given that I have mapped the tokens position for each occurrence of the token in the text i.e. first occurrence of down
with the first match, the 2nd with the second match etc. I can reconstruct the text accordingly with the charOffsetBegin
and charOffsetEnd
hence doing like:
var newtext = '';
results.sentences.forEach(sentence => {
sentence.tokens.forEach(token => {
newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + ' ';
});
newtext += 'n';
});
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
}
return item;
});
console.log(annotations)
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
}
return item;
});
console.log(annotations)
javascript regex text-processing
javascript regex text-processing
edited Nov 21 at 0:57
asked Nov 20 at 23:47
loretoparisi
7,57754771
7,57754771
1
Looks like it's not straight-forward
– radarbob
Nov 21 at 0:33
The problem is not the expression is greedy, the problem is that you are looking for every match via thewhile
loop, so the last one wins. Do you only want to search for the first match? One solution would be to keep track of the previous match position for a token, and ditch thewhile
loop.
– Felix Kling
Nov 21 at 0:33
@FelixKling yes this make senses, but I did not find a way to do it. The previous match also should be by token, i.e. it should be like a token map of matches liketokenMap[ matchIndex ]
, but I'm not sure of it.
– loretoparisi
Nov 21 at 0:42
I had posted an answer, but it's actually unclear to me what the desired result really is. How I understand it: You want to find the position of the occurrence of a token. Your example is a bit confusing though because not only does a token occur multiple times in the input text, you also have duplicate tokens in your token list! So, if tokens are unique, what should the result be if it occurs multiple times in the input text? If tokens are not unique, what should the result be if duplicate tokens only occur once in the input text? And finally, what if duplicate tokens occur multiple times?
– Felix Kling
Nov 21 at 0:49
1
The way you describe it it seems like the result should be: Find the position for first occurrence in the input text. If there are duplicate tokens, the second (,third, forth,...) "copy" of the token should find the second, (,third, forth, ...) occurrence in the input text. Is that correct?
– Felix Kling
Nov 21 at 0:51
|
show 3 more comments
1
Looks like it's not straight-forward
– radarbob
Nov 21 at 0:33
The problem is not the expression is greedy, the problem is that you are looking for every match via thewhile
loop, so the last one wins. Do you only want to search for the first match? One solution would be to keep track of the previous match position for a token, and ditch thewhile
loop.
– Felix Kling
Nov 21 at 0:33
@FelixKling yes this make senses, but I did not find a way to do it. The previous match also should be by token, i.e. it should be like a token map of matches liketokenMap[ matchIndex ]
, but I'm not sure of it.
– loretoparisi
Nov 21 at 0:42
I had posted an answer, but it's actually unclear to me what the desired result really is. How I understand it: You want to find the position of the occurrence of a token. Your example is a bit confusing though because not only does a token occur multiple times in the input text, you also have duplicate tokens in your token list! So, if tokens are unique, what should the result be if it occurs multiple times in the input text? If tokens are not unique, what should the result be if duplicate tokens only occur once in the input text? And finally, what if duplicate tokens occur multiple times?
– Felix Kling
Nov 21 at 0:49
1
The way you describe it it seems like the result should be: Find the position for first occurrence in the input text. If there are duplicate tokens, the second (,third, forth,...) "copy" of the token should find the second, (,third, forth, ...) occurrence in the input text. Is that correct?
– Felix Kling
Nov 21 at 0:51
1
1
Looks like it's not straight-forward
– radarbob
Nov 21 at 0:33
Looks like it's not straight-forward
– radarbob
Nov 21 at 0:33
The problem is not the expression is greedy, the problem is that you are looking for every match via the
while
loop, so the last one wins. Do you only want to search for the first match? One solution would be to keep track of the previous match position for a token, and ditch the while
loop.– Felix Kling
Nov 21 at 0:33
The problem is not the expression is greedy, the problem is that you are looking for every match via the
while
loop, so the last one wins. Do you only want to search for the first match? One solution would be to keep track of the previous match position for a token, and ditch the while
loop.– Felix Kling
Nov 21 at 0:33
@FelixKling yes this make senses, but I did not find a way to do it. The previous match also should be by token, i.e. it should be like a token map of matches like
tokenMap[ matchIndex ]
, but I'm not sure of it.– loretoparisi
Nov 21 at 0:42
@FelixKling yes this make senses, but I did not find a way to do it. The previous match also should be by token, i.e. it should be like a token map of matches like
tokenMap[ matchIndex ]
, but I'm not sure of it.– loretoparisi
Nov 21 at 0:42
I had posted an answer, but it's actually unclear to me what the desired result really is. How I understand it: You want to find the position of the occurrence of a token. Your example is a bit confusing though because not only does a token occur multiple times in the input text, you also have duplicate tokens in your token list! So, if tokens are unique, what should the result be if it occurs multiple times in the input text? If tokens are not unique, what should the result be if duplicate tokens only occur once in the input text? And finally, what if duplicate tokens occur multiple times?
– Felix Kling
Nov 21 at 0:49
I had posted an answer, but it's actually unclear to me what the desired result really is. How I understand it: You want to find the position of the occurrence of a token. Your example is a bit confusing though because not only does a token occur multiple times in the input text, you also have duplicate tokens in your token list! So, if tokens are unique, what should the result be if it occurs multiple times in the input text? If tokens are not unique, what should the result be if duplicate tokens only occur once in the input text? And finally, what if duplicate tokens occur multiple times?
– Felix Kling
Nov 21 at 0:49
1
1
The way you describe it it seems like the result should be: Find the position for first occurrence in the input text. If there are duplicate tokens, the second (,third, forth,...) "copy" of the token should find the second, (,third, forth, ...) occurrence in the input text. Is that correct?
– Felix Kling
Nov 21 at 0:51
The way you describe it it seems like the result should be: Find the position for first occurrence in the input text. If there are duplicate tokens, the second (,third, forth,...) "copy" of the token should find the second, (,third, forth, ...) occurrence in the input text. Is that correct?
– Felix Kling
Nov 21 at 0:51
|
show 3 more comments
2 Answers
2
active
oldest
votes
The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while
loop.
You have to do two things:
- Stop iterating once you found a match.
- Keep track of previous matches so that you can ignore them.
I believe this is what you want:
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
const seen = new Map();
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
if (match.index > (seen.get(word) || -1)) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
seen.set(word, wordEnd);
break;
}
}
return item;
});
console.log(annotations)
The seen
map keeps track of the end position of the most recent match for a token.
Since it isn't possible to tell the regex engine to ignore everything before a specific position, we are still using a while
loop, but are ignoring any matches that happen before the previous match, with if (match.index > (seen.get(word) || -1))
.
add a comment |
@Felix's answer covers the cause of your problem, but I'd like to take it a bit further.
I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.
class Annotations {
constructor(text) {
if(typeof text !== 'string') return null
const opt = { enumerable: false, configurable: false, writeable: false }
Object.defineProperty(this, 'text', { value: text, ...opt })
Object.defineProperty(this, 'tokens', { value: text.split(/s+/g), ...opt })
for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
}
* matchAll(token) {
if(typeof token === 'string' && this.text.indexOf(token) > -1) {
const expression = new RegExp("\b" + token + "\b", "g")
let match = expression.exec(this.text)
while(match !== null) {
const start = match.index
const end = start + token.length - 1
yield { start, end }
match = expression.exec(this.text)
}
}
}
}
const annotations = new Annotations("Steve down walks warily down the street downnWith the brim pulled way down low")
console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) //
.as-console-wrapper { max-height: 100% !important }
This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too!
– loretoparisi
Nov 21 at 8:15
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53403314%2fjavascript-map-a-regex-with-multiple-matches-at-the-right-occurrence%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while
loop.
You have to do two things:
- Stop iterating once you found a match.
- Keep track of previous matches so that you can ignore them.
I believe this is what you want:
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
const seen = new Map();
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
if (match.index > (seen.get(word) || -1)) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
seen.set(word, wordEnd);
break;
}
}
return item;
});
console.log(annotations)
The seen
map keeps track of the end position of the most recent match for a token.
Since it isn't possible to tell the regex engine to ignore everything before a specific position, we are still using a while
loop, but are ignoring any matches that happen before the previous match, with if (match.index > (seen.get(word) || -1))
.
add a comment |
The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while
loop.
You have to do two things:
- Stop iterating once you found a match.
- Keep track of previous matches so that you can ignore them.
I believe this is what you want:
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
const seen = new Map();
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
if (match.index > (seen.get(word) || -1)) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
seen.set(word, wordEnd);
break;
}
}
return item;
});
console.log(annotations)
The seen
map keeps track of the end position of the most recent match for a token.
Since it isn't possible to tell the regex engine to ignore everything before a specific position, we are still using a while
loop, but are ignoring any matches that happen before the previous match, with if (match.index > (seen.get(word) || -1))
.
add a comment |
The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while
loop.
You have to do two things:
- Stop iterating once you found a match.
- Keep track of previous matches so that you can ignore them.
I believe this is what you want:
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
const seen = new Map();
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
if (match.index > (seen.get(word) || -1)) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
seen.set(word, wordEnd);
break;
}
}
return item;
});
console.log(annotations)
The seen
map keeps track of the end position of the most recent match for a token.
Since it isn't possible to tell the regex engine to ignore everything before a specific position, we are still using a while
loop, but are ignoring any matches that happen before the previous match, with if (match.index > (seen.get(word) || -1))
.
The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while
loop.
You have to do two things:
- Stop iterating once you found a match.
- Keep track of previous matches so that you can ignore them.
I believe this is what you want:
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
const seen = new Map();
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
if (match.index > (seen.get(word) || -1)) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
seen.set(word, wordEnd);
break;
}
}
return item;
});
console.log(annotations)
The seen
map keeps track of the end position of the most recent match for a token.
Since it isn't possible to tell the regex engine to ignore everything before a specific position, we are still using a while
loop, but are ignoring any matches that happen before the previous match, with if (match.index > (seen.get(word) || -1))
.
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
const seen = new Map();
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
if (match.index > (seen.get(word) || -1)) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
seen.set(word, wordEnd);
break;
}
}
return item;
});
console.log(annotations)
var text = "Steve down walks warily down the street downnWith the brim pulled way down low";
var tokens = text.split(/s+/g);
const seen = new Map();
var annotations = tokens.map((word, tokenIndex) => { // for each token
let item = {
"index": (tokenIndex + 1),
"word": word
}
var wordRegex = RegExp("\b(" + word + ")\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
if (match.index > (seen.get(word) || -1)) {
var wordStart = match.index;
var wordEnd = wordStart + word.length - 1;
item.characterOffsetBegin = wordStart;
item.characterOffsetEnd = wordEnd;
seen.set(word, wordEnd);
break;
}
}
return item;
});
console.log(annotations)
edited Nov 21 at 0:55
answered Nov 21 at 0:44
Felix Kling
544k125847902
544k125847902
add a comment |
add a comment |
@Felix's answer covers the cause of your problem, but I'd like to take it a bit further.
I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.
class Annotations {
constructor(text) {
if(typeof text !== 'string') return null
const opt = { enumerable: false, configurable: false, writeable: false }
Object.defineProperty(this, 'text', { value: text, ...opt })
Object.defineProperty(this, 'tokens', { value: text.split(/s+/g), ...opt })
for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
}
* matchAll(token) {
if(typeof token === 'string' && this.text.indexOf(token) > -1) {
const expression = new RegExp("\b" + token + "\b", "g")
let match = expression.exec(this.text)
while(match !== null) {
const start = match.index
const end = start + token.length - 1
yield { start, end }
match = expression.exec(this.text)
}
}
}
}
const annotations = new Annotations("Steve down walks warily down the street downnWith the brim pulled way down low")
console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) //
.as-console-wrapper { max-height: 100% !important }
This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too!
– loretoparisi
Nov 21 at 8:15
add a comment |
@Felix's answer covers the cause of your problem, but I'd like to take it a bit further.
I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.
class Annotations {
constructor(text) {
if(typeof text !== 'string') return null
const opt = { enumerable: false, configurable: false, writeable: false }
Object.defineProperty(this, 'text', { value: text, ...opt })
Object.defineProperty(this, 'tokens', { value: text.split(/s+/g), ...opt })
for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
}
* matchAll(token) {
if(typeof token === 'string' && this.text.indexOf(token) > -1) {
const expression = new RegExp("\b" + token + "\b", "g")
let match = expression.exec(this.text)
while(match !== null) {
const start = match.index
const end = start + token.length - 1
yield { start, end }
match = expression.exec(this.text)
}
}
}
}
const annotations = new Annotations("Steve down walks warily down the street downnWith the brim pulled way down low")
console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) //
.as-console-wrapper { max-height: 100% !important }
This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too!
– loretoparisi
Nov 21 at 8:15
add a comment |
@Felix's answer covers the cause of your problem, but I'd like to take it a bit further.
I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.
class Annotations {
constructor(text) {
if(typeof text !== 'string') return null
const opt = { enumerable: false, configurable: false, writeable: false }
Object.defineProperty(this, 'text', { value: text, ...opt })
Object.defineProperty(this, 'tokens', { value: text.split(/s+/g), ...opt })
for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
}
* matchAll(token) {
if(typeof token === 'string' && this.text.indexOf(token) > -1) {
const expression = new RegExp("\b" + token + "\b", "g")
let match = expression.exec(this.text)
while(match !== null) {
const start = match.index
const end = start + token.length - 1
yield { start, end }
match = expression.exec(this.text)
}
}
}
}
const annotations = new Annotations("Steve down walks warily down the street downnWith the brim pulled way down low")
console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) //
.as-console-wrapper { max-height: 100% !important }
@Felix's answer covers the cause of your problem, but I'd like to take it a bit further.
I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.
class Annotations {
constructor(text) {
if(typeof text !== 'string') return null
const opt = { enumerable: false, configurable: false, writeable: false }
Object.defineProperty(this, 'text', { value: text, ...opt })
Object.defineProperty(this, 'tokens', { value: text.split(/s+/g), ...opt })
for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
}
* matchAll(token) {
if(typeof token === 'string' && this.text.indexOf(token) > -1) {
const expression = new RegExp("\b" + token + "\b", "g")
let match = expression.exec(this.text)
while(match !== null) {
const start = match.index
const end = start + token.length - 1
yield { start, end }
match = expression.exec(this.text)
}
}
}
}
const annotations = new Annotations("Steve down walks warily down the street downnWith the brim pulled way down low")
console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) //
.as-console-wrapper { max-height: 100% !important }
class Annotations {
constructor(text) {
if(typeof text !== 'string') return null
const opt = { enumerable: false, configurable: false, writeable: false }
Object.defineProperty(this, 'text', { value: text, ...opt })
Object.defineProperty(this, 'tokens', { value: text.split(/s+/g), ...opt })
for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
}
* matchAll(token) {
if(typeof token === 'string' && this.text.indexOf(token) > -1) {
const expression = new RegExp("\b" + token + "\b", "g")
let match = expression.exec(this.text)
while(match !== null) {
const start = match.index
const end = start + token.length - 1
yield { start, end }
match = expression.exec(this.text)
}
}
}
}
const annotations = new Annotations("Steve down walks warily down the street downnWith the brim pulled way down low")
console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) //
.as-console-wrapper { max-height: 100% !important }
class Annotations {
constructor(text) {
if(typeof text !== 'string') return null
const opt = { enumerable: false, configurable: false, writeable: false }
Object.defineProperty(this, 'text', { value: text, ...opt })
Object.defineProperty(this, 'tokens', { value: text.split(/s+/g), ...opt })
for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
}
* matchAll(token) {
if(typeof token === 'string' && this.text.indexOf(token) > -1) {
const expression = new RegExp("\b" + token + "\b", "g")
let match = expression.exec(this.text)
while(match !== null) {
const start = match.index
const end = start + token.length - 1
yield { start, end }
match = expression.exec(this.text)
}
}
}
}
const annotations = new Annotations("Steve down walks warily down the street downnWith the brim pulled way down low")
console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) //
.as-console-wrapper { max-height: 100% !important }
answered Nov 21 at 1:46
Tiny Giant
1
1
This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too!
– loretoparisi
Nov 21 at 8:15
add a comment |
This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too!
– loretoparisi
Nov 21 at 8:15
This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too!
– loretoparisi
Nov 21 at 8:15
This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too!
– loretoparisi
Nov 21 at 8:15
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53403314%2fjavascript-map-a-regex-with-multiple-matches-at-the-right-occurrence%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Looks like it's not straight-forward
– radarbob
Nov 21 at 0:33
The problem is not the expression is greedy, the problem is that you are looking for every match via the
while
loop, so the last one wins. Do you only want to search for the first match? One solution would be to keep track of the previous match position for a token, and ditch thewhile
loop.– Felix Kling
Nov 21 at 0:33
@FelixKling yes this make senses, but I did not find a way to do it. The previous match also should be by token, i.e. it should be like a token map of matches like
tokenMap[ matchIndex ]
, but I'm not sure of it.– loretoparisi
Nov 21 at 0:42
I had posted an answer, but it's actually unclear to me what the desired result really is. How I understand it: You want to find the position of the occurrence of a token. Your example is a bit confusing though because not only does a token occur multiple times in the input text, you also have duplicate tokens in your token list! So, if tokens are unique, what should the result be if it occurs multiple times in the input text? If tokens are not unique, what should the result be if duplicate tokens only occur once in the input text? And finally, what if duplicate tokens occur multiple times?
– Felix Kling
Nov 21 at 0:49
1
The way you describe it it seems like the result should be: Find the position for first occurrence in the input text. If there are duplicate tokens, the second (,third, forth,...) "copy" of the token should find the second, (,third, forth, ...) occurrence in the input text. Is that correct?
– Felix Kling
Nov 21 at 0:51