JavaScript map a regex with multiple matches at the right occurrence

I have an array of tokens to map, and a regex that finds the begin and end positions of each token within an input sentence. This works fine when a token occurs only once. When a token occurs multiple times, the greedy regex matches every position of that token in the text, so the position recorded for the i-th occurrence of the token ends up being the last position found.

For example, given the text

var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";

the first occurrence of the token down is mapped to the last position in the text matched by the RegExp, so I get:

{
  "index": 2,
  "word": "down",
  "characterOffsetBegin": 70,
  "characterOffsetEnd": 73
}

This becomes clear when running this example:

var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
var tokens = text.split(/\s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
  let item = {
    "index": (tokenIndex + 1),
    "word": word
  }
  var wordRegex = RegExp("\\b(" + word + ")\\b", "g");
  var match = null;
  // every match overwrites the offsets, so the last match wins
  while ((match = wordRegex.exec(text)) !== null) {
    var wordStart = match.index;
    var wordEnd = wordStart + word.length - 1;
    item.characterOffsetBegin = wordStart;
    item.characterOffsetEnd = wordEnd;
  }
  return item;
});
console.log(annotations)

where the first occurrence of the token down should be mapped to the first matching position:

{
  "index": 2,
  "word": "down",
  "characterOffsetBegin": 6,
  "characterOffsetEnd": 9
}

So, once each token is mapped to the position of its own occurrence in the text (the first occurrence of down to the first match, the second to the second match, and so on), I can reconstruct the text from characterOffsetBegin and characterOffsetEnd like this:

var newtext = '';
results.sentences.forEach(sentence => {
    sentence.tokens.forEach(token => {
        // offsets are inclusive, hence the + 1
        newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + ' ';
    });
    newtext += '\n';
});
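
To make the goal concrete, here is a minimal, self-contained sketch of that reconstruction. The `rebuild` helper and the hand-written `tokens` array are assumptions for illustration only; the full `results.sentences` structure is not shown in the question, so the sketch uses a flat list of tokens with the inclusive offsets that a correct mapping would be expected to produce.

```javascript
// Hypothetical helper: rebuild a space-separated string from token offsets.
// Offsets are inclusive on both ends, as in the expected output above.
function rebuild(text, tokens) {
  return tokens
    .map(t => text.substring(t.characterOffsetBegin, t.characterOffsetEnd + 1))
    .join(' ');
}

const text = "Steve down walks warily down the street down\nWith the brim pulled way down low";

// Hand-written offsets for the first three tokens (the mapping we want to obtain).
const tokens = [
  { index: 1, word: "Steve", characterOffsetBegin: 0, characterOffsetEnd: 4 },
  { index: 2, word: "down", characterOffsetBegin: 6, characterOffsetEnd: 9 },
  { index: 3, word: "walks", characterOffsetBegin: 11, characterOffsetEnd: 15 }
];

console.log(rebuild(text, tokens)); // "Steve down walks"
```

If the second token were mapped to 70-73 instead, the rebuilt string would still read the same, which is why the wrong mapping is easy to miss unless the offsets themselves are checked.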

edited Nov 21 at 0:57

asked Nov 20 at 23:47

loretoparisi

Looks like it's not straightforward
– radarbob
Nov 21 at 0:33

The problem is not that the expression is greedy; the problem is that you are looking for every match via the while loop, so the last one wins. Do you only want to search for the first match? One solution would be to keep track of the previous match position for a token, and ditch the while loop.
– Felix Kling
Nov 21 at 0:33

@FelixKling yes, this makes sense, but I did not find a way to do it. The previous match should also be tracked per token, i.e. something like a token map of matches such as tokenMap[ matchIndex ], but I'm not sure about it.
– loretoparisi
Nov 21 at 0:42

I had posted an answer, but it's actually unclear to me what the desired result really is. How I understand it: you want to find the position of the occurrence of a token. Your example is a bit confusing, though, because not only does a token occur multiple times in the input text, you also have duplicate tokens in your token list! So, if tokens are unique, what should the result be if a token occurs multiple times in the input text? If tokens are not unique, what should the result be if duplicate tokens only occur once in the input text? And finally, what if duplicate tokens occur multiple times?
– Felix Kling
Nov 21 at 0:49

The way you describe it, it seems like the result should be: find the position of the first occurrence in the input text. If there are duplicate tokens, the second (third, fourth, ...) "copy" of the token should find the second (third, fourth, ...) occurrence in the input text. Is that correct?
– Felix Kling
Nov 21 at 0:51

javascript regex text-processing

2 Answers

The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while loop.

You have to do two things:

Stop iterating once you have found a match.

Keep track of previous matches so that you can ignore them.

I believe this is what you want:

var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
var tokens = text.split(/\s+/g);
const seen = new Map(); // word -> end offset of its most recent match

var annotations = tokens.map((word, tokenIndex) => { // for each token
  let item = {
    "index": (tokenIndex + 1),
    "word": word
  }
  var wordRegex = RegExp("\\b(" + word + ")\\b", "g");
  var match = null;
  while ((match = wordRegex.exec(text)) !== null) {
    // skip matches already assigned to an earlier copy of this token
    if (match.index > (seen.get(word) || -1)) {
      var wordStart = match.index;
      var wordEnd = wordStart + word.length - 1;
      item.characterOffsetBegin = wordStart;
      item.characterOffsetEnd = wordEnd;

      seen.set(word, wordEnd);
      break; // stop at the first match that has not been used yet
    }
  }
  return item;
});
console.log(annotations)

The seen map keeps track of the end position of the most recent match for a token.

This version does not tell the regex engine to skip everything before a specific position, so it still uses a while loop and ignores any match that starts before the end of the previous match for that token, with if (match.index > (seen.get(word) || -1)).
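
As a possible variant, a global RegExp exposes a `lastIndex` property, and `exec` starts searching from it, so the search for a repeated token can resume right after its previous match instead of filtering matches in a loop. The following is only a sketch under assumptions: `nextStart` and `escapeRegExp` are names made up for this example, and the `\b` boundaries assume that tokens start and end with word characters.

```javascript
const text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
const tokens = text.split(/\s+/);

// Hypothetical helper: escape regex metacharacters so a token is matched literally.
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// For each distinct word, the position where the next search should start.
const nextStart = new Map();

const annotations = tokens.map((word, tokenIndex) => {
  const item = { index: tokenIndex + 1, word };
  const wordRegex = new RegExp("\\b" + escapeRegExp(word) + "\\b", "g");
  // exec() on a global regex begins searching at lastIndex.
  wordRegex.lastIndex = nextStart.get(word) ?? 0;
  const match = wordRegex.exec(text);
  if (match !== null) {
    item.characterOffsetBegin = match.index;
    item.characterOffsetEnd = match.index + word.length - 1; // inclusive end
    // Resume after this match the next time the same word appears.
    nextStart.set(word, wordRegex.lastIndex);
  }
  return item;
});

console.log(annotations.filter(a => a.word === "down"));
```

Using `??` rather than `||` for the default keeps a stored position of 0 from being treated as "not seen yet".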

edited Nov 21 at 0:55

answered Nov 21 at 0:44

Felix Kling

@Felix's answer covers the cause of your problem, but I'd like to take it a bit further.

I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.

class Annotations {
  constructor(text) {
    // note: `new` ignores a primitive return value, so this yields an empty instance
    if(typeof text !== 'string') return null
    const opt = { enumerable: false, configurable: false, writable: false }
    Object.defineProperty(this, 'text', { value: text, ...opt })
    Object.defineProperty(this, 'tokens', { value: text.split(/\s+/g), ...opt })
    // one own property per token, holding every { start, end } match of that token
    for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
  }
  * matchAll(token) {
    if(typeof token === 'string' && this.text.indexOf(token) > -1) {
      const expression = new RegExp("\\b" + token + "\\b", "g")
      let match = expression.exec(this.text)

      while(match !== null) {
        const start = match.index
        const end = start + token.length - 1
        yield { start, end }
        match = expression.exec(this.text)
      }
    }
  }
}

const annotations = new Annotations("Steve down walks warily down the street down\nWith the brim pulled way down low")

console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) // []
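
One way such per-token match lists could be turned into the mapping asked for in the question is to hand out the n-th range to the n-th copy of a token. This is a standalone sketch, not part of the class above: `findAll`, `ranges` and `consumed` are hypothetical names, and the tokens are assumed to contain only word characters so they can be placed in the pattern unescaped.

```javascript
const text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
const tokens = text.split(/\s+/);

// Hypothetical helper: every inclusive { start, end } range of a whole-word token.
// Assumes the token contains only word characters (no regex metacharacters).
function findAll(source, token) {
  const expression = new RegExp("\\b" + token + "\\b", "g");
  return Array.from(source.matchAll(expression), m => ({
    start: m.index,
    end: m.index + token.length - 1
  }));
}

const ranges = new Map();   // word -> all of its ranges in the text
const consumed = new Map(); // word -> how many ranges were already assigned

const annotations = tokens.map((word, tokenIndex) => {
  if (!ranges.has(word)) ranges.set(word, findAll(text, word));
  const n = consumed.get(word) ?? 0;
  consumed.set(word, n + 1);
  const range = ranges.get(word)[n]; // undefined if the text has fewer matches
  const item = { index: tokenIndex + 1, word };
  if (range !== undefined) {
    item.characterOffsetBegin = range.start;
    item.characterOffsetEnd = range.end;
  }
  return item;
});

console.log(annotations[1]);  // second token, first "down"
console.log(annotations[13]); // fourteenth token, last "down"
```

Each distinct word is searched only once here, at the cost of keeping all of its ranges in memory.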

answered Nov 21 at 1:46

Tiny Giant

This makes a lot of sense as a library utility, thank you! Considering the thread mentioned in the first comment above about how tricky this issue can be, I would recommend adding your version to std.js too!
– loretoparisi
Nov 21 at 8:15

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53403314%2fjavascript-map-a-regex-with-multiple-matches-at-the-right-occurrence%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while loop.

You have to do two things:

Stop iterating once you found a match.

Keep track of previous matches so that you can ignore them.

I believe this is what you want:

var text = "Steve down walks warily down the street downnWith the brim pulled way down low";

var tokens = text.split(/s+/g);

const seen = new Map();



var annotations = tokens.map((word, tokenIndex) => { // for each token

  let item = {

    "index": (tokenIndex + 1),

    "word": word

  }

  var wordRegex = RegExp("\b(" + word + ")\b", "g");

  var match = null;

  while ((match = wordRegex.exec(text)) !== null) {

    if (match.index > (seen.get(word) || -1)) {

      var wordStart = match.index;

      var wordEnd = wordStart + word.length - 1;

      item.characterOffsetBegin = wordStart;

      item.characterOffsetEnd = wordEnd;



      seen.set(word, wordEnd);

      break;

    }

  }

  return item;

});

console.log(annotations)

The seen map keeps track of the end position of the most recent match for a token.

edited Nov 21 at 0:55

answered Nov 21 at 0:44

Felix Kling

544k125847902

add a comment |

The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while loop.

You have to do two things:

Stop iterating once you found a match.

Keep track of previous matches so that you can ignore them.

I believe this is what you want:

var text = "Steve down walks warily down the street downnWith the brim pulled way down low";

var tokens = text.split(/s+/g);

const seen = new Map();



var annotations = tokens.map((word, tokenIndex) => { // for each token

  let item = {

    "index": (tokenIndex + 1),

    "word": word

  }

  var wordRegex = RegExp("\b(" + word + ")\b", "g");

  var match = null;

  while ((match = wordRegex.exec(text)) !== null) {

    if (match.index > (seen.get(word) || -1)) {

      var wordStart = match.index;

      var wordEnd = wordStart + word.length - 1;

      item.characterOffsetBegin = wordStart;

      item.characterOffsetEnd = wordEnd;



      seen.set(word, wordEnd);

      break;

    }

  }

  return item;

});

console.log(annotations)

The seen map keeps track of the end position of the most recent match for a token.

edited Nov 21 at 0:55

answered Nov 21 at 0:44

Felix Kling

544k125847902

add a comment |

The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while loop.

You have to do two things:

Stop iterating once you found a match.

Keep track of previous matches so that you can ignore them.

I believe this is what you want:

var text = "Steve down walks warily down the street downnWith the brim pulled way down low";

var tokens = text.split(/s+/g);

const seen = new Map();



var annotations = tokens.map((word, tokenIndex) => { // for each token

  let item = {

    "index": (tokenIndex + 1),

    "word": word

  }

  var wordRegex = RegExp("\b(" + word + ")\b", "g");

  var match = null;

  while ((match = wordRegex.exec(text)) !== null) {

    if (match.index > (seen.get(word) || -1)) {

      var wordStart = match.index;

      var wordEnd = wordStart + word.length - 1;

      item.characterOffsetBegin = wordStart;

      item.characterOffsetEnd = wordEnd;



      seen.set(word, wordEnd);

      break;

    }

  }

  return item;

});

console.log(annotations)

The seen map keeps track of the end position of the most recent match for a token.

edited Nov 21 at 0:55

answered Nov 21 at 0:44

Felix Kling

544k125847902

The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while loop.

You have to do two things:

Stop iterating once you found a match.

Keep track of previous matches so that you can ignore them.

I believe this is what you want:

var text = "Steve down walks warily down the street downnWith the brim pulled way down low";

var tokens = text.split(/s+/g);

const seen = new Map();



var annotations = tokens.map((word, tokenIndex) => { // for each token

  let item = {

    "index": (tokenIndex + 1),

    "word": word

  }

  var wordRegex = RegExp("\b(" + word + ")\b", "g");

  var match = null;

  while ((match = wordRegex.exec(text)) !== null) {

    if (match.index > (seen.get(word) || -1)) {

      var wordStart = match.index;

      var wordEnd = wordStart + word.length - 1;

      item.characterOffsetBegin = wordStart;

      item.characterOffsetEnd = wordEnd;



      seen.set(word, wordEnd);

      break;

    }

  }

  return item;

});

console.log(annotations)

The seen map keeps track of the end position of the most recent match for a token.

var text = "Steve down walks warily down the street downnWith the brim pulled way down low";

var tokens = text.split(/s+/g);

const seen = new Map();



var annotations = tokens.map((word, tokenIndex) => { // for each token

  let item = {

    "index": (tokenIndex + 1),

    "word": word

  }

  var wordRegex = RegExp("\b(" + word + ")\b", "g");

  var match = null;

  while ((match = wordRegex.exec(text)) !== null) {

    if (match.index > (seen.get(word) || -1)) {

      var wordStart = match.index;

      var wordEnd = wordStart + word.length - 1;

      item.characterOffsetBegin = wordStart;

      item.characterOffsetEnd = wordEnd;



      seen.set(word, wordEnd);

      break;

    }

  }

  return item;

});

console.log(annotations)

var text = "Steve down walks warily down the street downnWith the brim pulled way down low";

var tokens = text.split(/s+/g);

const seen = new Map();



var annotations = tokens.map((word, tokenIndex) => { // for each token

  let item = {

    "index": (tokenIndex + 1),

    "word": word

  }

  var wordRegex = RegExp("\b(" + word + ")\b", "g");

  var match = null;

  while ((match = wordRegex.exec(text)) !== null) {

    if (match.index > (seen.get(word) || -1)) {

      var wordStart = match.index;

      var wordEnd = wordStart + word.length - 1;

      item.characterOffsetBegin = wordStart;

      item.characterOffsetEnd = wordEnd;



      seen.set(word, wordEnd);

      break;

    }

  }

  return item;

});

console.log(annotations)

edited Nov 21 at 0:55

answered Nov 21 at 0:44

Felix Kling

544k125847902

edited Nov 21 at 0:55

answered Nov 21 at 0:44

Felix Kling

544k125847902

answered Nov 21 at 0:44

Felix Kling

544k125847902

answered Nov 21 at 0:44

Felix Kling

544k125847902

add a comment |

@Felix's answer covers the cause of your problem, but I'd like to take it a bit further.

I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.

class Annotations {

  constructor(text) {

    if(typeof text !== 'string') return null

    const opt = { enumerable: false, configurable: false, writeable: false }

    Object.defineProperty(this, 'text', { value: text, ...opt })

    Object.defineProperty(this, 'tokens', { value: text.split(/s+/g), ...opt })

    for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))

  }

  * matchAll(token) {

    if(typeof token === 'string' && this.text.indexOf(token) > -1) {

      const expression = new RegExp("\b" + token + "\b", "g")

      let match = expression.exec(this.text)



      while(match !== null) {

        const start = match.index

        const end = start + token.length - 1

        yield { start, end }

        match = expression.exec(this.text)

      }

    }

  }

}



const annotations = new Annotations("Steve down walks warily down the street downnWith the brim pulled way down low")



console.log(annotations.text)

console.log(annotations.tokens)

console.log(annotations)

console.log(Array.from(annotations.matchAll('foo'))) //

.as-console-wrapper { max-height: 100% !important }

answered Nov 21 at 1:46

Tiny Giant

This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too!
– loretoparisi
Nov 21 at 8:15

add a comment |

@Felix's answer covers the cause of your problem, but I'd like to take it a bit further.

I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.

class Annotations {

  constructor(text) {

    if(typeof text !== 'string') return null

    const opt = { enumerable: false, configurable: false, writeable: false }

    Object.defineProperty(this, 'text', { value: text, ...opt })

    Object.defineProperty(this, 'tokens', { value: text.split(/s+/g), ...opt })

    for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))

  }

  * matchAll(token) {

    if(typeof token === 'string' && this.text.indexOf(token) > -1) {

      const expression = new RegExp("\b" + token + "\b", "g")

      let match = expression.exec(this.text)



      while(match !== null) {

        const start = match.index

        const end = start + token.length - 1

        yield { start, end }

        match = expression.exec(this.text)

      }

    }

  }

}



const annotations = new Annotations("Steve down walks warily down the street downnWith the brim pulled way down low")



console.log(annotations.text)

console.log(annotations.tokens)

console.log(annotations)

console.log(Array.from(annotations.matchAll('foo'))) //

.as-console-wrapper { max-height: 100% !important }

answered Nov 21 at 1:46

Tiny Giant

This makes a lot of sense as a library utility, thank you! Considered the thread mentioned in the first comment above about how this issue could be tricky, I would recommend to add your version to std.js too!
– loretoparisi
Nov 21 at 8:15
