JavaScript map a regex with multiple matches at the right occurrence












I have an array of tokens to map, and a regex that finds the begin and end positions of each token within an input sentence. This works fine when a token occurs once. When a token occurs multiple times, the greedy regex matches every position of the token in the text, so the i-th occurrence of the token ends up mapped to the last position found.



For example, given the text



var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";


the first occurrence of the token down is mapped to the last position in the text matched by the RegExp, hence I have:



 {
"index": 2,
"word": "down",
"characterOffsetBegin": 70,
"characterOffsetEnd": 73
}


This becomes clear when running this example:






var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
var tokens = text.split(/\s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
  let item = {
    "index": (tokenIndex + 1),
    "word": word
  }
  var wordRegex = RegExp("\\b(" + word + ")\\b", "g");
  var match = null;
  while ((match = wordRegex.exec(text)) !== null) {
    var wordStart = match.index;
    var wordEnd = wordStart + word.length - 1;
    item.characterOffsetBegin = wordStart;
    item.characterOffsetEnd = wordEnd;
  }
  return item;
});
console.log(annotations)





where the first occurrence of the token down should be the first matching position:



 {
"index": 2,
"word": "down",
"characterOffsetBegin": 6,
"characterOffsetEnd": 9
}


Once each occurrence of a token is mapped to its own position in the text (the first occurrence of down to the first match, the second to the second match, and so on), I can reconstruct the text from characterOffsetBegin and characterOffsetEnd like this:



var newtext = '';
results.sentences.forEach(sentence => {
  sentence.tokens.forEach(token => {
    newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + ' ';
  });
  newtext += '\n';
});
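A round trip like this can also serve as a check on the offsets. The following is only a minimal sketch: it assumes a flat token array instead of the results.sentences structure above, it takes the offsets directly from a single /\S+/g scan of the text rather than from a per-token search, and the names tokenizeWithOffsets and rebuild are hypothetical.

```javascript
// Sketch: take offsets straight from the tokenizer, then rebuild the text from them.
// `tokenizeWithOffsets` and `rebuild` are hypothetical helper names.
function tokenizeWithOffsets(text) {
  // Each match of /\S+/g is one whitespace-delimited token with a known start index.
  return Array.from(text.matchAll(/\S+/g), (m, i) => ({
    index: i + 1,
    word: m[0],
    characterOffsetBegin: m.index,
    characterOffsetEnd: m.index + m[0].length - 1, // inclusive end, as in the question
  }));
}

function rebuild(text, tokens) {
  // Slice every token back out of the original text using its offsets.
  return tokens
    .map(t => text.substring(t.characterOffsetBegin, t.characterOffsetEnd + 1))
    .join(" ");
}

const sample = "Steve down walks warily down the street down\nWith the brim pulled way down low";
const sampleTokens = tokenizeWithOffsets(sample);
console.log(sampleTokens[1]);
console.log(rebuild(sample, sampleTokens) === sample.split(/\s+/).join(" "));
```

If the rebuilt string differs from the whitespace-normalized input, at least one token has been mapped to the wrong occurrence.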









javascript regex text-processing






asked Nov 20 at 23:47 by loretoparisi, edited Nov 21 at 0:57








  • Looks like it's not straight-forward
    – radarbob
    Nov 21 at 0:33










  • The problem is not that the expression is greedy; the problem is that you are looking for every match via the while loop, so the last one wins. Do you only want to search for the first match? One solution would be to keep track of the previous match position for a token, and ditch the while loop.
    – Felix Kling
    Nov 21 at 0:33












  • @FelixKling yes, this makes sense, but I did not find a way to do it. The previous match should also be tracked per token, i.e. it should be something like a token map of matches, like tokenMap[ matchIndex ], but I'm not sure about it.
    – loretoparisi
    Nov 21 at 0:42










  • I had posted an answer, but it's actually unclear to me what the desired result really is. How I understand it: You want to find the position of the occurrence of a token. Your example is a bit confusing though because not only does a token occur multiple times in the input text, you also have duplicate tokens in your token list! So, if tokens are unique, what should the result be if it occurs multiple times in the input text? If tokens are not unique, what should the result be if duplicate tokens only occur once in the input text? And finally, what if duplicate tokens occur multiple times?
    – Felix Kling
    Nov 21 at 0:49








  • The way you describe it, it seems like the result should be: find the position of the first occurrence in the input text. If there are duplicate tokens, the second (third, fourth, ...) "copy" of the token should find the second (third, fourth, ...) occurrence in the input text. Is that correct?
    – Felix Kling
    Nov 21 at 0:51
















2 Answers
The problem is not that the expression is greedy, but that you are looking for every match of a token inside the input string with your while loop.



You have to do two things:




  • Stop iterating once you have found a match.

  • Keep track of previous matches so that you can ignore them.


I believe this is what you want:






var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
var tokens = text.split(/\s+/g);
const seen = new Map();

var annotations = tokens.map((word, tokenIndex) => { // for each token
  let item = {
    "index": (tokenIndex + 1),
    "word": word
  }
  var wordRegex = RegExp("\\b(" + word + ")\\b", "g");
  var match = null;
  while ((match = wordRegex.exec(text)) !== null) {
    if (match.index > (seen.get(word) || -1)) {
      var wordStart = match.index;
      var wordEnd = wordStart + word.length - 1;
      item.characterOffsetBegin = wordStart;
      item.characterOffsetEnd = wordEnd;

      seen.set(word, wordEnd);
      break;
    }
  }
  return item;
});
console.log(annotations)





The seen map keeps track of the end position of the most recent match for a token.



The code still uses a while loop, but it ignores any match that does not start after the end of the previous match for the same token, with if (match.index > (seen.get(word) || -1)).
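As an alternative to filtering matches inside the loop, a regex created with the g flag starts its next exec() call at its lastIndex property, so the search can be told where to begin. The following is a minimal sketch under some assumptions: the helper names escapeRegExp and annotate are hypothetical, tokens are escaped before being placed in the pattern, and whitespace lookarounds replace \b so that tokens ending in punctuation still match (lookbehind needs an ES2018 engine).

```javascript
// Sketch: map the n-th copy of a token to its n-th occurrence in the text.
// `escapeRegExp` and `annotate` are hypothetical helper names.
function escapeRegExp(s) {
  // Escape characters that have a special meaning inside a regular expression.
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function annotate(text) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const nextStart = new Map(); // token -> index where its next search begins

  return tokens.map((word, tokenIndex) => {
    const item = { index: tokenIndex + 1, word };
    // Assumption: tokens are delimited by whitespace, so lookarounds are used instead of \b.
    const re = new RegExp("(?<!\\S)" + escapeRegExp(word) + "(?!\\S)", "g");
    re.lastIndex = nextStart.get(word) ?? 0; // exec() starts searching from here
    const match = re.exec(text);
    if (match !== null) {
      item.characterOffsetBegin = match.index;
      item.characterOffsetEnd = match.index + word.length - 1;
      nextStart.set(word, re.lastIndex); // just past this match
    }
    return item;
  });
}

const text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
const annotations = annotate(text);
console.log(annotations.filter(a => a.word === "down"));
```

Each token is searched for at most once per copy, and a token that is missing from the text simply keeps no offsets, as in the code above.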






    @Felix's answer covers the cause of your problem, but I'd like to take it a bit further.



    I would put everything in a class (or a constructor) to keep it contained, and separate the logic for extracting the matches from the text for each token from the iteration of the tokens.






    class Annotations {
      constructor(text) {
        if(typeof text !== 'string') return null
        const opt = { enumerable: false, configurable: false, writable: false }
        Object.defineProperty(this, 'text', { value: text, ...opt })
        Object.defineProperty(this, 'tokens', { value: text.split(/\s+/g), ...opt })
        // one property per distinct token, holding every { start, end } match
        for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
      }
      * matchAll(token) {
        if(typeof token === 'string' && this.text.indexOf(token) > -1) {
          const expression = new RegExp("\\b" + token + "\\b", "g")
          let match = expression.exec(this.text)

          while(match !== null) {
            const start = match.index
            const end = start + token.length - 1
            yield { start, end }
            match = expression.exec(this.text)
          }
        }
      }
    }

    const annotations = new Annotations("Steve down walks warily down the street down\nWith the brim pulled way down low")

    console.log(annotations.text)
    console.log(annotations.tokens)
    console.log(annotations)
    console.log(Array.from(annotations.matchAll('foo')))
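    To get back to the per-token result asked for in the question, the per-token match lists can be consumed in order: the n-th copy of a token takes the n-th entry of its list. This is only a sketch of that idea, not part of the class above; the names occurrencesOf and annotateInOrder are hypothetical, it relies on the built-in String.prototype.matchAll (ES2020), and it assumes whitespace-delimited tokens.

```javascript
// Sketch: precompute every occurrence per distinct token, then hand them out in order.
// `occurrencesOf` and `annotateInOrder` are hypothetical names.
function occurrencesOf(text, token) {
  const escaped = token.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Assumption: tokens are delimited by whitespace (or the ends of the text).
  const re = new RegExp("(?<!\\S)" + escaped + "(?!\\S)", "g");
  return Array.from(text.matchAll(re), m => ({
    start: m.index,
    end: m.index + token.length - 1, // inclusive end
  }));
}

function annotateInOrder(text) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const queues = new Map(); // token -> { list: occurrences, next: index into list }
  return tokens.map((word, i) => {
    if (!queues.has(word)) queues.set(word, { list: occurrencesOf(text, word), next: 0 });
    const q = queues.get(word);
    const hit = q.list[q.next++]; // undefined if there are more copies than occurrences
    return {
      index: i + 1,
      word,
      characterOffsetBegin: hit?.start,
      characterOffsetEnd: hit?.end,
    };
  });
}

const input = "Steve down walks warily down the street down\nWith the brim pulled way down low";
const ordered = annotateInOrder(input);
console.log(ordered.filter(t => t.word === "down"));
```

    Compared with searching again for every copy, this scans the text once per distinct token and keeps the bookkeeping in one Map.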

    • This makes a lot of sense as a library utility, thank you! Considering the thread mentioned in the first comment above about how tricky this issue can be, I would recommend adding your version to std.js too!
      – loretoparisi
      Nov 21 at 8:15











        edited Nov 21 at 0:55

























        answered Nov 21 at 0:44









        Felix Kling















            @Felix's answer covers the cause of your problem, but I'd like to take it a bit further.



            I would put everything in a class (or a constructor function) to keep it self-contained, and separate the logic that extracts a token's matches from the text from the iteration over the tokens.






            class Annotations {
              constructor(text) {
                // Returning null from a constructor is ignored, so reject bad input explicitly.
                if (typeof text !== 'string') throw new TypeError('text must be a string')
                // Keep text and tokens read-only and hidden from enumeration.
                const opt = { enumerable: false, configurable: false, writable: false }
                Object.defineProperty(this, 'text', { value: text, ...opt })
                Object.defineProperty(this, 'tokens', { value: text.split(/\s+/g), ...opt })
                // One enumerable property per distinct token, holding all of its matches.
                for (let token of this.tokens) this[token] = Array.from(this.matchAll(token))
              }
              // Yields the { start, end } offsets of every whole-word match of token.
              * matchAll(token) {
                if (typeof token === 'string' && this.text.indexOf(token) > -1) {
                  const expression = new RegExp("\\b" + token + "\\b", "g")
                  let match = expression.exec(this.text)

                  while (match !== null) {
                    const start = match.index
                    const end = start + token.length - 1
                    yield { start, end }
                    match = expression.exec(this.text)
                  }
                }
              }
            }

            const annotations = new Annotations("Steve down walks warily down the street down\nWith the brim pulled way down low")

            console.log(annotations.text)
            console.log(annotations.tokens)
            console.log(annotations)
            console.log(Array.from(annotations.matchAll('foo'))) // []
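
            The per-token match lists can be turned back into one annotation per token by giving the i-th occurrence of a token the i-th entry of its list. The sketch below illustrates this with plain functions rather than the class above; annotateInOrder, matchAllOf and the used counter are hypothetical names, and it assumes non-empty tokens without regex metacharacters.

```javascript
// Hypothetical helper: yield the offsets of every whole-word match of token.
function* matchAllOf(text, token) {
  const expression = new RegExp("\\b" + token + "\\b", "g");
  let match;
  while ((match = expression.exec(text)) !== null) {
    yield { start: match.index, end: match.index + token.length - 1 };
  }
}

// Assign the i-th occurrence of each token to the i-th match of that token.
function annotateInOrder(text) {
  const tokens = text.split(/\s+/g);
  const matches = new Map(); // token -> all of its matches, in text order
  const used = new Map();    // token -> number of occurrences consumed so far

  return tokens.map((word, i) => {
    if (!matches.has(word)) matches.set(word, Array.from(matchAllOf(text, word)));
    const n = used.get(word) || 0;
    used.set(word, n + 1);
    const m = matches.get(word)[n]; // undefined if no match is left
    return {
      index: i + 1,
      word,
      characterOffsetBegin: m ? m.start : undefined,
      characterOffsetEnd: m ? m.end : undefined
    };
  });
}

const sample = "Steve down walks warily down the street down\nWith the brim pulled way down low";
const ordered = annotateInOrder(sample);
const theOffsets = ordered
  .filter(a => a.word === "the")
  .map(a => [a.characterOffsetBegin, a.characterOffsetEnd]);
console.log(JSON.stringify(theOffsets));
// → [[29,31],[50,52]]
```

            Each token's regex is run over the text only once, and the counter keeps the occurrences in order.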






















            • This makes a lot of sense as a library utility, thank you! Considering the thread mentioned in the first comment above about how tricky this issue can be, I would recommend adding your version to std.js too!
              – loretoparisi
              Nov 21 at 8:15


























            answered Nov 21 at 1:46









            Tiny Giant


















