Channel: Data Preparation & Blending discussions

Thanksgiving Challenge: nGram Generator requiring a loop


Thought I'd challenge myself this Thanksgiving, but I may have bitten off more than I can chew. Any help appreciated.

 

In the attached workflow, I generate the candidate 5-grams (sequences of five consecutive words) and count them to find out which are most common. The macro then picks the high-frequency nGrams and replaces the words they cover with them. Unfortunately, this process requires multiple passes through the data and a creative approach to downgrading nGram priorities. The issue I encounter:

 

I need a good way to loop through the process until all possible nGrams have been selected and placed into the list of words. For example, suppose the first text in the dataset is "I think therefore I am amazing at thinking deep thoughts and love." This sentence would generate eight candidate nGrams:

1. I_think_therefore_I_am  

2. think_therefore_I_am_amazing 

3. therefore_I_am_amazing_at

4. I_am_amazing_at_thinking

5. am_amazing_at_thinking_deep

6. amazing_at_thinking_deep_thoughts

7. at_thinking_deep_thoughts_and

8. thinking_deep_thoughts_and_love
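To make the candidate list concrete, here is a rough sketch of the same sliding-window step in Python rather than in workflow tools. The function name `make_ngrams` and the way the trailing period is stripped are assumptions for illustration only, not what the attached workflow does.

```python
def make_ngrams(text, n=5):
    """Return every run of n consecutive words, joined with underscores."""
    # Assumption: only a trailing period needs removing before splitting.
    words = text.rstrip(".").split()
    # One candidate per starting position that still leaves room for n words.
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I think therefore I am amazing at thinking deep thoughts and love."
candidates = make_ngrams(sentence)

for number, gram in enumerate(candidates, start=1):
    print(number, gram)
```

A sentence of 12 words yields 12 - 5 + 1 = 8 candidates, which matches the list above.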

 

 

Any thoughts on how to set up the macro with a loop that knows when it has assigned all nGrams that can be assigned? For example, in the above set, nGram 4 might have a higher priority than the first three candidates, so the first round would proceed down to 4, only to discover that nGram 6 has an even higher priority, and accept 6 as the only nGram coded in this round. Before finishing, the macro would remove candidates 7 and 8, since they can't co-exist with 6. It should also drop candidates 2-5, since they all use a word (amazing) that has now been taken over by nGram 6. In the second round, nGram 1 should be selected.
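The logic I'm after could be sketched like this in Python, purely to pin down the stopping rule: the loop ends when no candidates remain, not after a fixed count. The `priority` values are made up for the example (in practice they would come from the frequency counts), and the function names and the tie-break on earliest position are assumptions, not part of the workflow.

```python
def select_ngrams(words, priority, n=5):
    """Pick the best remaining candidate each round until none are left."""
    # Map each start position to its candidate nGram.
    remaining = {i: "_".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    chosen, rounds = [], 0
    while remaining:  # dynamic exit: stop once every candidate is used or removed
        rounds += 1
        # Highest priority wins; ties go to the earliest position (assumption).
        best = max(remaining, key=lambda i: (priority.get(remaining[i], 0), -i))
        chosen.append((best, remaining[best]))
        # Two n-word windows share a word when their starts are < n apart.
        remaining = {i: g for i, g in remaining.items() if abs(i - best) >= n}
    return sorted(chosen), rounds


def apply_ngrams(words, chosen, n=5):
    """Rebuild the word list with the chosen nGrams substituted in."""
    starts = dict(chosen)
    out, i = [], 0
    while i < len(words):
        if i in starts:
            out.append(starts[i])
            i += n
        else:
            out.append(words[i])
            i += 1
    return out


words = "I think therefore I am amazing at thinking deep thoughts and love".split()
# Hypothetical priorities: nGram 6 highest, nGram 4 next, everything else 0.
priority = {
    "amazing_at_thinking_deep_thoughts": 9,
    "I_am_amazing_at_thinking": 5,
}
chosen, rounds = select_ngrams(words, priority)
result = apply_ngrams(words, chosen)
print(rounds, result)
```

With these made-up priorities, round one takes nGram 6 and removes candidates 2-5, 7 and 8; round two takes nGram 1, and the loop then stops because nothing is left. An iterative macro would need the equivalent of the `while remaining` test as its exit condition.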

 

Any thoughts welcome. I've reviewed the macro videos, but the examples are a bit contrived and use a fixed number of loops, whereas I need the number of iterations to be determined dynamically.

 

Kai :-)

 

