Most common repeated strings

DiscussiePurely Programmers

Sluit je aan bij LibraryThing om te posten.

Most common repeated strings

Dit onderwerp is gemarkeerd als "slapend"—het laatste bericht is van meer dan 90 dagen geleden. Je kan het activeren door een een bericht toe te voegen.

1timspalding
jul 25, 2009, 10:19 am

I'm looking for an easy way to identify the most common repeated strings in some code. I've got a big hunk of HTML and JavaScript with many repetitions of very long class, id and function names (eg., "LibraryThing.collections.toggleBookForCollectionFromMenu"). I want to simplify it down by making them smaller.

I'd love a script, preferably in PHP, that showed the longest such strings, so I can approach the issue systematically. I'm guessing this is an easy problem, and that it's baked into algorithms like ZIP. I'm sure there's an elegant algorithm for it.

Anyone have a good solution?

2timspalding
jul 25, 2009, 1:08 pm

I tried it over on StackOverflow. The solution is VERY slow. I've got my own, longer and more annoying one that's a little faster, mostly because it goes word by word.

http://stackoverflow.com/questions/1182208/find-longest-repeating-strings

3Sander314
jul 25, 2009, 1:12 pm

Sounds like a variant on http://en.wikipedia.org/wiki/Longest_repeated_substring_problem
Not very easy to implement in the most general case.

In your case however, you can use something easier as you probably want strings matching a-zA-z0-9\.+ and in that case some kind of brute force regex+hash solution will suffice. Do you have some example input?

4kevinmcguinness
jul 27, 2009, 7:39 pm

There is a solution to a similar problem in Jon Bentley's Programming Pearls, Column 15. He uses a data structure called a suffix array to create a very efficient solution for plain text. You might need to extract what you're interested in from the program text first, but Bentley's solution should work.

Column 15 of the book is available on the books website: http://www.cs.bell-labs.com/cm/cs/pearls/strings.html

-Kevin

5Amtep
Bewerkt: jul 28, 2009, 4:16 pm

Hmm. I suspect that the normal unix tools are up for this job, unless you have gigabytes of code to analyze. We don't need no stinkin' elegance!

This pipeline will find the longest identifiers for you:

grep -oh '{a-zA-Z}{a-zA-Z0-9.}*' $FILES | sort | uniq | perl -n -e 'print length($_)-1, " ", $_' | sort -nr | head -10

(replace all braces with brackets to make it work. Is there any way to turn touchstones off?)

If you want to sort by the product of length and repetitions, in order to find the total bytes wasted on an identifier, you can use uniq -c and do a bit more magic in the perl script like this:

grep -oh '{a-zA-Z}{a-zA-Z0-9.}*' $FILES | sort | uniq -c | perl -na -e 'print $F{0} * length($F{1}), " $F{1}\\n"' | sort -nr | head -10

(again, replace all braces with brackets)

If you have too many files to list them all in $FILES, then grep -r is very nice.

6Sander314
Bewerkt: jul 28, 2009, 4:57 pm

#5: oohhh, that's clever :)

I don't think there's any way to turn off touchstones, though this might work: [whatever]
[] -> [ ]
ETA: works. very awkward in editing though.

Or just link to a pastie: Ruby solution

7bnielsen
jul 29, 2009, 6:06 am

Yes, #5 is clever. I'd use sed to split on blanks and maybe add a test so only strings that occur more than once are included, but that's just bells and whistles.

8Amtep
jul 29, 2009, 6:29 am

uniq has an option for that too: add -d to only print lines that occur more than once. I really love uniq :) And especially the combination of sort | uniq -c | sort -n has come in handy many times.

Where would you use sed? I didn't need it because grep -o already prints each match on a separate line. I didn't know about grep -o before I looked it up for this task, so clue++ for me :)

9bnielsen
jul 29, 2009, 8:11 am

I'd use

sed -e 's/ /\n/g' $FILES | sort ...

rather than

grep -oh '{a-zA-Z}{a-zA-Z0-9.}*' $FILES | sort ...

so I'd get some slightly different strings, but that's just details. Hmm, uniq -d I didn't know about, so I have a few scripts with uniq -c | grep -v ' 1 '

clue++ for me too.

10Makis
aug 3, 2009, 9:55 am

I would personally get a programming IDE with refactoring capability to do this job. Eclipse seems to have a plugin for JavaScript refactoring:
http://www.hsr.ch/uploads/tx_icscrm/i_ba_2008_bachmann_pfister.pdf