This topic has been marked as "dormant": the last message is more than 90 days old. You can revive it by adding a message.
1timspalding
I'm looking for an easy way to identify the most common repeated strings in some code. I've got a big hunk of HTML and JavaScript with many repetitions of very long class, id, and function names (e.g., "LibraryThing.collections.toggleBookForCollectionFromMenu"). I want to slim it down by making them shorter.
I'd love a script, preferably in PHP, that showed the longest such strings, so I can approach the issue systematically. I'm guessing this is an easy problem, and that it's baked into algorithms like ZIP. I'm sure there's an elegant algorithm for it.
Anyone have a good solution?
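The hunch about ZIP is essentially right: DEFLATE (the algorithm behind zip and gzip) replaces repeated strings with back-references, so text full of long repeated names compresses dramatically. A quick sanity check in Python, reusing the identifier from the question as a stand-in for real code:

```python
import zlib

# A blob of fake "code" that repeats one very long identifier many times,
# like the HTML/JavaScript described in the question.
name = "LibraryThing.collections.toggleBookForCollectionFromMenu"
blob = (name + "();\n").encode() * 200

packed = zlib.compress(blob)
print(len(blob), "bytes ->", len(packed), "bytes compressed")
```

The compressor's sliding-window dictionary is doing exactly the repeated-string detection being asked about; it just doesn't report which strings it found, which is why a separate analysis script is still needed.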
2timspalding
I tried it over on StackOverflow. The solution is VERY slow. I've got my own, longer and more annoying one that's a little faster, mostly because it goes word by word.
http://stackoverflow.com/questions/1182208/find-longest-repeating-strings
3Sander314
Sounds like a variant on http://en.wikipedia.org/wiki/Longest_repeated_substring_problem
Not very easy to implement in the most general case.
In your case, however, you can use something easier: you probably only want strings matching [a-zA-Z0-9.]+, and for that some kind of brute-force regex+hash solution will suffice. Do you have some example input?
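A minimal sketch of that brute-force regex+hash approach, in Python rather than the requested PHP; the character class (a leading letter, then letters, digits, and dots) is an assumption about what counts as an identifier here:

```python
import re
from collections import Counter

def top_identifiers(source: str, top: int = 10):
    """Rank identifier-like tokens by total bytes occupied:
    token length times number of occurrences."""
    # Assumed token shape: a letter followed by letters, digits, or dots.
    tokens = re.findall(r'[a-zA-Z][a-zA-Z0-9.]*', source)
    counts = Counter(tokens)
    return sorted(((len(t) * n, n, t) for t, n in counts.items()),
                  reverse=True)[:top]

# The long name dominates even though "x" occurs more often.
code = "LibraryThing.foo(); x; x; x; LibraryThing.foo();"
for waste, n, token in top_identifiers(code):
    print(waste, n, token)
```

Sorting by length times count, rather than length alone, surfaces the names whose shortening would actually save the most bytes.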
4kevinmcguinness
There is a solution to a similar problem in Jon Bentley's Programming Pearls, Column 15. He uses a data structure called a suffix array to create a very efficient solution for plain text. You might need to extract what you're interested in from the program text first, but Bentley's solution should work.
Column 15 of the book is available on the book's website: http://www.cs.bell-labs.com/cm/cs/pearls/strings.html
-Kevin
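For reference, the suffix-array idea can be sketched in a few lines of Python. This is the naive construction (roughly O(n² log n) in the worst case), not Bentley's tuned C, but it's fine for ordinary source files: the longest repeated substring is the longest common prefix of two adjacent suffixes in sorted order.

```python
def longest_repeated_substring(s: str) -> str:
    # Naive suffix array: sort all suffix start positions lexicographically.
    sa = sorted(range(len(s)), key=lambda i: s[i:])

    def common_prefix(a: int, b: int) -> int:
        # Length of the longest common prefix of the suffixes at a and b.
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n

    best_len, best_pos = 0, 0
    # Any repeated substring is a prefix of two suffixes, and those two
    # suffixes end up adjacent (or bracketing) in the sorted order, so
    # checking adjacent pairs suffices.
    for x, y in zip(sa, sa[1:]):
        k = common_prefix(x, y)
        if k > best_len:
            best_len, best_pos = k, x
    return s[best_pos:best_pos + best_len]

print(longest_repeated_substring("banana"))
```

Bentley's version gets the same answer much faster on large inputs by avoiding repeated full-suffix comparisons, but the structure of the algorithm is the same.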
5Amtep
Hmm. I suspect that the normal unix tools are up for this job, unless you have gigabytes of code to analyze. We don't need no stinkin' elegance!
This pipeline will find the longest identifiers for you:
grep -oh '[a-zA-Z][a-zA-Z0-9.]*' $FILES | sort | uniq | perl -n -e 'print length($_)-1, " ", $_' | sort -nr | head -10
(Brackets in the patterns trigger touchstones; is there any way to turn touchstones off?)
If you want to sort by the product of length and repetitions, in order to find the total bytes wasted on an identifier, you can use uniq -c and do a bit more magic in the perl script like this:
grep -oh '[a-zA-Z][a-zA-Z0-9.]*' $FILES | sort | uniq -c | perl -na -e 'print $F[0] * length($F[1]), " $F[1]\n"' | sort -nr | head -10
If you have too many files to list them all in $FILES, then grep -r is very nice.
6Sander314
#5: oohhh, that's clever :)
I don't think there's any way to turn off touchstones, though this might work: [whatever]
[] -> [ ]
ETA: works. Very awkward to edit, though.
Or just link to a pastie: Ruby solution
7bnielsen
Yes, #5 is clever. I'd use sed to split on blanks and maybe add a test so only strings that occur more than once are included, but that's just bells and whistles.
8Amtep
uniq has an option for that too: add -d to only print lines that occur more than once. I really love uniq :) And especially the combination of sort | uniq -c | sort -n has come in handy many times.
Where would you use sed? I didn't need it because grep -o already prints each match on a separate line. I didn't know about grep -o before I looked it up for this task, so clue++ for me :)
9bnielsen
I'd use
sed -e 's/ /\n/g' $FILES | sort ...
rather than
grep -oh '{a-zA-Z}{a-zA-Z0-9.}*' $FILES | sort ...
so I'd get some slightly different strings, but that's just details. Hmm, uniq -d I didn't know about, so I have a few scripts with uniq -c | grep -v ' 1 '
clue++ for me too.
10Makis
I would personally get a programming IDE with refactoring capability to do this job. Eclipse seems to have a plugin for JavaScript refactoring:
http://www.hsr.ch/uploads/tx_icscrm/i_ba_2008_bachmann_pfister.pdf