Spell checker (closed)
Hello,
EDIT - March 15, 2009 - project has now been transformed into a library. See the library section of this forum.
EDIT/UPDATE: March 5, 2009 - see below and post 11
UPDATE: see notes below
I've been working on this project off and on for the last 10 days. This is my first little "serious" project with Basic4PPC, Windows Mobile and PPC - all in order to learn something new.
I got the idea for this project while following the other thread, "Personal Pocket PC Wiki", where I noticed that a spell checker was on the "todo/wish-list".
Basically, a spell checker customarily consists of two parts:
1) A set of routines for scanning text and extracting words, and
2) An algorithm for comparing the extracted words against a known list of correctly spelled words (i.e., the dictionary). (Source: Wikipedia)
This is more or less what I have implemented in the source code attached to this post, although I have also added the option to ignore a misspelled word or to replace it with a word of your own (which can also be saved to the dictionary).
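To make the two parts concrete, here is a minimal sketch in Python (not Basic4PPC, and not the attached source). The function names `extract_words` and `find_misspelled`, the tiny word list and the `ignored` parameter are made up for illustration only.

```python
import re

def extract_words(text):
    """Part 1: scan the text and pull out the words (letters, optional apostrophe)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

def find_misspelled(text, dictionary, ignored=()):
    """Part 2: return every extracted word that is neither in the dictionary nor ignored."""
    # Compare in upper case so that "The" and "the" count as the same word.
    known = {w.upper() for w in dictionary} | {w.upper() for w in ignored}
    return [w for w in extract_words(text) if w.upper() not in known]

# Hypothetical five-word dictionary, just for the demo.
dictionary = ["the", "cat", "sat", "on", "mat"]
print(find_misspelled("The cat szat on the matt", dictionary))
# prints ['szat', 'matt']
```

Passing `ignored=["matt"]` would leave only `szat` in the result, which is roughly what the "ignore" option does. Saving a replacement word to the dictionary would simply mean appending it to the word list.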
However, what I have described above is only "half" a spell checker, since these days spell checkers also suggest replacements/corrections for misspelled words (along with other things such as synonyms and grammar hints). The program can generate such suggestions using various techniques:
-phonetic algorithms such as "Soundex", among others
-word lists of commonly misspelled words and commonly transposed letters
-"edit distance" algorithms, which measure how different two sequences are (the number of changes needed to turn one into the other). A famous one is the "Levenshtein distance"
-and other techniques
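For readers who have not met these two algorithms, here is a rough Python sketch of both. It follows the standard textbook definitions (American Soundex and the classic dynamic-programming Levenshtein distance). It is not the code of the library mentioned in this post, and the function names are my own.

```python
# Soundex digit for each consonant group; vowels, H, W and Y have no digit.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(word):
    """Return the 4-character Soundex code, e.g. a letter plus three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")  # vowels give "" and so break up repeats
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]  # pad with zeros, then cut to four characters

def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]

print(soundex("Robert"), soundex("Rupert"))  # prints R163 R163
print(levenshtein("kitten", "sitting"))      # prints 3
```

Words that sound alike get the same Soundex code, so the code can be used to pull candidate words from the dictionary. The edit distance can then be used to rank those candidates by how close they are to the misspelled word.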
I am currently working on the next version of this spell checker, in which some of the above-mentioned techniques have already been implemented. I have also compiled a library for Soundex and the Levenshtein distance, which I will post here on the forum shortly. When I think the next version of the spell checker is "ready for testing", I will post the source code in this thread.
I am aware that (at least) WM6 already offers spelling suggestions and a spell checker if Word (Office) is installed, but I still liked the idea as a project, so I said to myself "What the heck - let's try!" In any case, as far as I know, only the dictionary for the language of WM6 itself gets installed, so you cannot spell-check words in other languages.
Please bear in mind that I am a hobby programmer, and I am sure my code can be improved in many ways (in terms of speed, efficiency and clarity), but I am still quite pleased with it.
So far, I have learned two important lessons:
1) "to keep your eyes open" and expect to find all kinds of errors while debugging. The other day, I "lost" 3 hours of precious time because of a stupid error I had made in the code. I was so frustrated that I wanted to send a PM (I never did) to Agraham and tell him that his Collection library was full of bugs, but of course the error was mine: I had forgotten about a string conversion (strToUpper) I had made in my code.
2) to bear in mind that I am developing for a PPC, Smartphone, handheld (or whatever they are called) and that there is a huge difference in speed and available hardware resources compared to a normal desktop PC. One critical part of the first version posted here is loading the dictionary. The attached dictionary has nearly 70000 words. In my first "internal" version it loaded very fast on my PC, while on my PPC (a Samsung i780, which has a powerful processor and lots of free memory) it took AGES! This was because I read the dictionary line by line instead of using "FileReadToEnd". Another critical part is comparing the words being checked against the words in the dictionary. At first I used a For-Next loop over the dictionary, and it took ages. Then I saw that Agraham's Collection library offers an IndexOf search on arrays, and everything became so much faster! This time I wanted to send him a PM to praise him for the libraries he provides us with (this of course goes for everyone else writing libraries for the Basic4PPC community).
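The two speed-ups from lesson 2 can be sketched like this in Python. Note the assumptions: a Python set (a hashed lookup) stands in for the library's IndexOf search, whose internals I am not describing here, and `load_dictionary` and `is_known` are hypothetical names. On a desktop with Python the difference between the two ways of reading is small, so this shows the idea rather than the timings seen on the device.

```python
import os
import tempfile

def load_dictionary(path):
    """Read the whole word list in one call, then split it in memory."""
    with open(path, encoding="utf-8") as f:
        text = f.read()  # one bulk read instead of one read per line
    # Normalise case once, here, so every later lookup can rely on it.
    return {line.strip().upper() for line in text.splitlines() if line.strip()}

def is_known(word, words):
    """Single membership test; replaces a loop over all dictionary entries."""
    return word.upper() in words

# Demo with a tiny throwaway word list standing in for the real dictionary.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("apple\nbanana\ncherry\n")
words = load_dictionary(tmp.name)
os.remove(tmp.name)

print(is_known("Banana", words), is_known("bananna", words))
# prints True False
```

Doing the upper-case conversion in exactly one place also helps avoid the kind of mistake described in lesson 1, where a forgotten conversion makes lookups fail for no obvious reason.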
The version posted here works quite well and loads and runs quickly on both my PC and my PPC, especially the compiled version. The included dictionary is an English one, but you can find free dictionaries for other languages on the Internet (for instance, have a look at the page titled "Word lists - download wordlists for free - language dictionary translation cracking passwords"; despite the description, it actually has really good dictionaries online. There are other sites as well - try Google). Please rename the dictionary to "Eng_Dict.txt" and keep it in the application directory, or adjust the name in the source code. You also need Agraham's excellent Collection library.
Please feel free to use or modify the code posted here in your own applications. If you improve the code or add useful options, please post the modified/added part here so all of us can benefit from your improvements (this is also why I wrote this post in the Open-Source part of the forum).
NEW VERSION WITH SUGGESTIONS:
I have attached a new version of the spell checker to this post - this time with suggestions. For the time being, I have not included the source, only the executable for Windows (desktop). The zip file also contains an English word list (other languages will be added later) and some support files. You also need Agraham's StringsEx library (included). All these files must be in the same directory as the executable in order to run the program. After the first run of the program, an additional file will be created in the same directory.
This version uses Soundex to create suggestions, but not only that: it also takes common typing errors into consideration (based on the so-called "Near Miss Strategy" introduced by one of the first spell checkers on the market, namely Ispell for UNIX, whose roots date back to 1971). I have included a text file called Test.txt which you can load for testing purposes. Overall, I think the spell checker performs quite well, both in terms of suggesting valid replacements and in terms of speed.
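As an illustration of the near-miss idea, here is a small Python sketch. It assumes the strategy as it is commonly described: try every word that is one deletion, one transposition of adjacent letters, one replacement or one insertion away from the misspelled word, and keep those found in the dictionary. The names `near_misses` and `suggest` are hypothetical. This is not the code of the attached executable, and it does not show how that program combines near misses with Soundex.

```python
import string

def near_misses(word, alphabet=string.ascii_lowercase):
    """All strings one typing error away from word (the word itself excluded)."""
    word = word.lower()
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {l + r[1:] for l, r in splits if r}
    transpositions = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replacements = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    insertions = {l + c + r for l, r in splits for c in alphabet}
    return (deletions | transpositions | replacements | insertions) - {word}

def suggest(word, dictionary):
    """Keep only the near misses that are real dictionary words."""
    words = {w.lower() for w in dictionary}
    return sorted(near_misses(word) & words)

# Hypothetical mini dictionary, just for the demo.
print(suggest("teh", ["the", "ten", "tea", "cat"]))
# prints ['tea', 'ten', 'the']
```

The number of candidates grows with the word length and the alphabet size, so on a slow device it may pay to look the candidates up in a fast structure instead of scanning the whole word list for each one.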
Performance on the PPC, however, is another story. I am currently trying out different ways of coding in order to boost performance on my PPC, but I still have a lot of testing to do. In the next few days I will post the source code, structured differently, and maybe someone can give me a hand speeding it up on the PPC.
Please remember that the user-interface is only present to facilitate the testing of the code and therefore I am not putting any effort into cosmetic issues. Once the code has been finished, the idea is to include it in another project that may require spell-checking. I might one day make a library out of it.
Please let me know what you think.
Update - March 5, 2009:
This project has now been transformed into a library.
Input, suggestions for improvements, bug reports, test/speed results and small talk are welcome!
rgds,
moster67