1B1.  Directories and Files used for aspell (at UVSI)
      .aspell.conf personal configuration file example
1C1.  Operating Instructions for aspell (example for UVSI)
1D1.  Master plan to create a technical word-list for aspell
1E1.  Extracting all words from the aspell dictionaries
      - converting to a text file for matching with user words
1E2.  uvhd hex dumps of aspell dictionaries to illustrate binary format
1E3.  Operating Instructions to extract dictionary words and sort
1F1.  Operating Instructions to extract words-used from all user documentation
      - reads all files in directory and sorts
      - writes out multiple words per line to about column 70
1G1.  Illustrated usage of wordxtrct1 and discussion of other uses, such as
      extracting words from binary programs (version# for example).
1G2.  Illustrated usage of wordsort1 and discussion of other uses, such as
      word count analysis.
1H0.  Summary of uvcopy jobs used in this part
      - listings of some jobs in case you wish to inspect the code
        or modify for your own purposes
Goto: Begin this doc , End this doc , Index this doc , Contents this library , UVSI Home-Page
This application should appeal to anybody whose technical text documentation has never been spell checked because of the false alerts caused by technical terms. Some of the components can also be used as standalone general purpose tools to extract words from binary files, or to do word count analysis on your documents.
Both 'aspell' and 'ispell' are interactive spell checkers for text files on linux or unix systems. They highlight misspelled words & allow you to enter a replacement, or even better to pick a suggested replacement by number from a list of alternatives. Or you may enter 'I' (Ignore) to inhibit reporting an error for that word for the remainder of the current document. You may find aspell documentation at https://www.aspell.net.
The major problem with using spell checkers for technical documentation is the large number of technical words and acronyms that get reported as errors. You can use the 'I' command to ignore a word for the remainder of any one document, but this gets very tedious when you have dozens or hundreds of files to check.
Aspell provides part of the solution by allowing you to specify a 'word-list' file of words that are to be added to the aspell supplied master dictionary.
However it is still tedious to prepare the word-list file. On each document you discover more technical words which you write down, for updating the word-list, before spell checking the next document.
I will later show you how to automatically generate the wordlist file of technical words from all your documents, but first I will illustrate how I use aspell at my site.
I use the 'vi' editor to maintain my documentation (over 100 text files in the /home/uvadm/doc subdirectory). After significant updates with 'vi', I automatically convert the text files to HTML files and FTP them to my web site. Please see HTMLjobs.htm if you wish to see how legacy text files can be converted to HTML automatically.
/home/uvadm
:------------>.aspell.conf          - personal config file (in home dir)
:-----ctl
:     :------>aspell_words_ok.txt   - personal wordlist to add to dictionary
:     :                             - prepared with vi & input to 'aspell create'
:     :------>aspell_words_ok       - personal wordlist converted for aspell use
:-----doc
:     :------>text-documents        - maintained with 'vi'
:     :------>...120 files...       - spell checked with aspell
:     :------>text-documents        - converted to HTML for FTP to web site
:-----docbak
:     :------>...backups...
:-----dochtml
:     :------>...HTML versions...
# .aspell.conf - personal configuration file for aspell
#              - to check UVdoc spelling - by OT UVSI May 2004
# effective file stored at: /home/uvadm/.aspell.conf
# backup file stored at:    /home/uvadm/ctl/aspell.conf
#
# vi ctl/aspell_words_ok.txt    <-- create text file of technical words
# ==========================
#
# aspell create personal ctl/aspell_words_ok < ctl/aspell_words_ok.txt
# ====================================================================
#  - convert text file to binary format required by aspell
#
# aspell -x check filename.doc  <-- invoke aspell
# ============================
#
ignore-case
run-together
personal /home/uvadm/ctl/aspell_words_ok
#
#0. login uvadm --> /home/uvadm

#1. cp -r doc docbak    <-- backup all doc files (see -x option below)
    ================      - once before aspell on 100 files

#2. vi ctl/aspell_words_ok.txt  <-- create/update text file of technical words
    ==========================    - as necessary during 100 file checks

#3. aspell create personal ctl/aspell_words_ok < ctl/aspell_words_ok.txt
    ====================================================================
    - convert text file to binary format required by aspell

#4. aspell -x check filename.doc  <-- invoke aspell
    ============================

#5. diff docbak/filename.doc doc/filename.doc | more
    ================================================
    - verify aspell actions (optional)
When I first started to spell-check my documents, I used 'vi' to create 'aspell_words_ok.txt', which is then converted to the binary file version by the 'aspell create personal' command.
For the first version of aspell_words_ok.txt, I entered all the technical words I could think of, that I knew I had used in my documents.
But on each aspell session, I found more technical words, that had to be written down, for subsequent updating & recreation of the binary file version at the end of the session.
That was when I decided there had to be a better way. That 'better way' is described on the following pages.
Here is an overview of the plan to automatically generate the word-list from your documents and drop out the properly spelled words by matching to words extracted from the aspell master dictionary. You then specify the resulting file as your personal word-list in the aspell personal configuration file.
We will accomplish our task using the 'uvhd' utility and three pre-programmed uvcopy jobs (wordsort1, wordxtrct1, and wordmerge1). We will provide the exact operating instructions later; first we present an overview and some explanations.
The previous page outlined our plan to automatically create the technical word-list for aspell from our documents and the aspell dictionaries. The first task is to convert the binary aspell dictionaries to a text file of words separated by spaces (not nulls), so we can match to our words-used list.
The aspell dictionaries can be found at /usr/lib/aspell. These are binary files with the words separated by nulls (not spaces). They also contain a lot of phonetic control codes that we will omit. Fortunately, the words are grouped in contiguous blocks near the beginning of the file.
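To make the null-separated layout concrete, here is a minimal Python sketch of the same kind of conversion. It is not the wordxtrct1 job itself. The function names, the letters-and-apostrophes filter and the sample bytes are assumptions made for illustration only.

```python
import re
import textwrap

def extract_words(raw, min_len=2):
    """Split raw bytes on NUL separators and keep chunks that look like words.

    Assumption: a 'word' is letters and apostrophes only, so chunks holding
    control codes are dropped. Words are translated to lower case.
    """
    words = []
    for chunk in raw.split(b"\x00"):
        text = chunk.decode("latin-1").lower()
        if len(text) >= min_len and re.fullmatch(r"[a-z']+", text):
            words.append(text)
    return words

def wrap_words(words, width=70):
    """Join words with spaces and wrap the lines at about column 70."""
    return textwrap.wrap(" ".join(words), width=width,
                         break_long_words=False, break_on_hyphens=False)

# Hypothetical sample: three words, one control-code chunk, one 1-char word
sample = b"\x00Hauser's\x00refinances\x00catchall\x00\x01\x02\x00a\x00"
for line in wrap_words(extract_words(sample)):
    print(line)
```

The control-code chunk and the single-character word are dropped, which mirrors the idea of omitting the phonetic codes and very short words.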
We can use 'uvhd' to locate the beginning and ending block numbers. Then we can reposition to the first block of words and use the 'w'rite command to write out the calculated number of blocks. Here are the aspell dictionary files (and the beginning and ending block numbers of the words) that I found on my system (Red Hat Enterprise Linux 3.0).
                                  words    words    write
aspell filename      total-blks   start      end   blocks
==============================================================
american-med-only  |      592  |     17  |    217 |    201 |
english-med-only   |    15824  |     17  |   5685 |   5669 |
english-variant-0  |       96  |     17  |     30 |     14 |
english-variant-1  |      400  |     17  |    119 |    103 |
english-variant-2  |      400  |     17  |    125 |    109 |
Here are the uvhd operating instructions to write out the words from the biggest file. Note that uvhd writes output files to the 'tmp' subdirectory using the same name, with a date/time stamp appended.
uvhd /usr/lib/aspell/english-med-only r256
==========================================
--> 17      <-- determine start of words (block 17)
--> 5685    <-- browse to end of words (block 5685)
--> 17      <-- reposition to start of words
--> w5669   <-- write out dictionary words (5685-17+1=5669)
--> q       <-- quit uvhd

ls -l tmp   <-- observe filename written by uvhd
=========
tmp/english-med-only.yymmddhhmmW  <-- note date/time stamped file in tmp
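The block arithmetic can be checked with a short Python sketch. The record size of 256 and the block numbers come from the table and the uvhd commands in this section. The function names are hypothetical, and copy_blocks is only an assumed equivalent of the uvhd 'w'rite command, not its actual implementation.

```python
RSIZE = 256  # record size given to uvhd as 'r256'

def byte_offset(rec_no, rsize=RSIZE):
    """Byte offset of a record, counting records from 1 as uvhd does."""
    return (rec_no - 1) * rsize

def blocks_to_write(start, end):
    """Number of blocks from start to end, counting both ends."""
    return end - start + 1

def copy_blocks(src_path, dst_path, start, end, rsize=RSIZE):
    """Copy blocks start..end of src_path to dst_path (assumed equivalent
    of positioning to 'start' and writing with 'w<count>')."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(byte_offset(start, rsize))
        dst.write(src.read(blocks_to_write(start, end) * rsize))

print(byte_offset(17))            # offset of the first block of words
print(blocks_to_write(17, 5685))  # count to give the 'w' command
```

Both ends of the range are included, so the count is end minus start plus one. The same rule gives the 'write blocks' column for each of the five dictionary files.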
The next page illustrates the 'uvhd' browsing and writing --->
See the uvhd instructions for all five files two pages ahead --->
uvhd /usr/lib/aspell/english-med-only r256
==========================================

                      10        20        30        40        50        60
r#        1 0123456789012345678901234567890123456789012345678901234567890123
          0 aspell rowl 1.3.......=..0...G..{....p.......P.......s..........
            677666276762323001000C30031094007100071010000510AC00D7000D000000
            1305CC02F7C01E30000000D000604720BB400000E00000105D00731000508000
         64 ................english.phonet.1.1..............................
            00000000FFFF0000666667607666670323000000000000000000000000000000
            70004000FFFF10005E7C938008FE5401E1000000000000000000000000000000
        128 ................................................................
            0000000000000000000000000000000000000000000000000000000000000000
            0000000000000000000000000000000000000000000000000000000000000000

--> 17    <-- position to block #17
                      10        20        30        40        50        60
r#       17 0123456789012345678901234567890123456789012345678901234567890123
       4096 .Hauser's.refinances.catchall.subtrahend.Brandea's.rarebits.petr
            0467767270766666666706676666607767766666047666662707676667707677
            08153527302569E1E3530314381CC035242185E40221E4517302125294300542
         64 ologist's.tolls.fodders.allovers.hearse.disincline.millpond's.mi
            6666677270766670666667706666767706667760667666666606666766627066
            FCF79347304FCC306F4452301CCF6523085123504939E3C9E50D9CC0FE4730D9
        128 strial's.grandfathered.Grenier's.empirical.domestication.carnall
            7776662706766666766766047666672706676766660666677666766606676666
            34291C730721E4614852540725E9527305D092931C04FD53493149FE0312E1CC
        192 y.synonymy.inadequacies.Palmerston's.Mundt's.hearts.tone's.allot
            7077666767066666776666705666677766270476672706667770766627066667
            9039EFE9D909E1451513953001CD5234FE730D5E4473085124304FE57301CCF4

--> w5669    <-- write 5669 blocks
                      10        20        30        40        50        60
r#     5685 0123456789012345678901234567890123456789012345678901234567890123
    1455104 ract.Ryley's.countersign.annihilations.unrulier.venal.summerhous
            7667057667270667676776660666666667666707677666707666607766676677
            2134029C597303F5E452397E01EE989C149FE305E25C952065E1C035DD528F53
         64 es.telltales.Tome's.overprinted.refinanced.tomboy.thousandfold.F
            6707666766670566627067677766766076666666660766667076677666666604
            53045CC41C5304FD5730F652029E45402569E1E35404FD2F9048F531E46FC406
        128 loyd's.Cthrine's.infant's.mistress's............................
            6676270476766627066666727066777677270000000000000000000000000000
            CF9473034829E57309E61E4730D9342533730000000000000000000000000000
        192 ................................................................
            0000000000000000000000000000000000000000000000000000000000000000
            0000000000000000000000000000000000000000000000000000000000000000

w5669 5669 written, tmp/english-med-only.0405241908W

rec#=5685 rcount=15824 rsize=256 fsize=4050944 /usr/lib/aspell/english-med-only
null=next,r#=rec,s=search,u=update,p=print,i=iprint,w=write,t=tally,c=checkseq
,R#=Recsize,h1=char,h2=hex,q=quit,?=help --> q    <-- quit
Please refer back to page '1E1' for the list of aspell dictionary files and the relevant start/end block numbers where the contiguous words are found.
#1a. rm -f tmp/*        <-- clear out the 'tmp' subdir

#1b. mkdir tmp1 tmp2    <-- make 2 additional tmp subdirs

#2a. uvhd /usr/lib/aspell/american-med-only r256
     ===========================================
     --> 17      <-- position to start of words
     --> w201    <-- write out dictionary words
     --> q       <-- quit uvhd

#2b. uvhd /usr/lib/aspell/english-med-only r256
     ==========================================
     --> 17
     --> w5669
     --> q

#2c. uvhd /usr/lib/aspell/english-variant-0 r256
     ===========================================
     --> 17
     --> w14
     --> q

#2d. uvhd /usr/lib/aspell/english-variant-1 r256
     ===========================================
     --> 17
     --> w103
     --> q

#2e. uvhd /usr/lib/aspell/english-variant-2 r256
     ===========================================
     --> 17
     --> w109
     --> q

#3. ls -l tmp    <-- observe filename written by uvhd
    =========
    tmp/american-med-only.yymmddhhmmW
    tmp/english-med-only.yymmddhhmmW
    tmp/english-variant-0.yymmddhhmmW
    tmp/english-variant-1.yymmddhhmmW
    tmp/english-variant-2.yymmddhhmmW

#4. cat tmp/* >tmp1/aspell_dict_raw

#5. uvcopy wordxtrct1,fili1=tmp1/aspell_dict_raw,filo1=tmp1/aspell_dict_text
    ========================================================================
    - convert the binary file to a text file
    - space separated multiple words per line to about column 70

#6. uvcopy wordsort1,fili1=tmp1/aspell_dict_text,filo1=tmp2/aspell_dict_sorted
    ==========================================================================
    - sort the dictionary words and write output in the same format
    - space separated multiple words per line to about column 70
Please refer back to our master plan on page '1D1'. We have just completed the conversion of aspell dictionary files to a text file and we are now ready to extract the 'words-used' file from all files in our doc directory.
Here are the operating instructions to automatically create a technical word-list from your documents and the aspell master dictionary.
#1. uvcopy wordsort1,fild1=doc,filo1=tmp2/aspell_words_used
    =======================================================
    - create file of all words used in all your text files
      (technical words, properly spelled words, and misspelled words)
    - sorted and duplicates removed

#2. uvcopy wordmerge1,fili1=tmp2/aspell_words_used,fili2=tmp2/aspell_dict_sorted
    ============================================================================
    ,filo1=tmp2/aspell_words_ok.txt
    ===============================
    - sort/merge words_used with the aspell dictionary, drop duplicates
    - write out only the unmatched words from the words_used input file

#3. cp tmp2/aspell_words_ok.txt ctl   <-- copy to permanent subdir (ctl)
    ===============================       for extended use

#4. vi ctl/aspell_words_ok.txt   <-- drop the misspelled words
    ==========================       retaining the technical words
                                     (to be considered OK)

#5. aspell create personal ctl/aspell_words_ok < ctl/aspell_words_ok.txt
    ====================================================================
    - convert the text file to the binary format required by aspell

#6. aspell -x check doc/filename.doc   <-- invoke aspell for each file
    ================================     - repeat for 100 files
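The matching idea in step #2 can be sketched in Python as a merge of two sorted word lists. This is only an illustration of the technique, not the code of the wordmerge1 job. The function name and the sample words are assumptions, and both inputs are assumed to be sorted in the same sequence.

```python
def unmatched_words(words_used, dictionary):
    """Return words from words_used that do not appear in dictionary.

    Both inputs must already be sorted ascending. Duplicates in
    words_used are dropped. The dictionary is read only once.
    """
    result = []
    dict_iter = iter(dictionary)
    dict_word = next(dict_iter, None)
    previous = None
    for word in words_used:
        if word == previous:
            continue                      # drop duplicate of words_used
        previous = word
        # advance the dictionary until it catches up with the current word
        while dict_word is not None and dict_word < word:
            dict_word = next(dict_iter, None)
        if dict_word != word:
            result.append(word)           # not in dictionary, keep it
    return result

# Hypothetical sample data
used = ["abend", "ability", "able", "uvcopy", "uvcopy", "uvhd"]
dictionary = ["ability", "able", "about", "zebra"]
print(unmatched_words(used, dictionary))
```

The words that survive are the technical words plus any misspelled words, which is why step #4 asks you to review the file with vi before creating the binary word-list.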
'wordxtrct1' was executed on page '1E3' as part of the plan to create an aspell technical word-list to enable easier spell checking of user documents. 'wordxtrct1' is also intended as a general purpose pre-programmed job that you might find useful for applications other than spell-checking.
The wordxtrct1 command is re-executed here to illustrate the options and show some of the output file. We will then discuss other possible uses.
#5. uvcopy wordxtrct1,fili1=tmp1/aspell_dict_raw,filo1=tmp1/aspell_dict_text
    ========================================================================

uop=d1t1 - option defaults
    d1 - delete single character words
    d2 - delete 2 character words (d0=no deletes, d3=max char deletes)
    t1 - translate to lower case (t0 do not translate)
User OPtion (uop) defaults  = q1d1t1
null to accept or re-specify (1 or more) -->      <-- null to accept defaults
psychoanalyzed nasalizing compartmentalization specializing epitomize succored leukemia glamorization fossilized temporizing's eyer polymerization's raveling flavor's criminalization vulcanization's motorizing succorer criticizinglies resymbolizations colorfastnesses sensitize homeostatic vialed unmechanizes programming colorfastness colorfully popularization organizing donutting individualizer diagonalize analogized anesthesia's analogizes revisualizes centerboard cenobites evangelizing theorizer behoove modernizations apologized mesmerizer's
This is just the first few lines of 22,000 total lines and 160,000 total words. You will notice that the words are not in sequence. I think the placement is determined by the phonetic algorithms used by aspell. The next step (#6) in the operating instructions on page '1E3' will sort them into sequence.
You might find other uses for wordxtrct1. For example, you might want to extract text information from an executable program, such as version number, or help information. You could try this on some of the unix/linux bin programs:
1. uvcopy wordxtrct1,fili1=/bin/cat,filo1=tmp/cat.text
   ===================================================

2. vi tmp/cat.text
   ===============
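For readers without the Vancouver Utilities, the same idea can be tried with a few lines of Python. This sketch is not wordxtrct1. It simply collects runs of printable ASCII characters and keeps the ones that look like they hold a version number. The function names, the minimum run length of 4 and the sample bytes are all assumptions for illustration.

```python
import re

def printable_runs(data, min_len=4):
    """Return runs of printable ASCII (space through tilde) of min_len or more."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

def version_strings(data):
    """Keep only the runs that contain something like '1.2' or '1.2.3'."""
    return [s for s in printable_runs(data) if re.search(r"\d+\.\d+", s)]

# Made-up sample bytes standing in for a binary program
sample = (b"\x7fELF\x00\x01\x02demo-tool version 1.2.3\x00\x03ab\x00"
          b"Usage: demo-tool [OPTION] [FILE]\x00")
print(printable_runs(sample))
print(version_strings(sample))
```

To try it on a real program, read the file in binary mode, for example open("/bin/cat", "rb").read(), and pass the bytes to version_strings.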
'wordsort1' was executed on page '1F1' as part of the plan to create an aspell technical word-list to enable easier spell checking of user documents. 'wordsort1' is also intended as a general purpose pre-programmed job that you might find useful for applications other than spell-checking.
The wordsort1 command is re-executed here to illustrate the options and show some of the output file. We will then discuss other possible uses.
#1. uvcopy wordsort1,fild1=doc,filo1=tmp2/aspell_words_used
    =======================================================

uop=a1c65d1m2n30s0t1 - option defaults
    a1  - accept alpha characters (default for aspell)
    a2  - accept numerics (use a3 for alphanumeric a1+a2)
    a4  - accept punctuation (use a7 for all chars a1+a2+a4)
    c65 - max column exceeded to output multi-word lines
    d1  - drop 1 character words (d0=do NOT drop)
    m2  - minimum word length (drop if less)
    n30 - maximum word length (drop if more)
    s1  - statistics, word count appended at end each word(9)
    s0  - statistics turned off
    t1  - translate to lower case (t0 do not translate)
User OPtion (uop) defaults  = q1a1c65d1m2n30s0t1
null to accept or re-specify (1 or more) -->      <-- null to accept defaults
ab abab abanta abbreviated abbreviation abbreviations abc abcco abcd abcde abcdef abcdefghi abcfile abcxyz abend abended ability able abndcd abnormal abnormally abort aborted aborting about above abrasrt absence absent absolute absolutely abterm abudfil abudrec abudytd ac academic acaps acc accept accept'ed accept's acceptable accepted accepting accepts acceptu access accessed accesses accessible accessing accfix accidental accidentally accname accommodate accommodated accomodate accomodated accompanied accompany accomplish accomplished accordingly account accounted accounting accountno accounts accouts accpare accrdt accreg accross accrual acct acctest acctmas acctmstr acctounting
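The effect of the default options can be approximated in Python as follows. This is a rough sketch of the behaviour described by the option list, not the wordsort1 job. It assumes that a word is a run of letters with optional embedded apostrophes (the sample output above contains words such as accept's), and the function name and sample text are hypothetical.

```python
import re

# Assumed word pattern: letters, with apostrophes allowed inside the word
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")

def words_used(texts, min_len=2, max_len=30):
    """Collect the distinct words from several texts, sorted ascending.

    Mirrors the defaults as described: lower case (t1), drop words shorter
    than 2 characters (m2) or longer than 30 characters (n30).
    """
    found = set()
    for text in texts:
        for word in WORD.findall(text.lower()):
            if min_len <= len(word) <= max_len:
                found.add(word)
    return sorted(found)

# Hypothetical sample documents
docs = ["The uvcopy job reads a file.", "uvcopy's options: a1 c65"]
print(" ".join(words_used(docs)))
```

How the real job treats mixed tokens such as a1 or c65 under the a1 option is not stated here, so this sketch simply keeps the letters and then drops them as one-character words.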
You might use wordsort1 to determine word usage statistics. Please see the example on the next page, where we will rerun the same job shown above, using the statistics option.
We will re-execute 'wordsort1' with the 'statistics' option, which appends the duplicate count onto the end of each word.
We will specify the 's1' option by appending ',uop=s1' onto the command line. Alternatively we could enter 's1' at the options prompt (see previous page).
#1. uvcopy wordsort1,fild1=doc,filo1=tmp2/aspell_words_used,uop=s1
    ========================================================******
ab(89) abab(2) abanta(2) abbreviated(1) abbreviation(2) abbreviations(4) abc(66) abcco(4) abcd(5) abcde(1) abcdef(4) abcdefghi(12) abcfile(1) abcxyz(10) abend(18) abended(1) ability(7) able(30) abndcd(1) abnormal(3) abnormally(14) abort(1) aborted(17) aborting(1) about(139) above(965) abrasrt(1) absence(38) absent(148) absolute(14) absolutely(1) abterm(56) abudfil(4) abudrec(2) abudytd(3) ac(6) academic(3) acaps(4) acc(72) accept(432) accept'ed(1) accept's(7) acceptable(1) accepted(9) accepting(1) accepts(13) acceptu(1) access(151) accessed(11) accesses(12) accessible(3) accessing(8) accfix(6) accidental(6) accidentally(4) accname(3) accommodate(8) accommodated(2) accomodate(2) accomodated(1) accompanied(2) accompany(2)
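A word count in the same word(count) format can be sketched in Python with collections.Counter. This only imitates the output format shown above and is not the wordsort1 job. The word pattern, function name and sample sentence are assumptions.

```python
import re
from collections import Counter

def word_counts(text):
    """Return 'word(count)' items in word sequence, separated by spaces."""
    # Assumed word pattern: letters with optional embedded apostrophes
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(words)
    return " ".join(f"{word}({n})" for word, n in sorted(counts.items()))

# Hypothetical sample text
print(word_counts("Abort the job. The job aborted; abort!"))
```

Sorting the items by count instead of by word, for example with counts.most_common(), would show the most frequently used words first.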
wordxtrct1
wordsort1
wordmerge1
If you have the Vancouver Utilities installed on your machine, you can view or list as follows (using wordxtrct1 as an example).
vi /home/uvadm/pf/util/wordxtrct1       <-- inspect with vi
=================================

uvlp12 /home/uvadm/pf/util/wordxtrct1   <-- list at 12 cpi
=====================================