GREP and related tools for CPDP users

The following commands work in the unix bash shell just as well as in TextWrangler or other such programs, but the focus is on unix.

In the following, 'regex' stands for any well-formed regular expression; see the TextWrangler manual, 'man grep' under unix, or search the web for how to formulate correct regexes. The basic units of regexes are regular characters that match themselves ('a' matches all instances of 'a'), plus special characters:

- wildcards and anchors: '.' for any character, '^' for the beginning of a line, and '$' for the end of a line
- positions: '\b' for the edge of a word, '\B' for any position that is not the edge of a word
- operators: '?' for 'zero or one instance of the preceding character', '*' for 'zero or more instances of the preceding character', '+' for 'one or more', '{2,4}' for 'at least two but not more than four', as in 'aa', 'aaa', 'aaaa', and '|' for 'or'. (Use '(' and ')' to define the scope of operators, e.g. '(ab){2,3}' finds 'abab' and 'ababab', not 'aa', 'aaa' etc.)
- classes: '[ptkbdgmanŋyrls]' for 'some consonants' (any one of the listed characters), '\W' for any non-word character, '\n' for a line break

Important: there are two kinds of grep: grep and egrep (egrep is the same as 'grep -E'). They do pretty much the same, but the special characters ?, |, +, (, ), {, } behave differently.

Under grep, these symbols need a backslash in order to have their special meaning:

    grep 'a?'      matches the literal string 'a?'
    grep 'a\?'     matches zero or one 'a', i.e. 'a' is optional (so 'ba\?c' matches 'bc' and 'bac')

Under egrep, this is reversed:

    egrep 'a?'     matches zero or one 'a', i.e. 'a' is optional (so 'ba?c' matches 'bc' and 'bac')
    egrep 'a\?'    matches the literal string 'a?'

Note that an optional 'a' on its own also matches the empty string, so 'a\?' (grep) or 'a?' (egrep) without further context matches every line.

NB: the symbols ., *, ^, \, etc. have a special function in both grep and egrep ('-' only inside a class such as '[a-z]'), and if one wants to match them literally (i.e. find actual instances of '*'), they need to be backslashed in both grep and egrep:

    grep '\*'

etc.

In the Unix shell, indicate the file in which you want to search either by:

    cat my_file.txt | egrep 'regex'

or by:

    egrep 'regex' my_file.txt

With wildcard filenames, the first option strings all files together for searching, e.g.
(where the option -c results in counts instead of a list of all matches; cf. below)

    cat *.txt | egrep -c 'regex'     (counts how many lines contain 'regex' in all *.txt files taken together)
    egrep -c 'regex' *.txt           (counts the matching lines per file)

In TextWrangler, indicate the file or set of files using the 'multiple file' tab and only type the regex into the search field.

The results of grep are normally printed on the terminal, but they can be saved anywhere else with '>':

    cat my_file.txt | egrep 'regex' > my_results.txt

Or view the results page by page, by piping to 'more' or 'less', e.g.

    cat my_file.txt | egrep 'regex' | more

(But first, type 'export LESSCHARSET=utf-8' in order to tell 'more' and 'less' that you want unicode symbols displayed as unicode symbols.)

On the Mac, use 'open -f' to display the results in your default text editor:

    cat my_file.txt | egrep 'regex' | open -f

Even better: pipe the output through a 'sed' command that cleans it up in such a way that words are separated by tabs (not spaces), for easy insertion into Word or HTML documents:

    egrep -A1 'regex' my_file.txt | sed -e 's/\\m.. //g' -e 's/- \{1,\}/-/g' -e 's/ \{1,\}-/-/g' | sed -e 's/ \{1,\}/\t/g' | open -f -a 'Jedit X'

Here, 's/\\m.. //g' globally ('/g') substitutes ('s/') the regex '\\m.. ' by nothing ('//'), i.e. it removes the toolbox tier tags '\mgl ' and '\mph ' (remove others as well if you need to). Then we replace a hyphen followed by one or more spaces ('- \{1,\}') and the reverse of that (' \{1,\}-') by a single hyphen ('/-/'), because toolbox inserts unequal numbers of spaces between morphemes. Finally, we replace the remaining spaces between words by tabs ('/\t/'). (NB: '\t' is not understood by every version of sed; it works with sed 4.0.5, installable on the Mac via Fink at http://www.finkproject.org/download/; or in GNU sed, i.e.
gsed, installable via MacPorts, http://www.macports.org)

Some further options (for more, see man grep):

    egrep -C5 'regex' my_file.txt    (prints 5 lines before and after each match, for context)
    egrep --color 'regex'            (highlights the matched string)
    egrep -n 'regex'                 (shows line numbers)
    egrep -c 'regex'                 (counts instead of printing)

Instead of egrep -c '\\gw.*\b[[:alnum:]]\b', you can use 'wc' ('word count'):

    wc *txt

(see below for a ready-made command line counting the total number of words in the entire corpus.)

NB: you can run cat, grep, wc, sed etc. directly on the server. That way, you can be sure you have the latest and most complete corpus. To do this, log in via

    ssh 139.18.14.96

Then navigate to the CPDP folder with

    cd ../../CPDP

and then, within this, cd to the relevant toolbox file directory. There, you simply type the cat, grep etc. commands as usual.

Some useful regexes and command pipes:

Search for all lines that include both "ERG" and "vt" (with "ERG" preceding "vt"):

    egrep -n 'ERG.*vt'

Search for all lines that do not contain "ERG":

    egrep -v 'ERG'

Note: I don't think this can be done easily in TextWrangler. The "exclude" option only pulls out files (not lines) that don't contain 'ERG', and there is no 'Don't find' for multiple files.

Search for all lines that contain "ERG" and count how many of them also contain "TEL":

    cat *.txt | egrep 'ERG' | egrep -c 'TEL'

Search for all lines that contain "vt" but not also "ERG" (i.e. clauses with transitive verbs but no overt ergative):

    cat *.txt | egrep 'vt' | egrep -v 'ERG'     (output data)
    cat *.txt | egrep 'vt' | egrep -vc 'ERG'    (output count only)

Careful: a tag like "vi" may also happen to be part of a word, e.g. "village". To avoid matching this, use 'vi\b', where \b matches the edge of a word; this identifies PoS tags since they are always last in the word.
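To see the effect of the word-edge condition, here is a small sketch on a few invented lines. The sample file content is made up for illustration and is not corpus data; 'grep -E' is used as the equivalent of egrep, and the sketch assumes a grep that understands \b (GNU grep does).

```shell
# Hypothetical sample lines, invented for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
\ps n-ERG vt
\ps n vt
\ps n vi
\eng the village is far
EOF

# Lines with the tag "vt" but without "ERG".
vt_no_erg=$(grep -E 'vt\b' "$sample" | grep -Ev 'ERG')
printf '%s\n' "$vt_no_erg"

# "vi" anywhere in the line versus "vi" at the end of a word.
vi_any=$(grep -Ec 'vi' "$sample")
vi_tag=$(grep -Ec 'vi\b' "$sample")
printf 'vi anywhere: %s, vi at word end: %s\n' "$vi_any" "$vi_tag"

rm -f "$sample"
```

The plain 'vi' also counts the line with "village", while 'vi\b' only counts the line carrying the tag.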
Search for all lines that contain the suffix "ca" and the gloss "TEL" (ideally one would want to only search for lines that contain the suffix "ca" and the gloss "TEL" *at the same position* -- I don't know yet how to do this):

    cat *.txt | egrep -B2 'TEL' | egrep --color -A2 -B2 'ca\b'

We use "-B2" in order to output not only the matching line (i.e. the one containing "TEL") but also the glossing and gw tiers before it ('B' for 'before'). This is then the search space for the next grep command in the pipeline. (The second grep prints its results together with the gloss and the gw tier, and adds the word-edge condition \b so that we don't collect strings like "camce".)

Search for all lines that contain the suffix -k but not the gloss "GEN":

    egrep -A1 '\\mph.*\-k\W' *.txt | egrep -v ' \-GEN ' | egrep -B1 'mgl' | more

NB: the first two grep commands print all lines that contain a -k suffix, because egrep -v ' \-GEN ' only gets rid of mgl lines that contain GEN. It keeps the mph lines because they trivially don't contain GEN. To get rid of mph lines whose gloss line was removed, we keep only matches with 'mgl' (plus the line before each).

Search for all lines that contain the suffix -ni after consonants:

    egrep -A1 -B1 '\\mph.*\-ni ' *.txt | egrep -B1 -A2 '[ptkbdgmnŋsrlyw]ni' | more

where [ptkbdgmnŋsrlyw] defines the set of relevant consonants (here, the possible stem-final consonants in Puma).

Search only among utterances by participant XY, as registered in tier "EUDICOp":

    cat my_file.txt | egrep -A5 "EUDICOp XY" | egrep -c "regex"

We use "-A5" in order to output not only the matching line (i.e. the one with the tag \EUDICOp) but also the 5 tiers following it ('A' for 'after'). This is then the search space for the next grep command in the pipeline. The number of tiers must be set in such a way that all relevant tiers of the same record (transcript + gloss + translation package) are included, and that no line of the next record is included.
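The following sketch shows why the number after -A matters. It uses three invented records (hypothetical forms and glosses, not corpus data) in which exactly three tiers follow each \EUDICOp line; that record layout is an assumption of the example, so check your own files before picking a number.

```shell
# Hypothetical records, invented for illustration: each \EUDICOp line
# is followed by exactly three tiers (\gw, \mgl, \eng).
sample=$(mktemp)
cat > "$sample" <<'EOF'
\ref s1.001
\EUDICOp XY
\gw aaa-bbb
\mgl stone-ERG
\eng with a stone
\ref s1.002
\EUDICOp AB
\gw ccc-bbb
\mgl dog-ERG
\eng the dog
\ref s1.003
\EUDICOp XY
\gw ddd
\mgl go
\eng go
EOF

# -A3 stays inside each record of participant XY.
narrow=$(grep -E -A3 "EUDICOp XY" "$sample" | grep -Ec 'ERG')

# -A8 runs on into the next record, so an ERG uttered by AB is counted too.
wide=$(grep -E -A8 "EUDICOp XY" "$sample" | grep -Ec 'ERG')

printf 'ERG lines for XY with -A3: %s, with -A8: %s\n' "$narrow" "$wide"

rm -f "$sample"
```

With the window that fits the record, only XY's own ERG line is counted; with the oversized window the count is inflated by a line from another participant.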
Count phonological words in the entire corpus (= the current directory), as defined on the \tx tier:

    grep '^\\tx' ./*.txt | wc -w | echo $(($(cat)-$(grep '^\\tx' ./*.txt | wc -l)))

(wc -w also counts the tag at the beginning of each line as a word, so the number of \tx lines is subtracted from the word count.)

Note: for gwords, replace \tx by \gw

Search for all toolbox tags (e.g. to get rid of them):

    egrep '\\\b.{2,3}\b'

(egrep rather than grep, because '{2,3}' is used without backslashes.)

Search for all words with at least two prefixes:

    grep '\\mgl.*[1-3a-zA-Z/>]- [1-3a-zA-Z/>]*- ' *.txt

Search for all examples with POSS in the gloss and a geminate (here: geminate n) in the transcript:

    egrep -B2 "POSS" | egrep -A2 --color "n{2,}"

Search for all CVC verbs with augment -t (-d):

    egrep -A5 '(lex|alt) .*[ptkmnŋ][td]' Chintang-Lex.db | egrep -B4 -A1 '\\ps v' | egrep -B1 -A6 '\\ct .*[ptkmnŋ]ma' | open -f

Same for CV-t/d:

    egrep -A5 '(lex|alt) .*[aeiou][tdr]' Chintang-Lex.db | egrep -B4 -A1 '\\ps v' | egrep -B1 -A6 '\\ct .*[aeiou]ma' | open -f

Print only the glosses, after removing the tags and sorting:

    egrep -A5 '(lex|alt) .*[ptkmnŋ][td]' Chintang-Lex.db | egrep -B4 -A1 '\\ps v' | egrep -B1 -A6 '\\ct .*[ptkmnŋ]ma' | egrep '\\ge' | sed -e 's/\\ge //g' | sort | open -f

----------------------------------------------------------------------------

Unicode problems:

(e)grep causes problems with patterns like \b or \< when combined with unicode strings. Here is an alternative, searching for all words beginning with two velar nasals:

    egrep --color " (ŋ){2,}|^(ŋ){2,}"

NB: there may be match problems caused by non-standard line breaks (CR) inserted by Windows into toolbox files. Solve this by:

    iconv -f utf-8 -t utf-8 -c FILE.NAME | tr -d '\r' > NEWFILE.NAME

Other useful commands in the shell:

    uniq: extract types from tokens (expects sorted input, hence 'sort' first)
    sort: sort

e.g.

    cat xy | grep xy | sort | uniq
    egrep "EUDICOp" CLLDCh3R02S01.txt | sort | uniq
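As a small sketch of these last two points, the lines below first show how a Windows CR hides a match at the end of a line, and then how 'sort | uniq' turns tokens into types. The input lines are invented for illustration (not corpus data), and 'uniq -c', which prefixes each type with its token count, is an addition not mentioned above.

```shell
# A line ending in CR+LF: the CR sits between "vt" and the end of the line,
# so 'vt$' does not match until the CR is removed.
with_cr=$(printf '\\ps vt\r\n' | grep -c 'vt$' || true)
without_cr=$(printf '\\ps vt\r\n' | tr -d '\r' | grep -c 'vt$' || true)
printf 'matches with CR: %s, after tr -d: %s\n' "$with_cr" "$without_cr"

# Hypothetical participant tiers, invented for illustration only.
sample=$(mktemp)
printf '%s\n' '\EUDICOp XY' '\EUDICOp AB' '\EUDICOp XY' '\gw foo' '\EUDICOp XY' > "$sample"

# Types: each distinct participant line once.
n_types=$(grep -E 'EUDICOp' "$sample" | sort | uniq | wc -l | tr -d ' ')

# Tokens per type, most frequent first; keep the count of the top line.
top_count=$(grep -E 'EUDICOp' "$sample" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $1}')

printf 'participant types: %s, tokens of the most frequent one: %s\n' "$n_types" "$top_count"

rm -f "$sample"
```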