
Wordless

Wordless is an integrated corpus tool with multi-language support for the study of language, literature and translation, designed and developed by Ye Lei (叶磊), an MA student in interpreting studies at Shanghai International Studies University (上海外国语大学).

License

Copyright (C) 2018-2019  Ye Lei (叶磊)

This project is licensed under GNU GPLv3.
For details, see: https://github.com/BLKSerene/Wordless/blob/master/LICENSE.txt

All other rights reserved.

Citing

If you publish work that uses Wordless, please cite as follows.

MLA (8th Edition):

Ye, Lei. Wordless, version 1.0, 2019, https://github.com/BLKSerene/Wordless.

APA (6th Edition):

Ye, L. (2019). Wordless (Version 1.0) [Computer software]. Retrieved from https://github.com/BLKSerene/Wordless

GB (GB/T 7714—2015):

叶磊. Wordless version 1.0[CP]. (2019). https://github.com/BLKSerene/Wordless.

Download

Wordless currently supports Windows Vista/7/8/10 and macOS 10.6+, 64-bit only.

Download the latest version for Windows (unzip the file and click Wordless/Wordless.exe to run)
Download the latest version for macOS (unzip the file and click Wordless.app to run)

Chinese users with slow connections to Github can download from Baidu Netdisk (password: 03el).

Download older versions

Documentation

English Documentation

Chinese Documentation (中文文档)

Need Help?

If you encounter a problem, find a bug or require further information, first search the existing issues; if you do not find an answer there, feel free to ask questions, submit bug reports or provide feedback by creating an issue on GitHub.

If you need to share sample texts or other information that cannot be posted publicly, or that you do not want to share publicly, you may send me an email.

Home Page: https://github.com/BLKSerene/Wordless
Documentation: https://github.com/BLKSerene/Wordless#documentation
Email: [email protected]
WeChat Official Account: Wordless

Important Note: I CANNOT GUARANTEE that all emails will always be checked or replied to in time. I WILL NOT REPLY to irrelevant emails, and I reserve the right to BLOCK AND/OR REPORT people who send me spam emails.

Contributing

If you have an interest in helping the development of Wordless, you may contribute bug fixes, enhancements or new features by creating a pull request on Github.

Besides, you may contribute by submitting enhancement proposals or feature requests, writing tutorials or GitHub Wiki pages for Wordless, or helping me translate Wordless and its documentation into other languages.

Donating

If you would like to support the development of Wordless, you may donate via PayPal, Alipay or WeChat.

[Images: PayPal, Alipay, WeChat donation QR codes]

Important Note: I WILL NOT PROVIDE refund services, private email/phone support, information concerning my social media, guarantees on bug fixes, enhancements, new features or new releases of Wordless, invoices, receipts or detailed weekly/monthly/yearly/etc. spending reports for donations.

Acknowledgments

Wordless stands on the shoulders of giants. Thus, I would like to extend my thanks to the following open-source projects:

General

  1. Python by Guido van Rossum, Python Software Foundation
  2. PyQt by Riverbank Computing Limited

Natural Language Processing

  1. jieba by Sun Junyi
  2. nagisa by Taishi Ikeda (池田大志)
  3. NLTK by Steven Bird, Liling Tan
  4. pybo by Hélios Drupchen Hildt
  5. pymorphy2 by Mikhail Korobov
  6. PyThaiNLP by Wannaphong Phatthiyaphaibun (วรรณพงษ์ ภัททิยไพบูลย์)
  7. SacreMoses by Liling Tan
  8. spaCy by Matthew Honnibal, Ines Montani
  9. Underthesea by Vu Anh

Plotting

  1. Matplotlib by Matplotlib Development Team
  2. wordcloud by Andreas Christian Mueller

Miscellaneous

  1. Beautiful Soup by Leonard Richardson
  2. cChardet by Yoshihiro Misawa
  3. chardet by Daniel Blanchard
  4. langdetect by Michal Mimino Danilak
  5. langid.py by Marco Lui
  6. lxml by Stefan Behnel
  7. NumPy by NumPy Developers
  8. openpyxl by Eric Gazoni, Charlie Clark
  9. PyInstaller by Hartmut Goebel
  10. python-docx by Steve Canny
  11. requests by Kenneth Reitz
  12. SciPy by SciPy Developers
  13. xlrd by Stephen John Machin

Data

  1. grk-stoplist by Annette von Stockhausen
  2. lemmalist-greek by Michael Stenskjær Christensen
  3. Lemmatization Lists by Michal Boleslav Měchura
  4. Stopwords ISO by Gene Diaz

Documentation - English

Main Window [Back to Contents]

The main window of Wordless is divided into several sections:

  1. Menu Bar

  2. Work Area:
    The Work Area is further divided into the Results Area on the left side and the Settings Area on the right side.
    You can click on the tabs at the top to toggle between different panels.

  3. File Area:
    The File Area is further divided into the File Table on the left side and the Settings Area on the right side.

  4. Status Bar:
    You can show/hide the Status Bar by checking/unchecking Menu → Preferences → Show Status Bar.

File Area [Back to Contents]

In most cases, the first thing to do in Wordless is open and select your files to be processed via Menu → File or by clicking the buttons residing under the File Table.

Files are selected by default after being added to the File Table. Only selected files will be processed by Wordless. You can drag and drop files around the File Table to change their order, which will be reflected in the results produced by Wordless.

By default, Wordless will try to detect the language, text type and encoding of each file; you should nevertheless check and make sure that the settings of each and every file are correct. If you do not want Wordless to detect the settings for you and prefer setting them manually, you can change the settings in Auto-detection Settings in the Settings Area.
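
For illustration, here is a minimal sketch of such auto-detection built on chardet and langdetect, two of the libraries listed in the Acknowledgments. It is only a sketch of the idea, not Wordless's actual detection code:

```python
# A minimal sketch of encoding/language auto-detection with chardet and
# langdetect (illustrative only; not Wordless's actual detection logic).
import chardet
from langdetect import detect

def detect_file_settings(path):
    with open(path, 'rb') as f:
        raw = f.read()

    # Guess the encoding from the raw bytes.
    encoding = chardet.detect(raw)['encoding'] or 'utf_8'
    text = raw.decode(encoding, errors='replace')

    # Guess the language from the decoded text (e.g. 'en', 'zh-cn').
    return {'encoding': encoding, 'language': detect(text)}

print(detect_file_settings('corpus.txt'))
```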

  1. Add File(s):
    Add one single file or multiple files to the File Table.

    * You can use the Ctrl key (Command key on macOS) and/or the Shift key to select multiple files.

  2. Add Folder:
    Add all files in the folder to the File Table.

    By default, all files in subfolders (and subfolders of subfolders, and so on) will also be added to the File Table. If you do not want to add files in subfolders to the File Table, uncheck Folder Settings → Subfolders in the Settings Area.

  3. Reopen Closed File(s):
    Add the file(s) that were closed last time back to the File Table.

    * The history of all closed files will be erased upon exit of Wordless.

  4. Select All:
    Select all files in the File Table.

  5. Invert Selection:
    Select all files that are not currently selected and deselect all currently selected files in the File Table.

  6. Deselect All:
    Deselect all files in the File Table.

  7. Close Selected:
    Remove all currently selected files in the File Table.

  8. Close All:
    Remove all files in the File Table.

Overview [Back to Contents]

In Overview, you can check/compare the language features of different files.

  1. Count of Paragraphs:
    Number of paragraphs in each file. Each line in the file will be counted as one paragraph. Blank lines and lines containing only spaces, tabs and other invisible characters are ignored.

  2. Count of Sentences:
    Number of sentences in each file. Wordless will automatically apply the built-in sentence tokenizer according to the language of each file in order to calculate the number of sentences in each file. You can change the sentence tokenizer settings via Menu → Preferences → Settings → Sentence Tokenization → Sentence Tokenizer Settings.

  3. Count of Tokens:
    Number of tokens in each file. Wordless will automatically apply the built-in word tokenizer according to the language of each file in order to calculate the number of tokens in each file. You can change the word tokenizer settings via Menu → Preferences → Settings → Word Tokenization → Word Tokenizer Settings.

    You can specify what should be counted as a "token" via Token Settings in the Settings Area.

  4. Count of Types:
    Number of token types in each file.

  5. Count of Characters:
    Number of single characters in each file. Spaces, tabs and all other invisible characters are ignored.

  6. Type-Token Ratio:
    Number of token types divided by number of tokens.

  7. Type-Token Ratio (Standardized):
    Standardized type-token ratio. Each file is divided into several sub-sections, each consisting of 1000 tokens by default, and the type-token ratio is calculated for each sub-section. The standardized type-token ratio of each file is then the average of the type-token ratios over all its sub-sections (see the sketch at the end of this section). You can change the number of tokens in each sub-section via Generation Settings → Base of standardized type-token ratio.

    The last sub-section will be discarded if the number of tokens in it is smaller than the base of the standardized type-token ratio, in order to prevent the result from being affected by outliers (extreme values).

  8. Average Paragraph Length (in Sentence):
    Number of sentences divided by number of paragraphs.

  9. Average Paragraph Length (in Token):
    Number of tokens divided by number of paragraphs.

  10. Average Sentence Length (in Token):
    Number of tokens divided by number of sentences.

  11. Average Token Length (in Character):
    Number of characters divided by number of tokens.

  12. Count of n-length Tokens:
    Number of n-length tokens, where n = 1, 2, 3, etc.

[Image: Overview Table]
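
The token statistics above (items 3-7) can be pictured with a minimal sketch. It uses NLTK's default English word tokenizer as a stand-in for Wordless's language-specific tokenizers and token settings, so the exact counts may differ:

```python
# A minimal sketch of type-token ratio (TTR) and standardized TTR (STTR).
from nltk import word_tokenize

def ttr(tokens):
    return len(set(tokens)) / len(tokens)

def sttr(tokens, base=1000):
    # Split the tokens into sub-sections of `base` tokens each.
    sections = [tokens[i:i + base] for i in range(0, len(tokens), base)]

    # Discard the last sub-section if it is shorter than the base,
    # so that a short tail does not skew the average.
    if len(sections) > 1 and len(sections[-1]) < base:
        sections.pop()

    # The STTR is the average of the TTRs over all sub-sections.
    return sum(ttr(section) for section in sections) / len(sections)

tokens = word_tokenize(open('corpus.txt', encoding='utf_8').read())
print(f'TTR: {ttr(tokens):.4f}, STTR: {sttr(tokens):.4f}')
```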

Concordancer [Back to Contents]

In Concordancer, you can search for any token in different files and generate concordance lines. You can adjust the settings for the generated data via Generation Settings.
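
At its core, a concordance line is just the search term plus a window of context on each side. A bare-bones sketch, ignoring Wordless's search and context settings:

```python
# A bare-bones KWIC (key word in context) generator: for every occurrence
# of the search term, emit the tokens on each side plus the node itself.
def concordance(tokens, search_term, width=10):
    for i, token in enumerate(tokens):
        if token.lower() == search_term.lower():
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            yield left, token, right

tokens = 'The quick brown fox jumps over the lazy dog'.split()
for left, node, right in concordance(tokens, 'the', width=3):
    print(f'{left:>30} | {node} | {right}')
```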

After the concordance lines are generated and displayed in the table, you can sort the results by clicking Sort Results or search in results by clicking Search in Results, both buttons residing at the right corner of the Results Area.

In addition, you can generate concordance plots for any search term. You can modify the settings for the generated figure via Figure Settings. By default, data in concordance plot are sorted by file. You can sort the data by search term instead via Figure Settings → Sort Results by.

  1. Left:
    The context before each search term, which displays the 10 tokens to the left of the Node by default. You can change this behavior via Generation Settings.
  2. Node:
    Nodes are the search terms specified in Search Settings → Search Term.
  3. Right:
    The context after each search term, which displays the 10 tokens to the right of the Node by default. You can change this behavior via Generation Settings.
  4. Token No.:
    The position of the first token of the Node in each file.
  5. Sentence No.:
    The position of the sentence in which the Node is found in each file.
  6. Paragraph No.:
    The position of the paragraph in which the Node is found in each file.
  7. File:
    The file in which the Node is found.

[Images: Concordance Table, Concordance Figure - File, Concordance Figure - Search Term]

Wordlist [Back to Contents]

In Wordlist, you can generate wordlists for different files and calculate the raw frequency, relative frequency, dispersion and adjusted frequency for each token.
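
Raw and relative frequencies are straightforward to picture (dispersion and adjusted frequency are covered under Supported Measures). A minimal sketch using whitespace tokenization in place of Wordless's tokenizers:

```python
# A minimal sketch of wordlist generation: the raw and relative frequency
# of each token type, ranked by raw frequency in descending order.
from collections import Counter

tokens = open('corpus.txt', encoding='utf_8').read().split()
freqs = Counter(tokens)

for rank, (token, freq) in enumerate(freqs.most_common(10), start=1):
    print(f'{rank:>4} {token:<20} {freq:>8} {freq / len(tokens):.6f}')
```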

In addition, you can generate line charts or word clouds for wordlists using any statistics. You can modify the settings for the generated figure via Figure Settings.

Lastly, you can further filter the results as you see fit by clicking Filter Results or search in the results for the part that might be of interest to you by clicking Search in Results, both buttons residing at the right corner of the Results Area.

  1. Rank:
    The rank of the token sorted by its frequency in the first file in descending order (by default). You can sort the results again by clicking the column headers.

  2. Tokens:
    You can specify what should be counted as a "token" via Token Settings.

  3. Frequency:
    The number of occurrences of the token in each file.

  4. Dispersion:
    The dispersion of the token in each file. You can change the measure of dispersion used via Generation Settings → Measure of Dispersion. See Measures of Dispersion & Adjusted Frequency for more details.

  5. Adjusted Frequency:
    The adjusted frequency of the token in each file. You can change the measure of adjusted frequency used via Generation Settings → Measure of Adjusted Frequency. See Measures of Dispersion & Adjusted Frequency for more details.

  6. Number of Files Found:
    The number of files in which the token appears at least once.

[Images: Wordlist Table, Wordlist Figure - Line Chart, Wordlist Figure - Word Cloud]

N-grams [Back to Contents]

In N-grams, you can search for n-grams (consecutive tokens) or skip-grams (non-consecutive tokens) in different files, count and compute the raw frequency and relative frequency of each n-gram/skip-gram, and calculate the dispersion and adjusted frequency for each n-gram/skip-gram using different measures. You can adjust the settings for the generated data via Generation Settings. To allow skip-grams in the results, check Generation Settings → Allow skipped tokens and modify the settings. You can also set constraints on the position of search terms in all n-grams via Search Settings → Search Term Position.
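
The difference between n-grams and skip-grams can be illustrated with NLTK, one of the libraries Wordless builds on. A minimal sketch:

```python
# A minimal sketch of n-gram vs. skip-gram extraction with NLTK:
# ngrams() yields consecutive token sequences, while skipgrams() also
# allows up to k skipped tokens inside each n-gram.
from collections import Counter
from nltk.util import ngrams, skipgrams

tokens = 'the quick brown fox jumps over the lazy dog'.split()

bigrams = Counter(ngrams(tokens, 2))
skip_bigrams = Counter(skipgrams(tokens, 2, 2))  # n = 2, up to 2 skips

print(bigrams.most_common(3))
print(skip_bigrams.most_common(3))
```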

It is possible to disable searching altogether and generate an exhaustive list of n-grams/skip-grams by unchecking Search Settings for each file, but this is not recommended, since processing might be too slow.

In addition, you can generate line charts or word clouds for n-grams using any statistics. You can modify the settings for the generated figure via Figure Settings.

Lastly, you can further filter the results as you see fit by clicking Filter Results or search in the results for the part that might be of interest to you by clicking Search in Results, both buttons residing at the right corner of the Results Area.

  1. Rank:
    The rank of the n-gram sorted by its frequency in the first file in descending order (by default). You can sort the results again by clicking the column headers.

  2. N-grams:
    You can specify what should be counted as an "n-gram" via Token Settings.

  3. Frequency:
    The number of occurrences of the n-gram in each file.

  4. Dispersion:
    The dispersion of the n-gram in each file. You can change the measure of dispersion used via Generation Settings → Measure of Dispersion. See Measures of Dispersion & Adjusted Frequency for more details.

  5. Adjusted Frequency:
    The adjusted frequency of the n-gram in each file. You can change the measure of adjusted frequency used via Generation Settings → Measure of Adjusted Frequency. See Measures of Dispersion & Adjusted Frequency for more details.

  6. Number of Files Found:
    The number of files in which the n-gram appears at least once.

[Images: N-grams Table, N-grams Figure - Line Chart, N-grams Figure - Word Cloud]

Collocation [Back to Contents]

In Collocation, you can search for patterns of collocation (tokens that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of tokens and calculate the effect size for each pair using different measures. You can adjust the settings for the generated data via Generation Settings.
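
Counting co-occurrences within the collocational window amounts to the following sketch (the significance tests run on the resulting counts are described under Supported Measures):

```python
# A minimal sketch of windowed co-occurrence counting: for each occurrence
# of the node, every token from 5 positions to the left through 5 positions
# to the right (the default window) is counted as a co-occurrence.
from collections import Counter

def collocates(tokens, node, window=5):
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            lower = max(0, i - window)
            upper = min(len(tokens), i + window + 1)
            for j in range(lower, upper):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

tokens = open('corpus.txt', encoding='utf_8').read().split()
print(collocates(tokens, 'time').most_common(10))
```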

It is possible to disable searching altogether and generate an exhaustive list of patterns of collocation by unchecking Search Settings for each file, but this is not recommended, since processing might be too slow.

In addition, you can generate line charts or word clouds for patterns of collocation using any statistics. You can modify the settings for the generated figure via Figure Settings.

Lastly, you can further filter the results as you see fit by clicking Filter Results or search in the results for the part that might be of interest to you by clicking Search in Results, both buttons residing at the right corner of the Results Area.

  1. Rank:
    The rank of the collocating token sorted by the p-value of the significance test conducted on the node and the collocating token in the first file in ascending order (by default). You can sort the results again by clicking the column headers.

  2. Nodes:
    The search term. You can specify what should be counted as a "token" via Token Settings.

  3. Collocates:
    The collocating token. You can specify what should be counted as a "token" via Token Settings.

  4. Ln, ... , L3, L2, L1, R1, R2, R3, ... , Rn:
    The number of co-occurrences of the node and the collocating token with the collocating token at the given position in each file.

  5. Frequency:
    The total number of co-occurrences of the node and the collocating token with the collocating token at all possible positions in each file.

  6. Test Statistic:
    The test statistic of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.

    Please note that the test statistic is not available for some tests of statistical significance.

  7. p-value:
    The p-value of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.

  8. Bayes Factor:
    The Bayes factor of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.

    Please note that the Bayes factor is not available for some tests of statistical significance.

  9. Effect Size:
    The effect size of the node and the collocating token in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See Tests of Statistical Significance & Measures of Effect Size for more details.

  10. Number of Files Found:
    The number of files in which the node and the collocating token co-occur at least once.

[Images: Collocation Table, Collocation Figure - Line Chart, Collocation Figure - Word Cloud]

Colligation [Back to Contents]

In Colligation, you can search for patterns of colligation (parts of speech that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of parts of speech and calculate the effect size for each pair using different measures. You can adjust the settings for the generated data via Generation Settings.

Wordless will automatically apply its built-in POS tagger, according to the language of each file, to every file that is not already POS-tagged. If POS tagging is not supported for the language of a file, you should provide a file that has already been POS-tagged and make sure that the correct Text Type is set for it.

It is possible to disable searching altogether and generate an exhaustive list of patterns of colligation by unchecking Search Settings for each file, but this is not recommended, since processing might be too slow.

In addition, you can generate line charts or word clouds for patterns of colligation using any statistics. You can modify the settings for the generated figure via Figure Settings.

Lastly, you can further filter the results as you see fit by clicking Filter Results or search in the results for the part that might be of interest to you by clicking Search in Results, both buttons residing at the right corner of the Results Area.

  1. Rank:
    The rank of the collocating part of speech sorted by the p-value of the significance test conducted on the node and the collocating part of speech in the first file in ascending order (by default). You can sort the results again by clicking the column headers.

  2. Nodes:
    The search term. You can specify what should be counted as a "token" via Token Settings.

  3. Collocates:
    The collocating part of speech. You can specify what should be counted as a "token" via Token Settings.

  4. Ln, ... , L3, L2, L1, R1, R2, R3, ... , Rn:
    The number of co-occurrences of the node and the collocating part of speech with the collocating part of speech at the given position in each file.

  5. Frequency:
    The total number of co-occurrences of the node and the collocating part of speech with the collocating part of speech at all possible positions in each file.

  6. Test Statistic:
    The test statistic of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.

    Please note that the test statistic is not available for some tests of statistical significance.

  7. p-value:
    The p-value of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.

  8. Bayes Factor:
    The Bayes factor of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.

    Please note that the Bayes factor is not available for some tests of statistical significance.

  9. Effect Size:
    The effect size of the node and the collocating part of speech in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See Tests of Statistical Significance & Measures of Effect Size for more details.

  10. Number of Files Found:
    The number of files in which the node and the collocating part of speech co-occur at least once.

[Images: Colligation Table, Colligation Figure - Line Chart, Colligation Figure - Word Cloud]

Keywords [Back to Contents]

In Keywords, you can search for candidates of potential keywords (tokens whose frequency in the observed file is far higher or far lower than in the reference file) in different files given a reference corpus, conduct different tests of statistical significance on each keyword and calculate the effect size for each keyword using different measures. You can adjust the settings for the generated data via Generation Settings.

In addition, you can generate line charts or word clouds for keywords using any statistics. You can modify the settings for the generated figure via Figure Settings.

Lastly, you can further filter the results as you see fit by clicking Filter Results or search in the results for the part that might be of interest to you by clicking Search in Results, both buttons residing at the right corner of the Results Area.

  1. Rank:
    The rank of the keyword sorted by the p-value of the significance test conducted on the keyword in the first file in ascending order (by default). You can sort the results again by clicking the column headers.

  2. Keywords:
    The candidates of potential keywords. You can specify what should be counted as a "token" via Token Settings.

  3. Frequency (in Reference File):
    The number of occurrences of the keyword in the reference file.

  4. Frequency (in Observed Files):
    The number of occurrences of the keyword in each observed file.

  5. Test Statistic:
    The test statistic of the significance test conducted on the keyword in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.

  6. p-value:
    The p-value of the significance test conducted on the keyword in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.

  7. Bayes Factor:
    The Bayes factor of the significance test conducted on the keyword in each file. You can change the test of statistical significance used via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.

    Please note that the Bayes factor is not available for some tests of statistical significance.

  8. Effect Size:
    The effect size of the keyword in each file. You can change the measure of effect size used via Generation Settings → Measure of Effect Size. See Tests of Statistical Significance & Measures of Effect Size for more details.

  9. Number of Files Found:
    The number of files in which the keyword appears at least once.

[Images: Keywords Table, Keywords Figure - Line Chart, Keywords Figure - Word Cloud]

Supported Languages [Back to Contents]

| Languages | Sentence Tokenization | Word Tokenization | Word Detokenization | POS Tagging | Lemmatization | Stop Words |
|---|---|---|---|---|---|---|
| Afrikaans | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Albanian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Arabic | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Armenian | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Asturian | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✖️ |
| Azerbaijani | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Basque | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Bengali | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✖️ |
| Breton | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Bulgarian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✔️ | ✔️ |
| Catalan | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
| Chinese (Simplified) | ✔️ | ✔️ | ✔️ | ✔️ | ✖️ | ✔️ |
| Chinese (Traditional) | ✔️ | ✔️ | ✔️ | ✔️ | ✖️ | ✔️ |
| Croatian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Czech | ✔️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
| Danish | ✔️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Dutch | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| English | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Esperanto | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Estonian | ✔️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✔️ |
| Finnish | ✔️ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
| French | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Galician | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✔️ |
| German | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Greek (Ancient) | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✔️ |
| Greek (Modern) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Hausa | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Hebrew | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Hindi | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Hungarian | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
| Icelandic | ⭕️ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
| Indonesian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Irish | ⭕️ | ✔️ | ⭕️ | ✖️ | ✔️ | ✔️ |
| Italian | ✔️ | ⭕️ | ⭕️ | ✔️ | ✔️ | ✔️ |
| Japanese | ✔️ | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ |
| Kannada | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Kazakh | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Korean | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Kurdish | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Latin | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Latvian | ⭕️ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
| Lithuanian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Malay | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Manx | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✖️ |
| Marathi | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Nepali | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Norwegian Bokmål | ✔️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Norwegian Nynorsk | ✔️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Persian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✔️ | ✔️ |
| Polish | ✔️ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
| Portuguese | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Romanian | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
| Russian | ⭕️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Scottish Gaelic | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✖️ |
| Sinhala | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Slovak | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
| Slovenian | ✔️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
| Sotho (Southern) | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Spanish | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Swahili | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Swedish | ✔️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
| Tagalog | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Tajik | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✖️ |
| Tamil | ⭕️ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
| Tatar | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Telugu | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Thai | ✔️ | ✔️ | ✔️ | ✔️ | ✖️ | ✔️ |
| Tibetan | ⭕️ | ✔️ | ✔️ | ✔️ | ✔️ | ✖️ |
| Turkish | ✔️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Ukrainian | ⭕️ | ✔️ | ⭕️ | ✔️ | ✔️ | ✔️ |
| Urdu | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Vietnamese | ✔️ | ✔️ | ⭕️ | ✔️ | ✖️ | ✔️ |
| Welsh | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✖️ |
| Yoruba | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Zulu | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
| Other Languages | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✖️ |

✔️: Supported
⭕️: Supported but falls back to the default English tokenizer
✖️: Not supported

Supported Text Types [Back to Contents]

You can specify your custom POS/Non-POS tags via Menu → Preferences → Settings → Tags.

| Text Types | Auto-detection |
|---|---|
| Untokenized / Untagged | ✔️ |
| Untokenized / Tagged (Non-POS) | ✔️ |
| Tokenized / Untagged | ✖️ |
| Tokenized / Tagged (POS) | ✔️ |
| Tokenized / Tagged (Non-POS) | ✖️ |
| Tokenized / Tagged (Both) | ✔️ |

Supported File Types [Back to Contents]

| File Types | File Extensions |
|---|---|
| Text Files | *.txt |
| Microsoft Word Documents | *.docx |
| Microsoft Excel Workbooks | *.xls, *.xlsx |
| CSV Files | *.csv |
| HTML Pages | *.htm, *.html |
| Translation Memory Files | *.tmx |
| Lyrics Files | *.lrc |

* Microsoft Word 97-2003 documents (*.doc) are not supported.
* Non-text files will first be converted to text files before being added to the File Table; a sketch of such a conversion follows. You can check the converted files under the folder Import at the installation location of Wordless on your computer (on macOS, right-click Wordless.app, select Show Package Contents and navigate to Contents/MacOS/Import/). You can change this location via Menu → Preferences → Settings → Import → Temporary Files → Default Path.
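
For example, the conversion of a *.docx file can be pictured with python-docx (listed in the Acknowledgments). This is only an illustration, not Wordless's actual import code:

```python
# A minimal sketch of converting a *.docx file to plain text with
# python-docx, roughly what happens before such a file reaches the
# File Table (illustrative only).
from docx import Document

def docx_to_txt(path_docx, path_txt):
    doc = Document(path_docx)

    with open(path_txt, 'w', encoding='utf_8') as f:
        # Each paragraph of the document becomes one line of text.
        for para in doc.paragraphs:
            f.write(para.text + '\n')

docx_to_txt('corpus.docx', 'corpus.txt')
```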

Supported File Encodings [Back to Contents]

| Languages | File Encodings | Auto-detection |
|---|---|---|
| All Languages | UTF-8 Without BOM | ✔️ |
| All Languages | UTF-8 with BOM | ✔️ |
| All Languages | UTF-16 with BOM | ✔️ |
| All Languages | UTF-16 Big Endian Without BOM | ✖️ |
| All Languages | UTF-16 Little Endian Without BOM | ✖️ |
| All Languages | UTF-32 with BOM | ✖️ |
| All Languages | UTF-32 Big Endian Without BOM | ✖️ |
| All Languages | UTF-32 Little Endian Without BOM | ✖️ |
| All Languages | UTF-7 | ✖️ |
| All Languages | CP65001 | ✖️ |
| Arabic | CP720 | ✖️ |
| Arabic | CP864 | ✖️ |
| Arabic | ISO-8859-6 | ✔️ |
| Arabic | Mac OS Arabic | ✖️ |
| Arabic | Windows-1256 | ✔️ |
| Baltic Languages | CP775 | ✖️ |
| Baltic Languages | ISO-8859-13 | ✖️ |
| Baltic Languages | Windows-1257 | ✖️ |
| Celtic Languages | ISO-8859-14 | ✖️ |
| Central European | CP852 | ✔️ |
| Central European | ISO-8859-2 | ✔️ |
| Central European | Mac OS Central European | ✔️ |
| Central European | Windows-1250 | ✔️ |
| Chinese | GB18030 | ✔️ |
| Chinese | GBK | ✖️ |
| Chinese (Simplified) | GB2312 | ✖️ |
| Chinese (Simplified) | HZ | ✔️ |
| Chinese (Traditional) | Big-5 | ✔️ |
| Chinese (Traditional) | Big5-HKSCS | ✖️ |
| Chinese (Traditional) | CP950 | ✖️ |
| Croatian | Mac OS Croatian | ✖️ |
| Cyrillic | CP855 | ✔️ |
| Cyrillic | CP866 | ✔️ |
| Cyrillic | ISO-8859-5 | ✔️ |
| Cyrillic | Mac OS Cyrillic | ✔️ |
| Cyrillic | Windows-1251 | ✔️ |
| English | ASCII | ✔️ |
| English | EBCDIC 037 | ✖️ |
| English | CP437 | ✖️ |
| Esperanto/Maltese | ISO-8859-3 | ✔️ |
| European | HP Roman-8 | ✖️ |
| French | CP863 | ✖️ |
| German | EBCDIC 273 | ✖️ |
| Greek | CP737 | ✖️ |
| Greek | CP869 | ✖️ |
| Greek | CP875 | ✖️ |
| Greek | ISO-8859-7 | ✔️ |
| Greek | Mac OS Greek | ✖️ |
| Greek | Windows-1253 | ✔️ |
| Hebrew | CP856 | ✖️ |
| Hebrew | CP862 | ✖️ |
| Hebrew | EBCDIC 424 | ✖️ |
| Hebrew | ISO-8859-8 | ✔️ |
| Hebrew | Windows-1255 | ✔️ |
| Icelandic | CP861 | ✖️ |
| Icelandic | Mac OS Icelandic | ✖️ |
| Japanese | CP932 | ✔️ |
| Japanese | EUC-JP | ✔️ |
| Japanese | EUC-JIS-2004 | ✖️ |
| Japanese | EUC-JISx0213 | ✖️ |
| Japanese | ISO-2022-JP | ✔️ |
| Japanese | ISO-2022-JP-1 | ✖️ |
| Japanese | ISO-2022-JP-2 | ✖️ |
| Japanese | ISO-2022-JP-2004 | ✖️ |
| Japanese | ISO-2022-JP-3 | ✖️ |
| Japanese | ISO-2022-JP-EXT | ✖️ |
| Japanese | Shift_JIS | ✔️ |
| Japanese | Shift_JIS-2004 | ✖️ |
| Japanese | Shift_JISx0213 | ✖️ |
| Kazakh | KZ-1048 | ✖️ |
| Kazakh | PTCP154 | ✖️ |
| Korean | EUC-KR | ✖️ |
| Korean | ISO-2022-KR | ✔️ |
| Korean | JOHAB | ✖️ |
| Korean | UHC | ✔️ |
| Nordic Languages | CP865 | ✖️ |
| Nordic Languages | ISO-8859-10 | ✔️ |
| North European | ISO-8859-4 | ✔️ |
| Persian/Urdu | Mac OS Farsi | ✖️ |
| Portuguese | CP860 | ✖️ |
| Romanian | Mac OS Romanian | ✖️ |
| Russian | KOI8-R | ✔️ |
| South-Eastern European | ISO-8859-16 | ✔️ |
| Tajik | KOI8-T | ✖️ |
| Thai | CP874 | ✖️ |
| Thai | ISO-8859-11 | ✖️ |
| Thai | TIS-620 | ✔️ |
| Turkish | CP857 | ✖️ |
| Turkish | EBCDIC 1026 | ✖️ |
| Turkish | ISO-8859-9 | ✔️ |
| Turkish | Mac OS Turkish | ✖️ |
| Turkish | Windows-1254 | ✖️ |
| Ukrainian | CP1125 | ✖️ |
| Ukrainian | KOI8-U | ✖️ |
| Urdu | CP1006 | ✖️ |
| Vietnamese | CP1258 | ✖️ |
| Western European | EBCDIC 500 | ✖️ |
| Western European | CP850 | ✖️ |
| Western European | CP858 | ✖️ |
| Western European | CP1140 | ✖️ |
| Western European | ISO-8859-1 | ✔️ |
| Western European | ISO-8859-15 | ✔️ |
| Western European | Mac OS Roman | ✖️ |
| Western European | Windows-1252 | ✔️ |

Supported Measures [Back to Contents]

Measures of Dispersion & Adjusted Frequency

The dispersion and adjusted frequency of a word in each file are calculated by first dividing the file into n (5 by default) sub-sections and counting the frequency of the word in each sub-section, the frequencies being denoted by F₁, F₂, F₃ ... Fₙ. The total frequency of the word in the file is denoted by F, and the mean of the frequencies over all sub-sections by F̄.

Then, the dispersion and adjusted frequency of the word are calculated using one of the following measures:

Measures of Dispersion:

  1. Juilland's D [1]
  2. Carroll's D₂ [2]
  3. Lyne's D₃ [3]
  4. Rosengren's S [4]
  5. Zhang's Distributional Consistency [5]
  6. Gries's DP [6]
  7. Gries's DPnorm [6][7]

Measures of Adjusted Frequency:

  1. Juilland's U [1]
  2. Carroll's Um [2]
  3. Rosengren's KF [4]
  4. Engwall's FM [8], where R is the number of sub-sections in which the word appears at least once
  5. Kromer's UR [9], where ψ is the digamma function and C is the Euler–Mascheroni constant
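
As an illustration, here is a sketch of one dispersion measure and its companion adjusted frequency in their common textbook formulations (see [1] and [6]): Juilland's D = 1 − V / √(n − 1), where V is the coefficient of variation of F₁ ... Fₙ, and Juilland's U = D × F. This is only a sketch of the idea, not Wordless's implementation:

```python
# A minimal sketch of Juilland's D and Juilland's U in their common
# textbook formulations: D = 1 - V / sqrt(n - 1), where V is the
# coefficient of variation of the sub-section frequencies, and U = D * F.
import math
import statistics

def juilland_d(freqs):
    n = len(freqs)
    # V: the (population) standard deviation over the mean.
    v = statistics.pstdev(freqs) / statistics.mean(freqs)
    return 1 - v / math.sqrt(n - 1)

def juilland_u(freqs):
    return juilland_d(freqs) * sum(freqs)

freqs = [10, 8, 12, 9, 11]  # F1 ... F5 from the 5 sub-sections
print(f"Juilland's D: {juilland_d(freqs):.4f}")
print(f"Juilland's U: {juilland_u(freqs):.2f}")
```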

Tests of Statistical Significance & Measures of Effect Size

To calculate the statistical significance, Bayes factor and effect size (except for Student's t-test (Two-sample) and the Mann-Whitney U Test) for two words in the same file (collocates) or for one specific word in two different files (keywords), two contingency tables must first be constructed: one for observed values, the other for expected values.

As for collocates (in Collocation and Colligation):

Observed Values:

|              | Word 1 | Not Word 1 | Row Total |
|--------------|--------|------------|-----------|
| Word 2       | o11    | o12        | o1x       |
| Not Word 2   | o21    | o22        | o2x       |
| Column Total | ox1    | ox2        | oxx       |

Expected Values:

|            | Word 1 | Not Word 1 |
|------------|--------|------------|
| Word 2     | e11    | e12        |
| Not Word 2 | e21    | e22        |

o11: Number of occurrences of Word 1 followed by Word 2
o12: Number of occurrences of Word 1 followed by any word except Word 2
o21: Number of occurrences of any word except Word 1 followed by Word 2
o22: Number of occurrences of any word except Word 1 followed by any word except Word 2

As for keywords (in Keywords):

Observed Values:

|              | Observed File | Reference File | Row Total |
|--------------|---------------|----------------|-----------|
| Word w       | o11           | o12            | o1x       |
| Not Word w   | o21           | o22            | o2x       |
| Column Total | ox1           | ox2            | oxx       |

Expected Values:

|            | Observed File | Reference File |
|------------|---------------|----------------|
| Word w     | e11           | e12            |
| Not Word w | e21           | e22            |

o11: Number of occurrences of Word w in the observed file
o12: Number of occurrences of Word w in the reference file
o21: Number of occurrences of all words except Word w in the observed file
o22: Number of occurrences of all words except Word w in the reference file
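
A sketch of one of the tests, the log-likelihood ratio [16], computed from the table above as G² = 2 Σ oᵢⱼ ln(oᵢⱼ / eᵢⱼ), with each expected value derived from the row and column totals. Wordless's own implementation may differ in detail:

```python
# A minimal sketch of the log-likelihood ratio (Dunning 1993) over the
# 2x2 contingency table above: G2 = 2 * sum(o * ln(o / e)).
import math

def log_likelihood_ratio(o11, o12, o21, o22):
    oxx = o11 + o12 + o21 + o22
    observed = (o11, o12, o21, o22)
    expected = (
        (o11 + o12) * (o11 + o21) / oxx,  # e11 = o1x * ox1 / oxx
        (o11 + o12) * (o12 + o22) / oxx,  # e12 = o1x * ox2 / oxx
        (o21 + o22) * (o11 + o21) / oxx,  # e21 = o2x * ox1 / oxx
        (o21 + o22) * (o12 + o22) / oxx,  # e22 = o2x * ox2 / oxx
    )
    # Terms with an observed value of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# Word w: 150 hits in a 10,000-token observed file,
# 100 hits in a 20,000-token reference file.
print(f'{log_likelihood_ratio(150, 100, 9850, 19900):.2f}')
```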

To conduct Student's t-test (Two-sample) or the Mann-Whitney U Test on a specific word, the observed file and the reference file are each first divided into n (5 by default) sub-sections. Then, the frequencies of the word in each sub-section of the observed file and the reference file are counted and denoted by FO₁, FO₂, FO₃ ... FOₙ and FR₁, FR₂, FR₃ ... FRₙ respectively. The total frequencies of the word in the observed file and the reference file are denoted by FO and FR respectively, and the means of the frequencies over all sub-sections of the two files by F̄O and F̄R respectively.
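
With the sub-section frequencies in hand, the two-sample tests can be pictured with SciPy's stock implementations (a sketch for illustration; Wordless's own implementation may differ):

```python
# A minimal sketch of the two-sample tests on sub-section frequencies,
# using SciPy's stock implementations for illustration.
from scipy import stats

fo = [12, 15, 9, 14, 10]  # FO1 ... FO5: sub-section frequencies, observed file
fr = [5, 7, 6, 4, 8]      # FR1 ... FR5: sub-section frequencies, reference file

t, p_t = stats.ttest_ind(fo, fr)
u, p_u = stats.mannwhitneyu(fo, fr, alternative='two-sided')

print(f"Student's t-test (Two-sample): t = {t:.4f}, p = {p_t:.4f}")
print(f'Mann-Whitney U Test: U = {u:.1f}, p = {p_u:.4f}')
```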

Then, the statistical significance, Bayes factor and effect size are calculated using one of the following tests and measures:

Tests of Statistical Significance:

  1. z-score [10][11]
  2. Student's t-test (One-sample) [12]
  3. Student's t-test (Two-sample) [13]
  4. Pearson's Chi-squared Test [14][15]
  5. Log-likelihood Ratio [16]
  6. Fisher's Exact Test [17] (see: Fisher's exact test - Wikipedia)
  7. Mann-Whitney U Test [18] (see: Mann–Whitney U test - Wikipedia)

Measures of Bayes Factor:

  1. Student's t-test (Two-sample) [19]
  2. Log-likelihood Ratio [19]

Measures of Effect Size:

  1. Pointwise Mutual Information [20]
  2. Mutual Dependency [21]
  3. Log-Frequency Biased MD [21]
  4. Cubic Association Ratio [22]
  5. MI.log-f [23][24]
  6. Mutual Information [25]
  7. Squared Phi Coefficient [26]
  8. Dice's Coefficient [27]
  9. logDice [28]
  10. Mutual Expectation [29]
  11. Jaccard Index [25]
  12. Minimum Sensitivity [30]
  13. Poisson Collocation Measure [31]
  14. Kilgarriff's Ratio [32], where α is the smoothing parameter (1 by default); you can change the value of α via Menu → Preferences → Settings → Measures → Effect Size → Kilgarriff's Ratio → Smoothing Parameter
  15. Odds Ratio [33]
  16. Log Ratio [34]
  17. Difference Coefficient [14][35]
  18. %DIFF [36]
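
Three of the effect-size measures are compact enough to show directly, in their standard textbook formulations (Pointwise Mutual Information [20], Dice's Coefficient [27] and logDice [28]). A sketch over the collocates table above:

```python
# A minimal sketch of three effect-size measures over the 2x2 collocates
# table above, in their standard textbook formulations:
#   PMI = log2(o11 / e11)
#   Dice = 2 * o11 / (o1x + ox1)
#   logDice = 14 + log2(Dice)
import math

def effect_sizes(o11, o12, o21, o22):
    oxx = o11 + o12 + o21 + o22
    o1x = o11 + o12  # row total for Word 2
    ox1 = o11 + o21  # column total for Word 1
    e11 = o1x * ox1 / oxx

    pmi = math.log2(o11 / e11)
    dice = 2 * o11 / (o1x + ox1)
    log_dice = 14 + math.log2(dice)

    return pmi, dice, log_dice

pmi, dice, log_dice = effect_sizes(30, 470, 970, 98530)
print(f'PMI: {pmi:.4f}, Dice: {dice:.4f}, logDice: {log_dice:.4f}')
```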

Works Cited [Back to Contents]

[1] Juilland, Alphonse and Eugenio Chang-Rodriguez. Frequency Dictionary of Spanish Words, Mouton, 1964.
[2] Carroll, John B. "An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index." Computer Studies in the Humanities and Verbal Behaviour, vol. 3, no. 2, 1970, pp. 61-65.
[3] Lyne, A. A. "Dispersion." The Vocabulary of French Business Correspondence. Slatkine-Champion, 1985, pp. 101-24.
[4] Rosengren, Inger. "The quantitative concept of language and its relation to the structure of frequency dictionaries." Études de linguistique appliquée, no. 1, 1971, pp. 103-27.
[5] Zhang Huarui, et al. "Distributional Consistency: As a General Method for Defining a Core Lexicon." Proceedings of Fourth International Conference on Language Resources and Evaluation, Lisbon, 26-28 May 2004.
[6] Gries, Stefan Th. "Dispersions and Adjusted Frequencies in Corpora." International Journal of Corpus Linguistics, vol. 13, no. 4, 2008, pp. 403-37.
[7] Lijffijt, Jefrey and Stefan Th. Gries. "Correction to Stefan Th. Gries’ “Dispersions and adjusted frequencies in corpora”" International Journal of Corpus Linguistics, vol. 17, no. 1, 2012, pp. 147-49.
[8] Engwall, Gunnel. "Fréquence Et Distribution Du Vocabulaire Dans Un Choix De Romans Français." Dissertation, Stockholm University, 1974.
[9] Kromer, Victor. "A Usage Measure Based on Psychophysical Relations." Journal of Quantitative Linguistics, vol. 10, no. 2, 2003, pp. 177-186.
[10] Dennis, S. F. "The Construction of a Thesaurus Automatically from a Sample of Text." Proceedings of the Symposium on Statistical Association Methods For Mechanized Documentation, Washington, D.C., 17 March, 1964, edited by Stevens, M. E., et al., National Bureau of Standards, 1965, pp. 61-148.
[11] Berry-Rogghe, Godelieve L. M. "The Computation of Collocations and their Relevance in Lexical Studies." The Computer and Literary Studies, edited by Aitken, A. J., Edinburgh UP, 1973, pp. 103-112.
[12] Church, Kenneth Ward, et al. "Using Statistics in Lexical Analysis." Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, edited by Uri Zernik, Psychology Press, 1991, pp. 115-64.
[13] Paquot, Magali and Yves Bestgen. "Distinctive Words in Academic Writing: A Comparison of Three Statistical Tests for Keyword Extraction." Language and Computers, vol. 68, 2009, pp. 247-269.
[14] Hofland, Knut and Stig Johansson. Word Frequencies in British and American English. Norwegian Computing Centre for the Humanities, 1982.
[15] Oakes, Michael P. Statistics for Corpus Linguistics. Edinburgh UP, 1998.
[16] Dunning, Ted Emerson. "Accurate Methods for the Statistics of Surprise and Coincidence." Computational Linguistics, vol. 19, no. 1, Mar. 1993, pp. 61-74.
[17] Pedersen, Ted. "Fishing for Exactness." Proceedings of the South-Central SAS Users Group Conference, 27-29 Oct. 1996, Austin.
[18] Kilgarriff, Adam. "Comparing Corpora." International Journal of Corpus Linguistics, vol. 6, no. 1, Nov. 2001, pp. 232-263.
[19] Wilson, Andrew. "Embracing Bayes Factors for Key Item Analysis in Corpus Linguistics." New Approaches to the Study of Linguistic Variability, edited by Markus Bieswanger and Amei Koll-Stobbe, Peter Lang, 2013, pp. 3-11.
[20] Church, Kenneth Ward and Patrick Hanks. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, vol. 16, no. 1, Mar. 1990, pp. 22-29.
[21] Thanopoulos, Aristomenis, et al. "Comparative Evaluation of Collocation Extraction Metrics." Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas, 29-31 May 2002, edited by Rodríguez, Manuel González Rodríguez and Carmen Paz Suarez Araujo, European Language Resources Association, May 2002, pp. 620-25.
[22] Daille, Béatrice. "Combined Approach for Terminology Extraction: Lexical Statistics and Linguistic Filtering." UCREL Technical Papers, vol. 5, University of Lancaster, 1995.
[23] Kilgarriff, Adam and David Tugwell. "Word Sketch: Extraction and Display of Significant Collocations for Lexicography." Proceedings of the ACL 2001 Collocations Workshop, Toulouse, 2001, pp. 32–38.
[24] "Statistics used in Sketch Engine." Sketch Engine, https://www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/. Accessed 26 Nov 2018.
[25] Dunning, Ted Emerson. "Finding Structure in Text, Genome and Other Symbolic Sequences." Dissertation, U of Sheffield, 1998. arXiv, arxiv.org/pdf/1207.1847.pdf.
[26] Church, Kenneth Ward and William A. Gale. "Concordances for Parallel Text." Using Corpora: Seventh Annual Conference of the UW Centre for the New OED and Text Research, St. Catherine's College, 29 Sept - 1 Oct 1991, UW Centre for the New OED and Text Research, 1991.
[27] Smadja, Frank, et al. "Translating Collocations for Bilingual Lexicons: A Statistical Approach." Computational Linguistics, vol. 22, no. 1, 1996, pp. 1-38.
[28] Rychlý, Pavel. "A Lexicographer-Friendly Association Score." Proceedings of Second Workshop on Recent Advances in Slavonic Natural Languages Processing, Karlova Studanka, 5-7 Dec. 2008, edited by Sojka, P. and A. Horák, Masaryk U, 2008, pp. 6-9.
[29] Dias, Gaël. "Language Independent Automatic Acquisition of Rigid Multiword Units from Unrestricted Text Corpora." Proceedings of Conférence Traitement Automatique des Langues Naturelles, 12-17 July 1999, Cargèse, edited by Mitkov, Ruslan and Jong C. Park, 1999, pp. 333-39.
[30] Pedersen, Ted. "Dependent Bigram Identification." Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, 26-30 July 1998, American Association for Artificial Intelligence, 1998, p. 1197.
[31] Quasthoff, Uwe and Christian Wolff. "The Poisson Collocation Measure and Its Applications." Proceedings of 2nd International Workshop on Computational Approaches to Collocations, Wien, Austria, 2002.
[32] Kilgarriff, Adam. "Simple Maths for Keywords." Proceedings of Corpus Linguistics Conference, Liverpool, 20-23 July 2009, edited by Mahlberg, M., et al., U of Liverpool, July 2009.
[33] Pojanapunya, Punjaporn and Richard Watson Todd. "Log-likelihood and Odds Ratio Keyness Statistics for Different Purposes of Keyword Analysis." Corpus Linguistics and Linguistic Theory, vol. 15, no. 1, Jan. 2016, pp. 133-67.
[34] Hardie, Andrew. "Log Ratio: An Informal Introduction." The Centre for Corpus Approaches to Social Science, http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/
[35] Gabrielatos, Costas. "Keyness Analysis: Nature, Metrics and Techniques." Corpus Approaches to Discourse: A Critical Review, edited by Taylor, Charlotte and Anna Marchi, Routledge, 2018.
[36] Gabrielatos, Costas and Anna Marchi. "Keyness: Appropriate Metrics and Practical Issues." Proceedings of CADS International Conference, U of Bologna, 13-14 Sept. 2012.

Documentation - Chinese (Simplified)

Editing...

