public class ChineseUtils extends Object
Modifier and Type | Field and Description |
---|---|
static int | ASCII |
static int | DELETE |
static int | DELETE_EXCEPT_BETWEEN_ASCII |
static int | FULLWIDTH |
static int | LEAVE |
static int | MAX_LEGAL |
static String | MID_DOT_REGEX_STR |
static int | NORMALIZE |
static String | NUMBERS |
static String | ONEWHITE |
static String | WHITE |
static String | WHITEPLUS |
Modifier and Type | Method and Description |
---|---|
static boolean | isNumber(char c) |
static void | main(String[] args): Mainly for testing. |
static String | normalize(String in) |
static String | normalize(String in, int ascii, int spaceChar) |
static String | normalize(String in, int ascii, int spaceChar, int midDot): This will normalize a Unicode String in various ways. |
static String | shapeOf(CharSequence input, boolean augmentedDateChars, boolean useMidDotShape) |
public static final String ONEWHITE
public static final String WHITE
public static final String WHITEPLUS
public static final String NUMBERS
public static final String MID_DOT_REGEX_STR
public static final int LEAVE
public static final int ASCII
public static final int NORMALIZE
public static final int FULLWIDTH
public static final int DELETE
public static final int DELETE_EXCEPT_BETWEEN_ASCII
public static final int MAX_LEGAL
public static boolean isNumber(char c)
public static String normalize(String in, int ascii, int spaceChar, int midDot)
Parameters:
in - The String to be normalized
ascii - For characters conceptually in the ASCII range of
! through ~ (U+0021 through U+007E or U+FF01 through U+FF5E):
if this is ChineseUtils.LEAVE, then do nothing;
if it is ASCII, then map them from the Chinese Full Width range
to ASCII values; and if it is FULLWIDTH, then do the reverse.
spaceChar - For characters that satisfy Character.isSpaceChar():
if this is ChineseUtils.LEAVE, then do nothing;
if it is ASCII, then map them to the space character U+0020; and
if it is FULLWIDTH, then map them to U+3000.
midDot - For a set of 7 characters that are roughly middle dot characters:
if this is ChineseUtils.LEAVE, then do nothing;
if it is NORMALIZE, then map them to the extended Latin character U+00B7; and
if it is FULLWIDTH, then map them to U+30FB.
public static void main(String[] args) throws IOException
ChineseUtils ascii spaceChar word*
ascii and spaceChar are integers: 0 = leave, 1 = ascii, 2 = fullwidth.
The words listed are then normalized and sent to stdout.
If no words are given, the program reads from and normalizes stdin.
Input is assumed to be in UTF-8.
Parameters:
args - Command line arguments as above
Throws:
IOException - If there are any problems accessing command-line files
public static String shapeOf(CharSequence input, boolean augmentedDateChars, boolean useMidDotShape)
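To illustrate the ascii and spaceChar options documented for normalize above, here is a minimal sketch of the width mapping they describe. It is not the ChineseUtils implementation. The class name WidthNormalizeSketch is hypothetical, the constant values are assumed from the 0/1/2 convention given for main, and midDot handling is omitted.

```java
/** Hypothetical stand-in for the documented ascii/spaceChar options; not the library's code. */
final class WidthNormalizeSketch {
    // Assumed values, mirroring the "0 = leave, 1 = ascii, 2 = fullwidth" convention of main().
    static final int LEAVE = 0;
    static final int ASCII = 1;
    static final int FULLWIDTH = 2;

    // U+FF01..U+FF5E sit exactly 0xFEE0 above U+0021..U+007E.
    private static final int OFFSET = 0xFEE0;

    static String normalize(String in, int ascii, int spaceChar) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (Character.isSpaceChar(c)) {
                // Space characters: map to U+0020 or U+3000, or leave untouched.
                if (spaceChar == ASCII) {
                    c = (char) 0x0020;
                } else if (spaceChar == FULLWIDTH) {
                    c = (char) 0x3000;
                }
            } else if (ascii == ASCII && c >= 0xFF01 && c <= 0xFF5E) {
                // Full width form down to its ASCII counterpart.
                c = (char) (c - OFFSET);
            } else if (ascii == FULLWIDTH && c >= 0x0021 && c <= 0x007E) {
                // ASCII up to its full width counterpart.
                c = (char) (c + OFFSET);
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Full width "ABC", an ideographic space, then ASCII "123".
        String mixed = "\uFF21\uFF22\uFF23\u3000123";
        System.out.println(normalize(mixed, ASCII, ASCII)); // prints "ABC 123"
    }
}
```

Passing LEAVE for both options returns the input unchanged in this sketch, which matches the "do nothing" behavior described for each parameter.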