1. Advanced String Processing
After dealing with basic data types around characters and strings in another chapter, we will now discuss advanced string processing. The topics are formatted output, regular expressions, string splitting.
Prerequisites
be able to format strings
be able to match, search, replace with regular expressions
be able to split strings
understand character encodings and UTF-8
Data types used in this chapter:
1.1. Format strings
To format strings, numbers, and temporal data as a string, there are different ways to do it in Java. In the package java.text
you can find the classes MessageFormat
, DateFormat
and DecimalFormat
as well as the class Formatter
and in String
the method String.format(…)
. The next tasks can be solved using the formatting strings from Formatter
in a rather easy way.
1.1.1. Build ASCII table ⭐
Bonny Brain has installed a new app on her Aye Phone that shows her the ASCII alphabet:
$ ascii Usage: ascii [-adxohv] [-t] [char-alias...] -t = one-line output -a = vertical format -d = Decimal table -o = octal table -x = hex table -b binary table -h = This help screen -v = version information Prints all aliases of an ASCII character. Args may be chars, C \-escapes, English names, ^-escapes, ASCII mnemonics, or numerics in decimal/octal/hex. Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex 0 00 NUL 16 10 DLE 32 20 48 30 0 64 40 @ 80 50 P 96 60 ` 112 70 p 1 01 SOH 17 11 DC1 33 21 ! 49 31 1 65 41 A 81 51 Q 97 61 a 113 71 q 2 02 STX 18 12 DC2 34 22 " 50 32 2 66 42 B 82 52 R 98 62 b 114 72 r 3 03 ETX 19 13 DC3 35 23 # 51 33 3 67 43 C 83 53 S 99 63 c 115 73 s 4 04 EOT 20 14 DC4 36 24 $ 52 34 4 68 44 D 84 54 T 100 64 d 116 74 t 5 05 ENQ 21 15 NAK 37 25 % 53 35 5 69 45 E 85 55 U 101 65 e 117 75 u 6 06 ACK 22 16 SYN 38 26 & 54 36 6 70 46 F 86 56 V 102 66 f 118 76 v 7 07 BEL 23 17 ETB 39 27 ' 55 37 7 71 47 G 87 57 W 103 67 g 119 77 w 8 08 BS 24 18 CAN 40 28 ( 56 38 8 72 48 H 88 58 X 104 68 h 120 78 x 9 09 HT 25 19 EM 41 29 ) 57 39 9 73 49 I 89 59 Y 105 69 i 121 79 y 10 0A LF 26 1A SUB 42 2A * 58 3A : 74 4A J 90 5A Z 106 6A j 122 7A z 11 0B VT 27 1B ESC 43 2B + 59 3B ; 75 4B K 91 5B [ 107 6B k 123 7B { 12 0C FF 28 1C FS 44 2C , 60 3C < 76 4C L 92 5C \ 108 6C l 124 7C | 13 0D CR 29 1D GS 45 2D - 61 3D = 77 4D M 93 5D ] 109 6D m 125 7D } 14 0E SO 30 1E RS 46 2E . 62 3E > 78 4E N 94 5E ^ 110 6E n 126 7E ~ 15 0F SI 31 1F US 47 2F / 63 3F ? 79 4F O 95 5F _ 111 6F o 127 7F DEL
However, their Aye Phone doesn’t have such a widescreen, and the first two blocks are not visible characters anyway.
Task:
Write a program that prints all ASCII characters from position 32 to 127 in the same formatting as the Unix program
ascii
does.At position 127, write
DEL
.
1.1.2. Aligned outputs ⭐
Captain CiaoCiao needs a table of the following type for a listing:
Dory Dab paid Bob Banjo paid Cod Buri paid Bugsy not paid
Task:
Write a method
printList(String[] names, boolean[] paid)
that prints a collection on the screen. The first string array contains all names, the second array contains information whether the person has paid or not.All names can be of different lengths, but the texts in the second column should be aligned.
The longest string in the first column has a distance of four spaces to the second column.
If passed arrays are
null
, aNullPointerException
must be thrown.
1.2. Regular expressions and pattern recognition
Regular expressions are a curse and a blessing for many. Used incorrectly, they lead to programs that are impossible to read later, used correctly, they shorten the program massively and contribute to clarity.
The next exercises will show that we can use regular expressions to test whether
a string completely matches a regular expression,
a partial string exists and if so, find out where, and then replace it.
Later, we will also use regular expressions to specify separators and decompose strings.
1.2.1. Quiz: Define regex ⭐
What does the regex look like to match or find the following?
a string of exactly 10 digits
a string of 5 to 10 digits and letters
a string ending in
.
,!
or?
like a sentencea non-empty string that does not contain digits
an official title or a name title:
Prof.
,Dr.
,Dr. med.
,Dr. h.c.
in the string
1.2.2. Determine popularity in social media ⭐
Of course Captain CiaoCiao is active on social media, his identifier is #CaptainCiaoCiao
or @CaptainCiaoCiao
.
Now Captain CiaoCiao wants to know how popular he is.
Task:
Given an aggregated text with messages; how often does
#CaptainCiaoCiao
or@CaptainCiaoCiao
occur there?
Example:
For the following input, the result is 2.
Make me a baby #CaptainCiaoCiao Hey @CaptainCiaoCiao, where is the recruitment test? What is a hacker’s favorite pop group? The Black IP’s.
1.2.3. Detect scanned values ⭐
Bonny Brain receives scanned lists of numbers that need to be processed electronically. In the first step, it sends the scans through an OCR recognizer, and the result is ASCII text. The numbers from the OCR recognition always look like this:
000 11 22 333 4 4 5555 6 77777 888 9999 0 00 111 2 2 3 4 4 5 6 7 8 8 9 9 0 0 0 11 2 33 4444 555 6666 7 888 9999 00 0 11 2 3 4 5 6 6 7 8 8 9 000 11l1 2222 333 4 555 666 7 888 9
Task:
Given is a line from the scan with numbers from the format shown. Convert the numbers to an integer.
There could be missing spaces after the last digit, and there could be several spaces between the large characters.
Example:
If the string (written in the text block syntax) is
String ocr = """ 4 4 77777 11 11 4 4 22 4 4 7 111 111 4 4 2 2 4444 7 11 11 4444 2 4 7 11 11 4 2 4 7 11l1 11l1 4 2222""";
so the desired result should be
471142
.
If you want to play around with the strings, you can find a way at https://patorjk.com/software/taag/#p=display&f=Alphabet&t=0123456789.
1.2.4. Quiet please! Defuse shouting texts ⭐
Captain CiaoCiao often gets letters, and the senders often SCREAM in capital letters — how nasty for his eyes.
Task:
Write a method
String silentShoutingWords(String)
that converts all capitalized words over three letters in length to lowercase.
Example:
silentShoutingWords("AY Captain! Smutje MUST GO!")
→"AY Captain! Smutje must GO!"
1.2.5. Converting time from AM/PM format to 24-hour format ⭐⭐
Bonny Brain frequently gets messages where the time is given in AM and PM.
We raid the harbor at 11:00 PM and meet on the amusement mile at 1:30 AM.
Bonny Brain doesn’t like that, she wants only the 24-hour count of Military Time.
Task:
Write a converter that converts strings with AM/PM (case-insensitive, even with periods) to Military Time. As a reminder, 12:00 AM is 00:00, and 12:00 PM is 12:00.
Examples:
"Harbour: 11:00 PM, entertainment districts: 1:30 a.m.!"
→Harbour: 2300, entertainment districts: 0130!"
"Get out of bed: 12:00AM, bake a cake: 12 PM."
→"Get out of bed: 0000, bake a cake: 1200"
1.3. Decompose strings into tokens
Tokenizing is the opposite of constructing and formatting a string. A string is split into substrings, with separators determining the separation points. Java provides various classes for tokenizing strings and input. The separators can be symbols or strings described by regular expressions. The split(…)
method of the String
class works with regular expressions just like the Scanner
, the StringTokenizer
class does not.
1.3.1. Split address lines with the StringTokenizer ⭐
The software of Captain CiaoCiao has to evaluate addresses consisting of three or four lines.
The meaning of the lines
line | content | optional |
---|---|---|
1 | name | no |
2 | street | no |
3 | city | no |
4 | country | yes |
The lines are separated with a line break. There are four valid separator symbols or sequences:
character (abbreviation) | decimal | hexadecimal | escape sequence |
---|---|---|---|
LF |
|
|
|
CR |
|
|
|
CR LF |
|
|
|
LF CR |
|
|
|
LF is the abbreviation for "line feed" and CR for "carriage return"; in old teleprinters, CR moved the carriage to the first column, and LF pushed the paper up.
Traditionally, DOS and Microsoft Windows use the combination \r\n
, while Unix systems use \n
.
Task:
Break a newline-separated string into four lines, and assign the lines to the variables
name
,street
,city
,country
.If a fourth line with the country name is not given, let
country
be"Drusselstein"
.Reassemble the line as a CSV line separated by semicolons.
Examples:
"Boots and Bootles\n21 Pickle Street\n424242 Douglas\nArendelle"
→Boots and Bootles;21 Pickle street;424242 Douglas;Arendelle
"Doofenshmirtz Evil Inc.\nStrudelkuschel 4427\nDanville"
→Doofenshmirtz Evil Inc.;Strudelkuschel 4427;Gimmelshtump;Drusselstein
1.3.2. Split sentences into words and reverse them ⭐
Bonny Brain is waiting for a message, but something went wrong with the transmission — all the words are reversed!
erehW did eht etarip esahcrup sih kooh? tA eht dnah-dnoces pohs!
Task:
Break the string into words. Separators of words are spaces and punctuation marks.
Turn over all the words one by one.
Output the words one after the other, separated by a space. The punctuation marks and other separators are not critical.
Example:
"erehW did eht etarip esahcrup sih kooh? tA eht dnah-dnoces pohs!"
→"Where did the pirate purchase his hook At the hand second shop"
1.3.3. Check relations between numbers ⭐
Captain CiaoCiao practices archery, and he records the scores from 0 to 10 in a list. He also notes if he got better, worse, or if the score stays the same. It may look like this:
1 < 2 > 1 < 10 = 10 > 2
Goldy Goldfish has the task of checking the relation signs <
, >
and =
.
Task:
Write a program that gets a string like the one in the example and returns
true
if all relation signs are correct, andfalse
otherwise.
Examples:
1 < 2 > 1 < 10 = 10 > 2
→true
1 < 1
→false
1 <
→false
1
→true
1.3.4. Convert A1 notation to columns and rows ⭐⭐
Captain CiaoCiao keeps records of his loot and uses spreadsheets. With his staff, he discusses the figures, and to address the cells he uses the column and row index; for example, he says 4-16
to refer to the 4th column and the 16th row. Now he has heard of a whole new way to name cells, A1 notation, which uses a new kind of software called ECKSEL. This involves coding the column with letters from A
to Z
, according to the following scheme:
A, B, …, Z, AA, …, AZ, BA, …, ZZ, AAA, AAB, …
The lines are still described with numbers. Thus, A2 stands for cell 1-2.
Since Captain CiaoCiao has its difficulties with A1 notation, the specification is to be converted back to numeric columns and rows.
Task:
Write a method
parseA1Notation(String)
that gets astring
in A1 notation and returns an array with two elements, in which at position0
is the column and at position1
is the row.
Examples:
parseA1Notation( "A1" )
→[1, 1]
parseA1Notation( "Z2" )
→[26, 2]
parseA1Notation( "AA34" )
→[27, 34]
parseA1Notation( "BZ" )
→[0, 0]
parseA1Notation( "34" )
→[0, 0]
parseA1Notation( " " )
→[0, 0]
parseA1Notation( "" )
→[0, 0]
1.3.5. Parse simple CSV files with coordinates ⭐
Bonny Brain notes the locations with loot in a CSV file coordinates.csv, where the coordinates are floating-point numbers separated by commas.
For example, the file looks like this:
20.091612,-155.676695 23.087301,-73.643472 21.305452,-71.690421
Task:
Create a CSV file manually. It should contain several lines with coordinates; a comma separates the coordinates.
A Java program should read the CSV file and output an HTML file with SVG for the polygon course on the screen.
Use the class
Scanner
to parse the file. Make sure to initialize theScanner
withuseLocale(Locale.ENGLISH)
if your locale is not English.
Example: For the upper block, we want to create
<svg height="210" width="500">
<polygon points="20.091612,-155.676695 23.087301,-73.643472 21.305452,-71.690421 " style="fill:lime;stroke:purple;stroke-width:1" />
</svg>
1.3.6. Compress strings lossless by runlength encoding ⭐⭐⭐
To reduce the volume of data, files are often compressed. There are different compression algorithms; some are lossy like vowel removal, others work without loss like ZIP. Lossy compression is found in images, JPEG is a good example. Depending on the degree of compression, the image quality degrades. In JPEG, very high compression results in an image with strong artifacts.
A simple lossless compression is run-length encoding. The idea is to combine a sequence of identical symbols so that only the number and the symbol are written. The graphic format GIF, for example, uses this form of compression. Therefore, images with many monochrome lines are also smaller than, for example, images in which each pixel has a different color.
The next task is about run-length encoding. Suppose a string consists of a sequence of .
(dot) and -
(minus sign), such as:
--....--------..-
To shorten the length of strings, we can first write the symbol followed by the number of symbols. The string with 17 characters could be shortened to the following string with 9 characters:
-2.4-8.2-
Task:
Create a new class
SimpleStringCompressor
.Write a static method
String compress(String)
that encodes sequences of.
and-
according to the described algorithm: First comes the character, then the number.Write a decoder
String decompress(String)
that unpacks the compressed string. Letdecompress(compress(input))
be equal toinput
.
Extensions:
The program shall be able to handle all non-digits.
Refine the program so that the number is omitted if the character occurs only exactly once.