Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / desktop / Swing

A Simple String Compression Algorithm

4.75/5 (16 votes)
10 Jul 2011CPOL4 min read 188K   3.7K  
An efficient algorithm for encode 7-bit characters to 6 or 5 bits to reduce the size of a string

Introduction

Data compression is always useful for encoding information using fewer bits than the original representation it would use. There are many applications where the size of information would be critical. In data communication, the size of data can affect the cost too.

This algorithm was originally implemented for use in an SMS application. Using this algorithm, it could send about 256 characters per message (typically 160 characters per message) through the same 7-bit GSM network. The idea is, this program reduces the standard 7-bit encoding to some application specific 5-bit encoding system and then pack into a byte array. This method will reduce the size of a string considerably when the string is lengthy and the compression ratio is not affected by the content of the string.

The Algorithm

Encoding 

Let’s assume that we have a string with 8 characters (example: - “abcdefgh”). If we put this on a byte array, we get a byte array with the size of 8. The following diagram shows how these ASCII characters can store in an array.

|97|98|99|100|101|102|103|104|

A single character will need 8 bits if the characters are represented with ASCII. A set of 8 bits can represent 256 different characters. But if we consider the current application, a simple SMS might be included only around 26 different characters. Therefore it is enough to have 5-bit encoding which can give up to 32 different characters to represent. For convert to 5-bit, let’s assign new values to the above characters.

| a = 1 | b=2 | c=3 | d=4 | e=5 | f=6 | g=7 | h=8 |

If we look more closely at the new byte array, it will look like the following (the values of characters are in binary representation).

00000001|00000010|00000011|00000100|00000101|00000110|00000111|00001000|

But we still use 8 bytes for storing the 8 characters. In the next step, we will chop each byte from the position of 3rd bit from the left side and extract the 5 least significant bits. The result will be as follows:

|00001|00010|00011|00100|00101|00110|00111|01000|

We can rearrange these bits in an array of bytes as follows:

|00001000|10000110|01000010|10011000|11101000|

Now we have reduced 8 bytes to 5 bytes. The next section shows how these 5 bytes convert to the 8 bytes and get the original information.

Decoding

When an array of bytes is given, each byte should be represented in to binary. Then all the 1’s and 0’s should be arranged as their index and then can be split to the sets of five bits. After splitting, it will be as follows:

|00001  000|10  00011  0|0100  0010|1  00110  00|111  01000|

These sets can be converted to decimals and these values represent the characters that we have encoded.

|00001  = 1(a)

000|10  = 2 (b) 

00011   = 3(c) 

0|0100 = 4 (d)

0010|1 = 5 (e)

00110  = 6 (f) 

00|111 = 7 (g) 

01000| = 8 (h)  

Then the information can be decoded as “abcdefgh”.

Using the Code

Classes

class SixBitEnDec: The class that is responsible for encode and decode

Variables

final static public int SIX_BIT = 6; A constant for flag the operation as 6-bit conversion

final static public int FIVE_BIT = 5; A constant for flag the operation as 5-bit conversion

Functions

Java
byte[] encode(String txt, int bit){
	int length = txt.length();
	float tmpRet1=0,tmpRet2=0;
	if(bit==6){
		tmpRet1=3.0f;
		tmpRet2=4.0f;
	}else if(bit==5){
		tmpRet1=5.0f;
		tmpRet2=8.0f;
	}
	byte encoded[]=new byte[(int)(tmpRet1*Math.ceil(length/tmpRet2))];
	char str[]=new char[length];
	txt.getChars(0,length,str,0);
	int chaVal = 0;
	String temp;
	String strBinary = new String("");
	for (int i = 0;i<length; i++){
		temp = Integer.toBinaryString(toValue(str[i]));
		while(temp.length()%bit != 0){
			temp="0"+temp;
		}
		strBinary=strBinary+temp;
	}
	while(strBinary.length()%8 != 0){
	   strBinary=strBinary+"0";
	}
	Integer tempInt =new Integer(0);
	for(int i=0 ; i<strBinary.length();i=i+8){
		tempInt = tempInt.valueOf(strBinary.substring(i,i+8),2);
		encoded[i/8]=tempInt.byteValue();
	}
	return encoded;
}

This function is responsible for the whole encoding operation. The parameter txt can be any text that contains characters from English alphabet and the bit is the number of bits that is to be encoded. The value of the bit can be either 5 or 6. This function simply gets the relevant value of each character from the function toValue() and then get binary representation of each value. Then the code makes all these binary numbers to the length of 5 or 6 (according to the value of bit) by chopping the most significant bits or adding zeros in front of the numbers. Finally, arrange these 1s and 0s in a string and split them to the sets of 8 bits in order to store in a byte array. This byte array will be returned by this function as the encoded string.

Java
String decode(byte[] encoded, int bit){
	String strTemp = new String("");
	String strBinary = new String("");
	String strText = new String("");
	Integer tempInt =new Integer(0);
	int intTemp=0;
	for(int i = 0;i<encoded.length;i++){         
		if(encoded[i]<0){
			intTemp = (int)encoded[i]+256;
		}else
			intTemp = (int)encoded[i];
		strTemp = Integer.toBinaryString(intTemp);
		while(strTemp.length()%8 != 0){
			strTemp="0"+strTemp;
		}
		strBinary = strBinary+strTemp;
	}
	for(int i=0 ; i<strBinary.length();i=i+bit){
		tempInt = tempInt.valueOf(strBinary.substring(i,i+bit),2);
		strText = strText + toChar(tempInt.intValue()); 
	}
	return strText;
}

This function is responsible for the whole decoding operation. This function takes an array of bytes as the encoded data and the bit to switch the decoding to one of the 6-bit or 5- bit. In here also, the values in the array are converted to the binary representation and then converted into a string. After that, it will be split to the sets of 5 bits or 6 bits regarding the value of bit. Finally get the character that is relevant to the value from the function toChar() and append to a string. This string will be returned as the decoded string from the function.

Java
int toValue(char ch){
	int chaVal = 0;
	switch(ch){
		case' ':chaVal=0;break; case'a':chaVal=1;break;
		case'b':chaVal=2;break; case'c':chaVal=3;break;
		case'd':chaVal=4;break; case'e':chaVal=5;break;
		case'f':chaVal=6;break; case'g':chaVal=7;break;
		case'h':chaVal=8;break; case'i':chaVal=9;break;
		case'j':chaVal=10;break; case'k':chaVal=11;break;
		case'l':chaVal=12;break; case'm':chaVal=13;break;
		case'n':chaVal=14;break; case'o':chaVal=15;break;
		case'p':chaVal=16;break; case'q':chaVal=17;break;
		case'r':chaVal=18;break; case's':chaVal=19;break;
		case't':chaVal=20;break; case'u':chaVal=21;break;
		case'v':chaVal=22;break; case'w':chaVal=23;break;
		case'x':chaVal=24;break; case'y':chaVal=25;break;
		case'z':chaVal=26;break; case'.':chaVal=27;break;
		case'*':chaVal=28;break; case',':chaVal=29;break;
		case'\':chaVal=30;break; case'2':chaVal=31;break;
		case'A':chaVal=32;break; case'B':chaVal=33;break;
		case'C':chaVal=34;break; case'D':chaVal=35;break;
		case'E':chaVal=36;break; case'F':chaVal=37;break;
		case'G':chaVal=38;break; case'H':chaVal=39;break;
		case'I':chaVal=40;break; case'J':chaVal=41;break;
		case'K':chaVal=42;break; case'L':chaVal=43;break;
		case'M':chaVal=44;break; case'N':chaVal=45;break;
		case'O':chaVal=46;break; case'P':chaVal=47;break;
		case'Q':chaVal=48;break; case'R':chaVal=49;break;
		case'S':chaVal=50;break; case'T':chaVal=51;break;
		case'U':chaVal=52;break; case'V':chaVal=53;break;
		case'W':chaVal=54;break; case'0':chaVal=55;break;
		case'1':chaVal=56;break; case'3':chaVal=57;break;
		case'4':chaVal=58;break;case'5':chaVal=59;break;
		case'6':chaVal=60;break;case'7':chaVal=61;break;
		case'8':chaVal=62;break;case'9':chaVal=63;break;
		default:chaVal=0;
	}
	return chaVal;
}

This function returns a value for the given character in the alphabet. These values are lower than 64 so 6 bit number can represent any of these characters:

Java
char toChar(int val){
	char ch = ' ';
	switch(val){
		case 0:ch=' ';break; case 1:ch='a';break;
		case 2:ch='b';break; case 3 :ch='c';break;
		case 4:ch='d';break; case 5 :ch='e';break;
		case 6:ch='f';break; case 7 :ch='g';break;
		case 8:ch='h';break; case 9 :ch='i';break;
		case 10:ch='j';break; case 11:ch='k';break;
		case 12:ch='l';break; case 13:ch='m';break;
		case 14:ch='n';break; case 15:ch='o';break;
		case 16:ch='p';break; case 17:ch='q';break;
		case 18:ch='r';break; case 19:ch='s';break;
		case 20:ch='t';break; case 21:ch='u';break;
		case 22:ch='v';break; case 23:ch='w';break;
		case 24:ch='x';break; case 25:ch='y';break;
		case 26:ch='z';break; case 27:ch='.';break;
		case 28:ch='*';break; case 29:ch=',';break;
		case 30:ch='\';break; case 31 :ch='2';break;
		case 32:ch='A';break; case 33:ch='B';break;
		case 34:ch='C';break; case 35:ch='D';break;
		case 36:ch='E';break; case 37:ch='F';break;
		case 38:ch='G';break; case 39:ch='H';break;
		case 40:ch='I';break; case 41:ch='J';break;
		case 42:ch='K';break; case 43:ch='L';break;
		case 44:ch='M';break; case 45:ch='N';break;
		case 46:ch='O';break; case 47:ch='P';break;
		case 48:ch='Q';break; case 49:ch='R';break;
		case 50:ch='S';break; case 51:ch='T';break;
		case 52:ch='U';break; case 53:ch='V';break;
		case 54:ch='W';break; case 55:ch='0';break;
		case 56:ch='1';break; case 57:ch='3';break;
		case 58:ch='4';break; case 59:ch='5';break;
		case 60:ch='6';break; case 61:ch='7';break;
		case 62:ch='8';break; case 63:ch='9';break;
		default:ch=' ';
	}
	return ch;
}

This function returns the character that relates to the given number as defined in this code. The number that assigned to the character is not important but each number must be able to represent by 6 bits.

How to Use

This program is demonstrating the use of class SixBitEnDec using a simple interface. You can insert some texts and observe the size of the text after compressing and also the text after decompressing.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)