|
hello
i have a file with only entry of 0..9 (numbers only) without any other thing even space , what is the most efficient algorithm to compress it?
thanks in advance.
|
|
|
|
|
at the command level, use a ZIP utility, such as WinZip.
When programming an application, create a binary file. Assuming all digit values have same probability, you could:
- create a binary file with BinaryWriter class; it will be filled with ulong values;
- the first value to write would be the number of digits to be stored;
- then have a loop that reads 19 digits, turns them into a single ulong (use ulong.Parse or TryParse), and outputs that ulong in binary;
- when reaching the end of the input data, which probably did not have a length evenly divisible by 19, just imagine the few missing digits are all zero, and generate the last ulong as you did all the others.
That is it; you have turned N characters (probably bytes in an ASCII text file) into a sequence of N/19*8 bytes, giving a compression factor of 2.37
You could do better if something was known about the distribution of the digits; otherwise that is about the best you could get.
Luc Pattyn [Forum Guidelines] [My Articles] Nil Volentibus Arduum
Please use <PRE> tags for code snippets, they preserve indentation, improve readability, and make me actually look at the code.
|
|
|
|
|
No, a compression factor of 2.37 for that purpose is not good at all. Just send a text file to a "zip compressed" folder in Windows explorer, and you get a compression ratio of about 10. And, of course, the text file contains not only digits but other ASCII characters also, which means that it already contains much more information per byte of storage, than does the file with digits only.
Hence I'd take some common zip algorithm. The "ICSharpZipLib" is also available for commercial products (see http://www.sharpdevelop.net/OpenSource/SharpZipLib/[^]).
In the Unix world, I sometimes see "bz2" compressed files which are said to be extremely compressed; but I do not know a library to handle that with .NET.
|
|
|
|
|
Incorrect. You either believe in magic, or you didn't understand my suggestion.
So I created a little proof:
public void generateFiles() {
using (TextWriter tw=File.CreateText("test.txt")) {
using (BinaryWriter bw=new BinaryWriter(File.Create("test.dat"))) {
int DIM=100*1000;
Random r=new Random();
for (int i=0; i<DIM; i++) {
ulong n=0;
for (int j=0; j<19; j++) {
int d=r.Next(10);
tw.Write("01234567890".Substring(d,1));
n=10*n+(ulong)d;
}
bw.Write(n);
}
}
}
}
and I WinZipped the txt file; the resulting sizes were:
.txt = 1,856 KB
.dat = 782 KB (that is a compression factor of 2.37 indeed)
.zip = 872 KB (that is 2.13)
which tells me WinZip is wasting around 10%.
While ZIP normally achieves around a factor of 10, that simply can not be achieved when the data is really random and has log2(10) significant bits in each 8-bit byte (if it could, it would store a digit in less than one bit which is clearly impossible)
Luc Pattyn [Forum Guidelines] [My Articles] Nil Volentibus Arduum
Please use <PRE> tags for code snippets, they preserve indentation, improve readability, and make me actually look at the code.
|
|
|
|
|
That is, the high compression rate of Windows zip with texts like log files is caused by the repetitive structure of the contents, though some 6 out of 8 bits per byte are used for the text.
With log2(10) significant bits per byte, the theoretical limit should then be a compression factor of 2.408, thus your idea already achieves some 98.5% of it.
Let's see how much more number crunching and execution time will be required to get it one percent better.
|
|
|
|
|
one could perform my method with a wider integer; with a 256-bit integer the compression factor would be 77/32=2.406
The BigInteger class could be helpful to achieve that and more. I didn't mention it earlier as I don't care about the last few percents.
Luc Pattyn [Forum Guidelines] [My Articles] Nil Volentibus Arduum
Please use <PRE> tags for code snippets, they preserve indentation, improve readability, and make me actually look at the code.
|
|
|
|
|
What's the strategy of backup programs when files are in use?
If you back up a file during use, your backup may end up useless because of inconsistent data.
So, should a backup program should attempt exclusive access on every file it backs up?
And if it can't gain exclusive access, maybe come back to that file later?
The difficult we do right away...
...the impossible takes slightly longer.
|
|
|
|
|
That is the way most backup systems work. If any files still cannot be accessed after some predetermined number of retry attempts then the program would report that fact in its log.
I must get a clever new signature for 2011.
|
|
|
|
|
As you have pointed-out; a backup of files alone is not enough to guarantee synchronisation on restore. Even if all files can be accessed there is still the time-sync/event-sync problem.
A backup from which an app can restore itself is usually called a checkpoint. Such a "backup" would involve a transaction history file and a log file, as well as a "checkpoint" copy of pertinent files. Such backups are always built-into the app itself. The restore phase restores the files and then runs the app forward using the transaction history elements as input. The app must be suspended for the time it takes to perform the checpoint. There are techniques for minimising the timespan of the suspension, such as file twinning.
A system-wide backup is not in my ken for today's platforms. However, anti-malware apps perform system-wide scans all the time, so I don't think file access is an issue, here.
So what kind of backup do you have in mind?
Tadeusz Westawic
Sumi quid sum.
|
|
|
|
|
I want to perform a standard type of file backup where the files are compressed and encrypted into a single backup file.
I think I'll simply keep a log file of which files I couldn't open for exclusive access, and leave it to the user to close whatever application is using the file.
The difficult we do right away...
...the impossible takes slightly longer.
|
|
|
|
|
This may sound very simple, but these parts of my brain are a bit rusty from disuse. I need to report on insurance policies, stating whether a policy is current, in advance/credit, or behind/in arrears. Am I correct in presuming it is as simple as below, providing the monthly payments remain constant?
Received = Tally of all monthly payments on the policy.
Expected = Policy age in months x monthly premium.
If Received < Expected, policy is in arrears.
If Received > Expected, policy is paid in advance.
If Received == Expected, policy is current.
|
|
|
|
|
Yes.
Will Rogers never met me.
|
|
|
|
|
Unless I am missing something, don't you have monthly payment object where you track due date, payment day, payment amount, payment made etc. If so you should be able to query month by month.
If you are trying to do a quick check, then you logic is correct. I am assuming you meant
Received = Tally of all monthly payments on the policy [for the policy period]
|
|
|
|
|
Thing is, even if this month, I received the full monthly payment, before the due date, the policy will be current for this month, but if I missed last month's payment, it is overall still in arrears, unless I include a brought-forward amount. I think it's more accurate and practical to sum all payments made on the policy, instead of implementing an extra brought-forward field, and having to keep this up to date at the end of each month. This would in fact required a monthly run, where this system is only intended for MIS and on demand, not monthly, reports.
I don't understand your qualification [for the policy period] . There surely should not be any payments for policy outside the policy period?
|
|
|
|
|
How does the business define being in arrears?
My guess is that you should use the expected sum to date compared with the actual sum to date.
So I think you are right - it is that simple.
Regards
David R
---------------------------------------------------------------
"Every program eventually becomes rococo, and then rubble." - Alan Perlis
The only valid measurement of code quality: WTFs/minute.
|
|
|
|
|
I don't have any definition from the business, and I haven't been able to elicit one yet, so for the first, prototype release I'm goinf with arrears == expected > actual.
|
|
|
|
|
Brady Kelly wrote: I don't understand your qualification [for the policy period]. There surely should not be any payments for policy outside the policy period?
Here is my understanding here in US. This is personal experience and I have not worked in the insurance industry.
I've a car insurance, which is my policy. I happen to buy it on the month of December. So my policy period runs from Dec'10 to Dec'11 for the current year. So, when they are calculating my payment to figure out the status they only need to look into the policy period . Last years policy period is irrelevant. Of course by the end of the policy period last year, I must have been current otherwise the insurance company would not have renewed it for this year.
Brady Kelly wrote: Thing is, even if this month, I received the full monthly payment, before the due date, the policy will be current for this month, but if I missed last month's payment, it is overall still in arrears, unless I include a brought-forward amount.
I agree, in fact if you go back and see my original post, what I said was this method will give you a quick way of checking the status of the policy. But you won't be able to know why it is so.
|
|
|
|
|
Looks like what I'm working on is life, not auto, so it doesn't have a yearly renewal, but a payout date, so it's actually assurance, but the 'spec' is in Afrikaans, which I read and speak well, but doesn't translate that well into a business spec for me, and I'm not in direct contact with the user. I'm working through an intermediary who has not been very available this week.
|
|
|
|
|
Ah, the plot thickens!
My first answer assumed the whole scenario was as simple as explained. But in most businesses, and in accounting generally, an aging report breaks amounts into groups - Current, 30 days, 60 days, 90 days, and >90 days in arrears. This is very likely what your client wants to see. Of course, it's a lot more work, but it gives a much better report.
Will Rogers never met me.
|
|
|
|
|
Roger Wright wrote: Current, 30 days, 60 days, 90 days, and >90 days
No, it's not the actual finance software, it's just a sort of basic 'dashboard' for the broker. They probably use an accounting package for the real aging, the brokers probably use a pen. I'll leave room for more detail though - take this baby Enterprise. Mwaahaaaahaaa.
|
|
|
|
|
Brady Kelly wrote: just a sort of basic 'dashboard'
Oh, then I'll stick with my original answer - comparing the sums is a good way to do it.
Will Rogers never met me.
|
|
|
|
|
I'm a little tired of staring at these few lines and not seeing the problem. Take this code:
volatile int WorkingBuff = 0;
volatile int NextBuff = 1;
volatile int RenderBuff = 1;
void stuff()
{
NextBuff = WorkingBuff;
int usedId = 0;
do
{
usedId = RenderBuff;
for(int i=0; i<3; i++)
{
if(i != NextBuff && i != usedId)
WorkingBuff = i;
}
} while(RenderBuff != usedId);
}
void render()
{
RenderBuff = NextBuff;
}
Could anybody point out the situation where WorkingBuff and RenderBuff have the same value? I'm just not seeing it.
|
|
|
|
|
1. NextBuff = WorkingBuff; in Thread A.
2. Thread A is blocked and Thread B starts.
3. RenderBuff = NextBuff; in Thread B.
|
|
|
|
|
Sorry, I should have elaborated a bit. I mean after the do-while loop had run.
I'm having a problem it seems, where I'm trying to render with stuff I'm in the middle of updating.
|
|
|
|
|
You could use lock () around the places you absolutely don't want to execute concurrently. This would remove all the uncertainty and make your code more reliable.
|
|
|
|
|