Introduction
I decided to write this article after days of coding and finding resources and information on this site, so here it is: a tool I use almost every day in my system admin life.
This article presents a tool coded in C# that scans folders (UNC paths or local drives/paths) and stores the following information in a SQL database:
- Name
- Size
- Date/time info (creation, access, change)
- Owner
- ACLs (on folder only) and Rights inheritance
Once stored, this data is used to process a simple MS OLAP application (MS Analysis Services 2008) that consolidates file counts and file sizes over multiple dimensions, recreating the folder tree; the result can then be used in many ways. An Excel pivot table connected to the cube is a great way to access this data.
Real life case: I use this software to scan, weekly, the file servers where I work. There are 22 network shares scanned, for a grand total of 1.4 million folders and 15 million files. The scan runs from a command line on the server hosting the database, for convenience, and takes around 25-26 hours (the scan is multithreaded and runs on the 22 shares simultaneously).
Background
Why remake a TreeSize when there are other TreeSize-like applications out there?
First, this project started about six years ago, when I was confronted with a simple problem: there are too many files on our company file servers! What can we do?
TreeSize just didn't cut it. At the time, I was playing with VB6 and implementing Essbase servers, so I was beginning to understand what OLAP meant, and I looked for an MS solution. I found that the file scan I was developing with a SQL DB was quite well suited to an OLAP cube with a parent-child dimension (which never worked on Essbase...).
It allowed me to analyse much more efficiently how files were stored by the users.
Since then, the code has evolved and changed: I integrated ACLs into the scan, the code migrated to C#, then I implemented multithreading as the number of file servers grew, and lastly I migrated to .NET 3.5, tasted LINQ-to-SQL, and adopted it.
Using the code
Prerequisites
To run this code, quite a few things are needed:
- VS2008/.net 3.5
- SQL Server 2005 Developer or Enterprise Edition (or 2008)
- Analysis Services
Preparing the environment
In the zip file, you should find a 'CreateDB.sql' script. Edit it and change all the occurrences of
D:\MSSQL\MSSQL.1\MSSQL\DATA\
to the desired data path (the path must exist before running the script).
Then create the OLAP application by launching the file 'TreeAnalysis.xmla'.
You need to change the SQL connection in the OLAP application: go to "Data Sources" and modify the connection string to connect to your SQL Server.
You can then compile and run the code.
Example command lines:
DiskAnalysis.exe -s:localhost -d:DiskReport -o:localhost -a:TreeAnalysis
DiskAnalysis.exe -s:localhost -d:DiskReport -o:localhost -a:TreeAnalysis -l:3 -t:4
(-s is the SQL Server, -d the database, -o the OLAP server, and -a the OLAP application to process; the optional -l and -t switches map to the level and maxThread parameters described in section 1.1 below.)
Database
The database is pretty simple:
- a file is in one and only one folder, and a file has one owner (SID)
- a folder is in one and only one folder; a folder has multiple folders, multiple files, and multiple ACLs
- that gives a schema looking like the LINQ-to-SQL diagram just below
- the table SIDs stores the SIDs, which are unique, together with the translated name when translation was possible
- the table Rights does approximately the same thing, but for the numeric values the API gives for the ACLs (e.g., 2032127 translates to FullControl)
That was the basic info on the database; there is more fun stuff in the Points of Interest section.
The table TreeDetail stores the root folders to be scanned. Example:
You will need to insert the folders manually here, as I have not made an admin interface (yet); see the example just below.
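As a minimal sketch, assuming the generated TreeDetail entity exposes the RootFolder and Enabled columns used by the scan query in Program.cs, a one-off insert through the LINQ-to-SQL context could look like this:

// Minimal sketch of a one-off insert; the TreeDetail entity and its
// RootFolder/Enabled columns are taken from the scan query in Program.cs.
DiskReport.DiskReportDataContext dc = new DiskReport.DiskReportDataContext(
    "Data Source=localhost;Initial Catalog=DiskReport;Integrated Security=True");
DiskReport.TreeDetail row = new DiskReport.TreeDetail();
row.RootFolder = @"\\Fileserver1\share";   // hypothetical share to scan
row.Enabled = true;                        // only enabled rows are picked up by the scan
dc.TreeDetails.InsertOnSubmit(row);
dc.SubmitChanges();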
LINQ-to-SQL Class - DiskReport.dbml
This is the class linking the database to the code.
Thanks to it, I was able to remove about 200 lines of SQL queries from the scanning process.
Thanks to VS2008, I just drag and dropped my database object from the server explorer, and voila!
I made this class a standalone class library because multiple tools use this database, like a file searcher and a "user manager for domain" showing which folders a domain user has access to, but that's another article.
DiskAnalysis
Now, the interesting stuff.
1 - Program.cs
Right off the bat, I start with two public static members:
public static Semaphore _pool;
public static System.Data.SqlClient.SqlConnectionStringBuilder sb;
_pool
will limit the multithreaded scan.
sb
is an easy way to give all my threads the same database information; there may be a nicer way, but I felt passing DB connection information to each object running a thread would weigh the code down.
Once the command line parameters are parsed, I initialize the SqlConnectionStringBuilder:
sb = new System.Data.SqlClient.SqlConnectionStringBuilder();
sb.DataSource = servername;
sb.InitialCatalog = database;
sb.IntegratedSecurity = true;
Then the program starts the scan. Once the scan is finished, I do some cleanup and translation on the scanned ACLs and SIDs, and finally I launch a full process of the OLAP database.
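For orientation, the overall sequencing after argument parsing is essentially this (a sketch: the method names come from the sections below, but ProcessCube's exact parameter list is my assumption):

start(level, maxThread);              // 1.1 - multithreaded scan of every enabled root folder
updateACL();                          // 1.2 - translate numeric rights values into readable text
updateSID();                          // 1.3 - translate SIDs into NT account names
ProcessCube(olapServer, application); // 1.4 - full process of the OLAP database (assumed signature)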
1.1 - Preparation of the scan: start(int level, int maxThread)
if(maxThread > 0) _pool = new Semaphore(maxThread, maxThread);
If the maxThread parameter was not on the command line, or is set to 0, there is no limit on the number of threads. I decided not to initialize the semaphore in this case; later, I check whether the semaphore is null before waiting on it.
Next, I initialize the data context for the LINQ-to-SQL class with the string builder, reset the database, then get the list of folders that will be scanned:
DiskReport.DiskReportDataContext dc = new DiskReport.DiskReportDataContext(sb.ConnectionString);
if (!clearDB(dc)) return;
var q = from t in dc.TreeDetails
        where t.Enabled == true
        select new { t.id, t.RootFolder };
if (q.Count() == 0) return;
If the clearDB() function did not succeed, I abort the program; the scan would not work properly.
If there is no folder to scan, exit.
Okay, once everything is ready, I initialize four arrays:
Tree[]
which will do the actual scan
ThreadStart[]
which will contain the references to the thread start function of each Tree object
Thread[]
which will be the actual threading objects
ManualResetEvent[]
which will be used to signal the main thread when each worker thread has finished.
Tree[] at = new Tree[q.Count()];
ThreadStart[] ats = new ThreadStart[q.Count()];
Thread[] ath = new Thread[q.Count()];
ManualResetEvent[] events = new ManualResetEvent[q.Count()];

int i = 0;
foreach (var root in q)
{
    if (root.RootFolder != "")
    {
        events[i] = new ManualResetEvent(false);
        at[i] = new Tree(root.RootFolder, root.id, level, events[i]);
        ats[i] = new ThreadStart(at[i].startProcess);
        ath[i] = new Thread(ats[i]);
        ath[i].Name = root.RootFolder;
        ath[i].Start();
        i++;
        Thread.Sleep(5000);
    }
}
WaitHandle.WaitAll(events);
The important point is to initialize the Tree object with all the parameters needed by the startProcess() function.
I also added a 5 second wait because of thread startup problems, which I will talk about below.
1.2 - UpdateACL()
private static void updateACL()
{
    Console.WriteLine("- Update ACLs");
    DiskReport.DiskReportDataContext dc =
        new DiskReport.DiskReportDataContext(sb.ConnectionString);
    dc.sp_ClearACLs();
    var rights = (from r in dc.FoldersACLs
                  select r.Rights).Distinct();
    foreach (Int32 right in rights)
    {
        FileSystemRights fsr = (FileSystemRights)right;
        string str = fsr.ToString();
        if (right.ToString() == str) str = fsr.ToString("X");
        dc.sp_InsertRight(right, str);
        Console.WriteLine("insert {0} as {1}", right, str);
    }
}
This function's goal is to translate numeric values like 0x001F01FF into readable, human friendly information, in this case "FullControl".
When the rights value scanned in the ACE has no matching name, fsr.ToString() just returns the decimal number (this happens a lot with live data on old, long lived file servers), so I store the hexadecimal string instead.
The stored procedure sp_ClearACLs() is there to do a bit of cleaning of the most redundant, and often not useful, entries.
1.3 - UpdateSID()
private static void updateSID()
{
    Console.WriteLine("- Update SIDs");
    System.Security.Principal.SecurityIdentifier sid;
    DiskReport.DiskReportDataContext dc =
        new DiskReport.DiskReportDataContext(sb.ConnectionString);
    dc.CommandTimeout = 6000;
    dc.sp_UpdateSIDList();
    var SSDLs = from s in dc.SIDs select s.SID1;
    foreach (string ssdl in SSDLs)
    {
        sid = new System.Security.Principal.SecurityIdentifier(ssdl);
        string ntAccount = "";
        try
        {
            ntAccount = sid.Translate(typeof(System.Security.Principal.NTAccount)).Value;
        }
        catch (Exception ex)
        {
            Console.WriteLine("{0} {1}", ssdl, ex.Message);
        }
        if (ntAccount == "") ntAccount = "<Unknown>";
        var s = (from a in dc.SIDs
                 where a.SID1 == ssdl
                 select a).Single();
        s.Name = ntAccount;
        dc.SubmitChanges();
    }
}
This function translates SID strings into readable NT account names (e.g., "S-1-5-21-3872767328-3467091273-3605603707-1001" = "DESKTOP\Administrator" on my computer).
The first thing this function does is call the stored procedure sp_UpdateSIDList(), which scans the owner SIDs in the Files and Folders tables, extracts all the unique ones, and stores them in the SIDs table:
insert into SIDs (SID)
select DISTINCT owner
FROM files
where owner not in (select SID from SIDs)
-- and similarly for the Folders table, per the description above
Then it tries to translate each SID with the Translate() method: ntAccount = sid.Translate(typeof(System.Security.Principal.NTAccount)).Value;
The two update functions (SID and ACL) are called after the scan for performance reasons: the scan would not be viable if these translations were made on every insert.
1.4 - ProcessCube()
This last function is really simple and boils down to:
Server s = new Server();
s.Connect(olapServer);
s.Databases[application].Process(ProcessType.ProcessFull);
Server is a member of the Microsoft.AnalysisServices namespace (AMO, the Analysis Management Objects library).
2 - Tree.cs
This class is where the work is done.
The constructor initializes the members of Tree that will be used in the scan process. A small note on this line:
dc.ObjectTrackingEnabled = false;
This stops the data context from tracking objects, and as a consequence prevents inserting rows through the context with the usual entity pattern (creating a new object and submitting it).
I tried tracked inserts when changing the code from plain old 'INSERT INTO' SQL queries to LINQ-to-SQL syntax, and it was a really bad idea: lots of memory leaks and extremely bad performance (from 10 minutes, the scan went up to 12 hours and more).
I then switched to SQL stored procedures to insert files and folders into the database, and found I had the same performance as with the direct SQL commands.
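To make the contrast concrete, here is a hedged sketch of both patterns (the File entity, its columns, and sp_InsertFile's signature are my assumptions; only the pattern itself is the point):

// The pattern I abandoned (it requires ObjectTrackingEnabled = true).
// The File entity and its columns are assumptions based on the schema above.
DiskReport.File f = new DiskReport.File();
f.Name = "report.doc";
f.Size = 42;
dc.Files.InsertOnSubmit(f);   // the context starts tracking this object...
dc.SubmitChanges();           // ...and millions of tracked objects killed memory and speed

// The pattern used now: one stored procedure call per row, no tracking needed.
// sp_InsertFile and its parameter list are hypothetical; see the real
// sp_InsertFolderACL call in StoreACL() below for the same idea.
dc.sp_InsertFile(1, "report.doc", 42);   // (parent folder id, name, size)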
2.1 StartProcess()
This function is the thread entry point; it starts with this line:
if (Program._pool != null) Program._pool.WaitOne();
which waits if the number of allowed running threads has been reached.
If the semaphore pool is null, I don't wait, because there is no limit.
Then it calls the initProcess() function, which determines whether the path to scan is a local drive or a UNC path; if it is a UNC path, it maps the path to a drive letter.
It also removes the trailing "\" from the path, for logging purposes.
Next, it inserts the root folder in the database and starts the recursive function, described next, that does the scan.
Finally, it disconnects the network drive if needed and sets the event telling the main thread it has finished.
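Put together, the function has roughly this shape (a sketch: the helper names, the event member, and the exact release point of the semaphore are assumptions based on the description above):

// Rough shape of startProcess() -- helper names, the event member, and the
// semaphore release point are assumptions; the steps are the ones described.
public void startProcess()
{
    if (Program._pool != null) Program._pool.WaitOne(); // throttle: wait for a free slot
    try
    {
        initProcess();                            // map the UNC path to a drive letter if needed
        int id = InsertFolder(rootFolder, null);  // root folder: parent id is null
        processFolder(rootFolder, level, id);     // recursive scan (section 2.2)
    }
    finally
    {
        disconnectDrive();                        // unmap the network drive if one was mapped
        if (Program._pool != null) Program._pool.Release(); // free the slot
        doneEvent.Set();                          // tell the main thread this share is done
    }
}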
2.2 processFolder(path, level, idparent)
This function does the following things:
- lists all the files in the path and inserts them in the SQL database with all their attributes
- for each subfolder in the path:
- inserts the folder in the database
- gets the ACL and inserts it in the database
- recursively calls itself with the subfolder as parameter
There is a point to make about file dates. I came across millions of files, and some had really strange creation dates (from 1701-xx-xx to 2400-xx-xx; I guess they came from old systems, or were detached from Lotus mail). After a while, I found that the last access time was (almost) never wrong, so I decided that if the creation or modification date is before 1950, I set it to the last access date.
It's arbitrary, and can be changed to something else, but at least the files get inserted in the database; otherwise they would be skipped, resulting in wrong data.
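As a sketch, the check described above could look like this (the variable names are mine; the 1950 cutoff and the fallback to the last access time come straight from the paragraph above):

// Sketch of the date sanity check: the 1950 cutoff and the fallback to
// the last access time are from the article; variable names are mine.
FileInfo fi = new FileInfo(fileName);              // the file being inserted
DateTime creation = fi.CreationTime;
DateTime lastWrite = fi.LastWriteTime;
DateTime lastAccess = fi.LastAccessTime;
if (creation.Year < 1950) creation = lastAccess;   // bogus date (e.g., year 1701)
if (lastWrite.Year < 1950) lastWrite = lastAccess; // last access is almost never wrong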
2.3 InsertFolder()
This function inserts a new folder in the database and returns the new database id.
First I declare:
int? id = 0;
This nullable type lets the stored procedure's OUTPUT parameter return the data; otherwise, it generates an incompatible type error.
Then it tries to get the folder owner. I had a strange problem here: even with a try-catch block inside the getFolderOwner function, I was required to add another try-catch around the call. I still don't know why.
The next test is there to handle whether this is the first folder (the root) to be inserted or not.
In the design of the DB, and for the parent-child dimension to work properly, the root folder must have its parent ID set to null and its name must be the full path,
i.e.: \\Fileserve1\share
while subfolders cannot have a null parentID and must not contain the full path in their name, just the short name.
Here is the resulting assignment of the value 'name':
string name = (idparent.HasValue) ?
    folder.Substring(folder.LastIndexOf(@"\") + 1) :
    folder;
In the call to the stored procedure, there is this expression:
this.rootFolder + folder.Substring(2)
which always represents the full path of the current folder, whether it is on a mapped drive or a local drive, a root folder or a subfolder.
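As a hedged sketch, the whole call could look like this (sp_InsertFolder and its exact parameter list are my assumptions; the nullable OUTPUT id and the full path expression are from the text above):

// Hypothetical shape of the insert call: sp_InsertFolder and its parameter
// list are assumptions, but the nullable OUTPUT id and the full-path
// expression come straight from the text above.
int? id = 0;
dc.sp_InsertFolder(name,                                  // short name, or full path for the root
                   idparent,                              // null for the root folder
                   this.rootFolder + folder.Substring(2), // always the real full path
                   ref id);                               // OUTPUT parameter: the new database id
return id.Value;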
2.4 - StoreACL()
private void StoreACL(Int32 idFolder, string folder)
{
    DirectoryInfo dinfo = new DirectoryInfo(folder);
    DirectorySecurity dSecurity = dinfo.GetAccessControl();
    AuthorizationRuleCollection returninfo = dSecurity.GetAccessRules(true, true,
        System.Type.GetType("System.Security.Principal.SecurityIdentifier"));
    foreach (FileSystemAccessRule fsa in returninfo)
    {
        try
        {
            dc.sp_InsertFolderACL(idFolder,
                fsa.IdentityReference.Value,
                (int)fsa.FileSystemRights,
                (fsa.IsInherited) ? 1 : 0,
                (int)fsa.InheritanceFlags,
                (int)fsa.PropagationFlags);
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
This function was a pain to convert from the old VB6 and kernel32/advapi API calls to clean .NET code. There is not much documentation, and figuring out how the various objects relate to each other, to finally get a FileSystemAccessRule object, which represents an ACE (access control entry), was a very long and difficult process.
But there it is. You will also notice the explicit conversions to (int): these members of fsa are flags enums and cannot be stored in the database as is.
The FileSystemRights value is the one that gets translated in the updateACL() function.
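To illustrate the round-trip between the int stored here and the readable form produced later (0x001F01FF is the FullControl example from section 1.2):

// Round-trip between the int stored by StoreACL and the readable form
// produced by updateACL (0x001F01FF is the FullControl example from 1.2).
int stored = (int)FileSystemRights.FullControl;    // 2032127 == 0x001F01FF
FileSystemRights fsr = (FileSystemRights)stored;
Console.WriteLine(fsr.ToString());                 // "FullControl"
Console.WriteLine(fsr.ToString("X"));              // "001F01FF" -- the hex fallback form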
2.5 - get owners
private string getFileOwner(string filename)
{
    FileSecurity tmp = new FileSecurity(filename, AccessControlSections.Owner);
    string owner = "<unknown>";
    try
    {
        owner = tmp.GetOwner(System.Type.GetType("System.Security.Principal.SecurityIdentifier")).Value;
    }
    catch { }
    return owner;
}
This function is also the clean way of getting the owner of a file (or a folder), instead of calling the Win32 API through P/Invoke.
3 - NetworkDrive.cs
I need to find where I got this nice piece of code so I can credit the original developer. I think I grabbed it from CodeProject, but I'm not sure; this will be updated when I find it.
This class is pretty self explanatory: it maps a UNC path to a drive letter. I modified the code to make the function nextFreeDrive() public so I could use it in the Tree object.
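For reference, here is roughly how the Tree object would use it (a sketch: apart from nextFreeDrive, the member names of this helper class are my assumptions):

// Rough usage from the Tree object -- apart from nextFreeDrive, the member
// names of this helper class are assumptions.
NetworkDrive nd = new NetworkDrive();
nd.LocalDrive = nd.nextFreeDrive();    // the function I made public
nd.ShareName = @"\\Fileserver1\share"; // the UNC root folder to scan
nd.MapDrive();                         // the scan then runs against the drive letter
// ... recursive scan ...
nd.UnMapDrive();                       // disconnect when the share is done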
Points of Interest
This software is the base of other tools I'm working on right now.
There is a file finder that lets me search through millions of folders for the files I need (like all the .AVI and .MP3 files, forbidden on the network), look for duplicate files, etc.
(This is possible through SQL queries, but I wanted something others could use.)
I'm working on an Active Directory user manager that lets admins see which folders a user has access to through groups.
There is also the reporting that is possible with OLAP and MDX queries. Right now I don't really know how MDX works, and my only way to query the OLAP application is through Excel.
I think more analysis can be done (for example, all the users' home directories store the same folder for mail and other stuff; a well designed MDX query could show details for those folders only).
Given your needs, this software can be the foundation of many useful things.
OLAP Database
Here is the design of the cube; the fact table and dimensions are views in the SQL Server:
Here is an example of what it can look like in Excel:
More on the Database
As said in the introduction, the database is pretty simple as far as the data it stores. But there would be problems of performance, DB file size, and tuning if it were just created as-is.
For example, my real life DB (the one with the 15M files) is about 7GB; there are lots of indexes I came up with as the number of files grew and query performance got bad.
I realized really early in development that some tables did not hold permanent data and just needed to be dropped and recreated at runtime. I could have used TRUNCATE TABLE, but it didn't satisfy me, as the data files were not clean enough: I kept getting lower performance as the files got fragmented, and all in all it required lots of maintenance for a tool that was supposed to run on its own over the weekend.
I then redesigned the database storage and finally got to where it is now: there is one stored procedure that drops the tables, constraints, and files, and another one that recreates the whole thing.
So the database is now made of 8 files: one data and one log file storing the permanent data, then, for the tables Files, Folders, and ACLs, one file for the data and one file for the indexes.
My SQL Server is now happy, and queries run a tiny bit faster.
Future
In the next articles (or in this one), I'll add an administration interface and try to come up with a packaged setup with all the possible options (database location, DB file sizes, schedule management, and security; right now the account running the program must have access to both the folders and the DB, and multiple logins for the network shares are not supported).
History
v1 Initial release: please comment! (Any advice and criticism will be well received, it is my first article.) And forgive my English :)