 |
|
| .NET DotNet Forum Index » Visual C# Forum » Fastest way to sort... |
|
Page 1 of 1 |
|
| Author |
Message |
| Bill... |
Posted: Fri Oct 30, 2009 1:14 pm |
|
|
|
Guest
|
What is the fastest way to sort two columns and one million rows?
I currently have no database for this application and am wondering if
I should distribute access or sql server express to speed up a sort
routine.
I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column(number) and
I create a Dataview for this and use the dataview sort method. I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.
What would be faster? I tried Access but with a large number of rows,
the insert slowed down and I could not find a way to do a bulk insert
with access -if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table and it was slow both ways. Does Sql
Server Express support any type of bulk loading?
Other than the need to sort two columns, I have a lot of control over
the data source, etc. |
|
|
| Back to top |
|
|
|
| Bill... |
Posted: Fri Oct 30, 2009 2:49 pm |
|
|
|
Guest
|
Pete,
I have to keep the two data items(columns) together, which is why I
used a dataset/dataview.The sort must be on the first column only. I
tried doing this with a custom sort routine within the List class sort
method, but that was even slower than the dataset. I will look into
the Dictionary class to see what it can do. |
|
|
| Back to top |
|
|
|
| Bill... |
Posted: Fri Oct 30, 2009 3:54 pm |
|
|
|
Guest
|
Scott,
The Dictionary class did the trick. THANK YOU. It sorted over 3
million data items in 2.5 minutes, down from 8.5 minutes.
Quote:
Have you considered using a generic Disctionary(Of T) type? These generic
types support sorting and may prove to be quicker than sorting a DataSet via
a DataView.
-Scott |
|
|
| Back to top |
|
|
|
| Scott M.... |
Posted: Fri Oct 30, 2009 5:33 pm |
|
|
|
Guest
|
"Bill" <billsahiker at (no spam) yahoo.com> wrote in message
news:6c889d7f-725e-49dc-bcc0-fe7098e77fa4 at (no spam) f20g2000prn.googlegroups.com...
Quote: What is the fastest way to sort two columns and one million rows?
I currently have no database for this application and am wondering if
I should distribute access or sql server express to speed up a sort
routine.
SQL Server will generally yeild better performance results in all areas than
Access will.
Quote: I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column(number) and
I create a Dataview for this and use the dataview sort method. I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.
The time it takes to do the sort is affectd by several things, not just the
..NET object that is containing the data. But, to be sure, sorting a million
records will take some time.
Quote:
What would be faster? I tried Access but with a large number of rows,
the insert slowed down and I could not find a way to do a bulk insert
with access -if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table and it was slow both ways. Does Sql
Server Express support any type of bulk loading?
http://dotnetslackers.com/Community/blogs/xun/archive/2008/04/15/sql-bulk-insert-and-ado-net-sqlbulkcopy.aspx
Quote:
Other than the need to sort two columns, I have a lot of control over
the data source, etc.
Have you considered using a generic Disctionary(Of T) type? These generic
types support sorting and may prove to be quicker than sorting a DataSet via
a DataView.
-Scott |
|
|
| Back to top |
|
|
|
| Peter Duniho... |
Posted: Fri Oct 30, 2009 6:27 pm |
|
|
|
Guest
|
Bill wrote:
Quote: What is the fastest way to sort two columns and one million rows?
In-memory quicksort. Which is what you get if you use .NET's built-in
sorting methods.
Quote: I currently have no database for this application and am wondering if
I should distribute access or sql server express to speed up a sort
routine.
If you need a database-like API for other reasons, using SQL Server
Express or similar could be helpful. I doubt it would speed up your
sort times though. Even if you could host the database on a different,
much faster computer, the i/o adding data to the database would probably
be prohibitive, consuming whatever gain you got by offloading the sort
routine.
Quote: I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column(number) and
I create a Dataview for this and use the dataview sort method.
Why are you using a DataSet and DataView? Certainly one possible issue
is the use of overly complex data structures like that, rather than a
simply array or List<T>. If all you are doing is parsing input,
sorting, and then writing it back out, I would expect DataSet to be
overkill and possibly less efficient.
Quote: I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.
Frankly, a few minutes to sort a million records sounds pretty
reasonable to me. But then, I'm obviously a bit out-of-date. I did a
quick test using the plain Array.Sort() method, and I can sort 1 million
records in under 2 seconds.
Seems to me the most direct way to improve performance is to stop
sorting in the context of a database. :)
Quote: What would be faster? I tried Access but with a large number of rows,
the insert slowed down and I could not find a way to do a bulk insert
with access -if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table and it was slow both ways. Does Sql
Server Express support any type of bulk loading?
Other than the need to sort two columns, I have a lot of control over
the data source, etc.
How much control? Can you get the data to appear in the original text
file already in sorted order? That would allow for the best
performance. :)
Pete |
|
|
| Back to top |
|
|
|
| Vista Succubus Hunter... |
Posted: Fri Oct 30, 2009 6:30 pm |
|
|
|
Guest
|
Bill wrote:
Quote: What is the fastest way to sort two columns and one million rows?
I currently have no database for this application and am wondering if
I should distribute access or sql server express to speed up a sort
routine.
I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column(number) and
I create a Dataview for this and use the dataview sort method. I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.
What would be faster? I tried Access but with a large number of rows,
the insert slowed down and I could not find a way to do a bulk insert
with access -if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table and it was slow both ways. Does Sql
Server Express support any type of bulk loading?
Other than the need to sort two columns, I have a lot of control over
the data source, etc.
I would just use a List<T>.Sort.
<http://dotnetslackers.com/Community/blogs/simoneb/archive/2007/06/20/How-to-sort-a-generic-List_3C00_T_3E00_.aspx>
The Product class/object would be what you're sorting. Your sort object
will be a name you make/call it.
var products = New List<Product>;
in your reader loop of reading each record.
Loop
var product = new Product();
product.Name = "Test";
products.Add(product);
end loop
products.Sort();
foreach(var prod in products)
{
string name = prod.Name;
and do other processing
}
products.Clear();
You start it all over again.
Why do you need a database to do sort? It don't get any faster than a
sort in memory, which is what you're doing with List<T>.Sort().
And you don't have to use or deploy a database. |
|
|
| Back to top |
|
|
|
| Peter Duniho... |
Posted: Fri Oct 30, 2009 8:36 pm |
|
|
|
Guest
|
Bill wrote:
Quote: Pete,
I have to keep the two data items(columns) together, which is why I
used a dataset/dataview.The sort must be on the first column only. I
tried doing this with a custom sort routine within the List class sort
method, but that was even slower than the dataset. I will look into
the Dictionary class to see what it can do.
You must have a really slow computer, or you didn't implement the
List<T>-based sort correctly. On my computer (32-bit Windows 7 running
in a VM on a MacBook Pro Core 2 Duo 2.33) I can sort an array of 10
million int/string pairs in 20 seconds.
I also don't understand what you mean by using Dictionary. The
Dictionary<TKey, TValue> class doesn't sort items. The
SortedDictionary<TKey, TValue> class does, but it uses a binary tree
which has _at best_ the same performance characteristics as the
quicksort implementation of Array.Sort() or List<T>.Sort(), and since
you have all the overhead of the internal memory objects and operations
of the SortedDictionary<TKey, TValue> class, it's actually going to be
quite a bit slower.
The fact that you need to keep the two data items together isn't
relevant. You can easily create a simple struct to pair them, or just
use the KeyValuePair<TKey, TValue> struct if you like.
I suspect that using an array and sorting it is going to be, by far, the
most efficient way to solve your problem.
Pete |
|
|
| Back to top |
|
|
|
| Göran Andersson... |
Posted: Sat Oct 31, 2009 11:13 am |
|
|
|
Guest
|
Scott M. wrote:
Quote: Have you considered using a generic Disctionary(Of T) type? These generic
types support sorting and may prove to be quicker than sorting a DataSet via
a DataView.
A Dictionary<T> can't be sorted as it doesn't preserve the order of the
elements...
--
Göran Andersson
_____
http://www.guffa.com |
|
|
| Back to top |
|
|
|
| Göran Andersson... |
Posted: Sat Oct 31, 2009 11:29 am |
|
|
|
Guest
|
Bill wrote:
Quote: What is the fastest way to sort two columns and one million rows?
I currently have no database for this application and am wondering if
I should distribute access or sql server express to speed up a sort
routine.
I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column(number) and
I create a Dataview for this and use the dataview sort method. I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.
What would be faster? I tried Access but with a large number of rows,
the insert slowed down and I could not find a way to do a bulk insert
with access -if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table and it was slow both ways. Does Sql
Server Express support any type of bulk loading?
Other than the need to sort two columns, I have a lot of control over
the data source, etc.
Using a DataTable means a lot of overhead, compared to objects
specifically created to hold the data that you have.
I made this test class for holding an item:
private class Test {
public int Id { get; private set; }
public string Name { get; private set; }
public Test(int id, string name) {
Id = id;
Name = name;
}
}
Using this code I created a million random items, that I then sorted on
the numerical and textual properties:
List<Test> list = new List<Test>();
Random rnd = new Random();
for (int i = 0; i < 1000000; i++) {
list.Add(new Test(rnd.Next(), rnd.Next().ToString()));
}
Stopwatch w = new Stopwatch();
w.Start();
list.Sort((x, y) => x.Id.CompareTo(y.Id));
w.Stop();
Console.WriteLine("{0} ms.", w.ElapsedMilliseconds);
w.Reset();
w.Start();
list.Sort((x, y) => x.Name.CompareTo(y.Name));
w.Stop();
Console.WriteLine("{0} ms.", w.ElapsedMilliseconds);
Output:
881 ms.
4514 ms.
So, what you should expect for sorting a million objects in a List<T> is
a few seconds, not a few minutes.
--
Göran Andersson
_____
http://www.guffa.com |
|
|
| Back to top |
|
|
|
| vanderghast... |
Posted: Mon Nov 02, 2009 5:18 pm |
|
|
|
Guest
|
From a not so long away discussion we got, dotNet does not use the quicksort
method, or at least, not the basic algorithm of the method The basic
quicksort method is a divide and conquer method which does not compare more
than once any two elements together: in a first pass, only the pivot is
involve in comparison with other elements and once the pivot is positioned,
the two partitions only deal with their members, not with the pivot, neither
with the elements of the other partition. It has been established (by code
supplied by Random Coder) that dotNet Sort may compare two given elements
more than once, so dotNet Sort does not use the basic quicksort method,
although it may be strongly inspired from that method.
Vanderghast, Access MVP
"Peter Duniho" <no.peted.spam at (no spam) no.nwlink.spam.com> wrote in message
news:u$GVtDcWKHA.4360 at (no spam) TK2MSFTNGP04.phx.gbl...
Quote: Bill wrote:
(...)
In-memory quicksort. Which is what you get if you use .NET's built-in
sorting methods.
(...) |
|
|
| Back to top |
|
|
|
| Peter Duniho... |
Posted: Mon Nov 02, 2009 6:42 pm |
|
|
|
Guest
|
vanderghast wrote:
Quote: From a not so long away discussion we got, dotNet does not use the
quicksort method, or at least, not the basic algorithm of the method The
basic quicksort method is a divide and conquer method which does not
compare more than once any two elements together: in a first pass, only
the pivot is involve in comparison with other elements and once the
pivot is positioned, the two partitions only deal with their members,
not with the pivot, neither with the elements of the other partition. It
has been established (by code supplied by Random Coder) that dotNet Sort
may compare two given elements more than once, so dotNet Sort does not
use the basic quicksort method, although it may be strongly inspired
from that method.
From the documentation of Array.Sort(): "This method uses the QuickSort
algorithm".
From the documentation of List<T>.Sort(): "This method uses Array.Sort,
which uses the QuickSort algorithm."
I'm not sure which discussion you're talking about, but I don't recall
one in which it was established that a) .NET compare non-pivot elements
to each other at all, nor that b) a quicksort necessarily implies that
any given element is compared with any other given element only once.
Bear in mind that in quicksort, the pivot has to wind up in one or the
other of the partitions. Whether or not it winds up as the pivot again,
the pivot will necessarily be compared to at least one other element it
had already been compared with (i.e. at a minimum, the new pivot for
that partition, and every element in the partition if it should happen
to be selected as the pivot again).
The discussion I recall hinged on whether it was okay for a comparison
function to return random results. The answer is, of course, no. It
may not return random results because even if there was such a thing as
a perfect sort in which no two elements are ever compared more than
once, the comparison results need to be transitively consistent in order
for the sort algorithm to work (i.e. if A < B and B < C, then when you
compare A to C, it must compare as less than C).
In any case, the documentation clearly states the sort algorithm used,
and this can easily be verified by examining the implementation, either
via Reflector or Microsoft's published source code for .NET.
Pete |
|
|
| Back to top |
|
|
|
|
|
All times are GMT - 5 Hours
The time now is Sat Dec 05, 2009 1:37 pm
|
|