Talk:Merge algorithm


Nomenclature issue[edit]

I would like to know whether it is standard nomenclature to call the following "merge algorithms".

  • given a set of current account balances and a set of transactions, both sorted by account number, produce the set of new account balances after the transactions are applied; this requires always advancing the "new transactions" pointer in preference to the "account number" pointer when the two have equal keys, and adding all the numbers on either tape with the same account number to produce the new balance.
  • produce a sorted list of records with keys present in all the lists (equijoin); this requires outputting a record whenever the keys of all the p0..n are equal.
  • similarly for finding the largest number on one tape smaller than each number on another tape (e.g. to figure out what tax bracket each person is in).
  • similarly for computing set differences: all the records in one list with no corresponding records in another.
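As an illustration of the last kind of merge-like routine listed above, a sorted set difference can be sketched in Python roughly as follows (the function name is illustrative, not from the article):

```python
def sorted_set_difference(a, b):
    """All items of sorted list a with no equal item in sorted list b."""
    out = []
    i = j = 0
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])   # nothing left in b can match a[i]
            i += 1
        elif b[j] < a[i]:
            j += 1             # b[j] can no longer match anything in a
        else:
            i += 1             # equal keys: drop a[i]
    return out

print(sorted_set_difference([1, 2, 4, 7], [2, 3, 7]))  # [1, 4]
```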

I checked several books on algorithms and in all of them "merge algorithm" is taking two or more ordered lists and producing one ordered list in linear time. Maybe the above algorithms could be in another article.
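For reference, that textbook meaning (combining two ordered lists into one ordered list in linear time) can be sketched in Python as follows (the function name is illustrative):

```python
def merge(a, b):
    """Merge two sorted lists into one sorted list in O(len(a) + len(b)) time."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:          # <= keeps the merge stable
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])             # at most one of these two has items left
    out.extend(b[j:])
    return out
```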

Pablo.cl 02:04, 24 May 2004 (UTC)[reply]


The only reference I've found so far that calls set difference a merge algorithm is Kragen Sitaker. I think the nomenclature is misleading, and that the examples in italics above, and also the middle one I copy below, should be in a separate article, tentatively called Sorted list algorithms.

I copy from kragen-hacks:

Some examples of merge algorithms:

  • produce a single sorted list from multiple sorted lists. (This is the kind of merge used in mergesort.)
  • produce the set union, set intersection, or set difference of two sorted lists. [Set intersection and set difference would be moved to Sorted list algorithms.]
  • given a master file and an update file sorted in the same order, produce a new master file with all the updates applied.

Pablo.cl 01:50, 26 May 2004 (UTC)[reply]

Nomenclature response[edit]


Pablo may be correct; perhaps updating a sorted master file from a sorted list of updates is not a common meaning for the term "merge", but the term "merge" does seem to be used for it in the wild ([1] says:

 Batch Processing
 Batch processing occurs when files are processed off-line, i.e. 
 they are saved until some convenient time (e.g. at night) when the 
 computer system is not otherwise being used.
 
 Updating files: the updates, i.e. the changed records, are "batched" together, 
 i.e. saved, and sorted into key order.   Then the master file is read, one 
 record at a time, until a record with the same key as the first transaction is 
 found.  This master file record is then updated and then the process is repeated.
 The updating is done as follows:
 * For magnetic tape. Put the updates in a separate file and merge the master 
   and the update to give a new master.
 * For a disk, the easiest method is to associate a pointer with each record 
   leading to the next one (linked list). Individual records can be scrubbed 
   and replaced by reseting pointers. 
 In general, if only one master file is available, it is not altered, but a 
 new master is produced by merging.

Clearly in this context the new master file should not contain both the record from the old master file and the record from the update file, which would be the necessary consequence of interpreting "merge" as "produce multiset union".)
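One way the master-file update described in that quote could look, under the assumption that records are (key, value) pairs and an update record replaces the matching master record (all names here are illustrative):

```python
def apply_updates(master, updates):
    """Produce a new master from two files sorted by key; an update record
    with the same key replaces the old master record (no duplicate kept)."""
    new_master = []
    i = j = 0
    while i < len(master) and j < len(updates):
        mkey, ukey = master[i][0], updates[j][0]
        if mkey < ukey:
            new_master.append(master[i])
            i += 1
        elif ukey < mkey:
            new_master.append(updates[j])   # a brand-new record
            j += 1
        else:
            new_master.append(updates[j])   # update wins; old record dropped
            i += 1
            j += 1
    new_master.extend(master[i:])
    new_master.extend(updates[j:])
    return new_master
```

Note that on equal keys only the update record is emitted, which is exactly where this differs from a plain multiset-union merge.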

And perhaps the Sort-merge join in Oracle and other RDBMSs' query evaluation isn't a common meaning for the term either; but certainly the merge in that case isn't concerned with producing a multiset union, but a subset of the multiset cartesian product.

Certainly when Knuth mentions "the idea of merging" on page 385 of volume 3 2ed, he's talking about merging two sorted sequences (of punched cards) to get a sorted sequence containing all the items of both sequences --- a sorted multiset union. Pages 197-207 also appear to be talking about producing the multiset union.

In any case, each of these things I called "merge algorithms" when I wrote this page can be conceptualized as the composition of some other function over sequences and the ordinary sorting merge, the one that produces a multiset union.

The kragen-hacks reference carries no weight in this case, because kragen-hacks is written by the same person who promulgated this questionable nomenclature in this Wikipedia article in the first place: me. If we decide to accept the term "merge" for this larger class of algorithms beyond just multiset union, it would have to be on the basis of documented terminology usage by people besides me. In my eyes, the fact that Knuth uses the term with a clearly more restricted meaning in mind creates a substantial burden of proof.

-- User:Kragen

Minor change to python example[edit]

The example originally had a[0] < b[0] rather than a[0] <= b[0] as the basis for using the item from the a[] array. This would not result in a stable sort, since equal elements would have been selected from the b[] array before those from the a[] array. The change ensures stability.

It's a bit of a shame that one sees so many assertions that merge sort is stable. As the uncorrected example shows, it's easily possible to write an unstable, but otherwise valid, merge sort. -- jpl
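A small Python demonstration of this stability point, with the comparison passed in so both variants share the same merge loop (all names illustrative):

```python
def merge(a, b, le):
    """Merge sorted lists a and b, taking from a whenever le(a[i], b[j]) holds."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if le(a[i], b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]

# Records with equal keys; the letter marks which input they came from.
a = [(1, 'a'), (2, 'a')]
b = [(2, 'b')]

stable   = merge(a, b, lambda x, y: x[0] <= y[0])
unstable = merge(a, b, lambda x, y: x[0] <  y[0])
print(stable)    # [(1, 'a'), (2, 'a'), (2, 'b')] -- a's record kept first
print(unstable)  # [(1, 'a'), (2, 'b'), (2, 'a')] -- b's record jumps ahead
```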

about the examples[edit]

I removed the comments about the STL merge implementation being more efficient than the presented source code because there's no guarantee of the STL's implementation performance, or approachability, for any specific task. The assertion I removed doesn't apply in all cases, and can't be presented as fact without much more context and discussion.

Meanwhile, I am wondering why the examples don't implement the pseudo code presented at the beginning of the article. The examples have no comments and don't explain what their parameters are. The C and C++ versions don't bother to check the length of the target array.

Should I add comments, and rewrite the samples to more closely match the pseudocode? Mikeblas 21:48, 28 December 2005 (UTC)[reply]

Sample code[edit]

I intend to remove the sample implementations section if and when Wikipedia:Articles for deletion/Insertion sort implementations is successful. It may be worthwhile to transwiki them to WikiBooks, WikiSource, or the Literate Programming wiki if anyone feels strongly about them. —donhalcon 06:09, 5 March 2006 (UTC)[reply]

That AFD was successful, but there are still a couple of sample implementations. What's going on with requiring references for sample code in articles? -- Mikeblas 15:43, 5 September 2007 (UTC)[reply]
  • Despite the 2006 debate, sample implementations in the form of pseudocode to explain algorithms are part of both the insertion sort and selection sort articles. Many other wiki articles use pseudocode as a method to explain algorithms in general: sorting, data compression, error correction codes, data encryption, the extended Euclidean algorithm, ... . I don't see why merge algorithm should be singled out for not having at least some explanation of the algorithm in the form of pseudocode. Rcgldr (talk) 19:00, 29 November 2015 (UTC)[reply]

BEST KNOWN MERGE ALGORITHM[edit]

I would like to know which is the fastest of the known merge algorithms; I've read a lot, but I didn't find an answer. Trying to find an optimal sorting algorithm, I came to the conclusion that the optimal merge is a generalization of the binary search algorithm. Does anyone think that may be true, or is it even already known? I've written this algorithm in Ada, and if someone finds the best known merge algorithm, I will benchmark my proof of concept against it. Meanwhile, I will try to compare it to the classic merge.--Azrael60 18:55, 26 September 2007 (UTC)[reply]

As a side note, I think the Hwang-Lin binary merging algorithm should be mentioned, as well as the information-theoretic lower bound on worst-case comparisons.--Azrael60 (talk) 16:05, 13 January 2008 (UTC)[reply]

Applications - statement about k-way sorts in memory[edit]

This section includes the statement: A k-way in-memory merge sort can also be constructed from a k-way merge algorithm, but this provides no benefits over the ordinary binary merge sort. On a processor with 16 registers, such as a PC in 64-bit mode, a 4-way merge sort using 10 or so pointers that a compiler optimizes by keeping all of them in registers will be somewhat faster than a binary merge sort, and about as fast as quicksort. PC-based benchmarks showed the 4-way merge sort was about 15% faster than binary merge sort. Rcgldr (talk) 19:19, 29 November 2015 (UTC)[reply]

I made the statement more precise by specifying that the number of comparisons is being counted. QVVERTYVS (hm?) 19:55, 29 November 2015 (UTC)[reply]
I would recommend just removing the statement. A 4-way merge averages 3 compares for every move, versus a 2-way merge's 1 compare for every move, but the 4-way merge does half the moves. So the 4-way merge performs 1.5 times as many compares as the 2-way merge, but half as many moves. This is probably too much detail for the main article, which is about merge algorithms as opposed to sort algorithms. Rcgldr (talk) 22:26, 3 December 2015 (UTC)[reply]
Removed it. We might want to add a section about this to merge sort later. QVVERTYVS (hm?) 09:49, 4 December 2015 (UTC)[reply]
  • To clarify - a k-way in-memory merge sort (using a min-heap, recursion, or something similar to reduce k-1 compares to log2(k) compares) has the same time complexity as a 2-way merge sort. For n a power of k, the complexity is n logk(n) log2(k) = n log2(n), the same as 2-way regardless of k. However, the statement said "no benefit" as opposed to "same time complexity". Even with simple k-1 compare logic (3 compares per move), a 4-way merge sort of integers ends up slightly faster than a 2-way merge sort, at least on a PC in 64-bit mode (16 registers). The situation would probably be reversed when sorting an array of pointers to objects, since the compare overhead would exceed the move overhead. The other issue is that a statement about sort complexity belongs in the sort articles, not in a merge article. Rcgldr (talk) 13:36, 7 December 2015 (UTC)[reply]
What the original source cares about (and what I sloppily summarized as "no benefit") is the exact expected number of comparisons, not the big-O complexity. QVVERTYVS (hm?) 15:22, 7 December 2015 (UTC)[reply]
The number of moves is also a factor, but the total number of operations remains the same: for k=2, n log2(n) compares + n log2(n) moves = 2 n log2(n) total operations; for k=4, 1 1/2 n log2(n) compares + 1/2 n log2(n) moves = 2 n log2(n) total operations. On my system, an Intel 2600K at 3.4 GHz in 64-bit mode (16 registers), a 4-way merge sort of pseudo-random 64-bit unsigned integers is about 12% to 15% faster, depending on the number of integers (12% for 4 million, 15% for 64 million). The 4-way merge sort on integers is about as fast as quicksort. The relative performance is affected by the overhead of compares versus the overhead of moves. Rcgldr (talk) 18:28, 7 December 2015 (UTC)[reply]
  • Computational complexity is not about efficient use of registers on a particular machine. Furthermore, comparisons are usually not simple integer operations even though books on algorithms use that for clarity. Glrx (talk) 04:24, 9 December 2015 (UTC)[reply]
As mentioned above, the issue with the removed statement was how it was worded: "no benefit" as opposed to "same time complexity". And it probably belongs in sort algorithms and/or merge sort, not merge algorithms. Rcgldr (talk) 02:31, 12 December 2015 (UTC)[reply]
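For what it's worth, the min-heap variant of a k-way merge mentioned above can be sketched in Python with the standard heapq module (the function name and list-of-lists input are illustrative):

```python
import heapq

def kway_merge(lists):
    """Merge k sorted lists using a min-heap of (value, list index, position),
    so each item moved costs about log2(k) comparisons."""
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, idx, pos = heapq.heappop(heap)
        out.append(val)
        if pos + 1 < len(lists[idx]):
            # Replace the popped item with the next one from the same list.
            heapq.heappush(heap, (lists[idx][pos + 1], idx, pos + 1))
    return out
```

The list index in the heap tuple doubles as a tie-breaker, so equal values come out in input-list order.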

Fake algorithms[edit]

The source code for the merge_int_dac routine (and consequently for merge_int_dac_mp, too), added in this revision by Mia0303 under the Merge algorithm#Parallel merge section, is incorrect. It would not even compile, due to at least two major errors, let alone run and produce reasonable results!

Even when fixed, it would also be dramatically inefficient. Additionally, it lacks a description of its parameters, so it's hard to tell how it should be invoked. One can try to guess the parameters' meaning, but that would mean reinventing the algorithm rather than learning it, which strongly contradicts encyclopedic style. --CiaPan (talk) 08:52, 9 February 2016 (UTC)[reply]

I've replaced the whole section by a summary of CLRS's parallel merge. QVVERTYVS (hm?) 10:59, 9 February 2016 (UTC)[reply]

Merge K-way merge algorithm article into this one[edit]

The article K-way merge algorithm largely duplicates content in the K-way merging section in this article, so I think it should be merged (this is not just a pun, I promise). I also suspect that the topic is not notable enough to merit its own article. — PCB 21:59, 9 December 2017 (UTC)[reply]

I agree, the two articles have far too much overlap to be separate. Possibly if this article were expanded in a different direction the topics could be separately notable, but with the current content they are not. —David Eppstein (talk) 22:13, 9 December 2017 (UTC)[reply]
  • Oppose The article is written in appropriate WP:SUMMARYSTYLE; the K-way merging section summarizes the K-way merge algorithm article, while that article gives significantly more details. Trialpears (talk) 11:25, 15 May 2019 (UTC)[reply]
Closing, given the uncontested policy-based argument. Klbrain (talk) 13:23, 23 June 2019 (UTC)[reply]

Language support - C++ - std::inplace_merge[edit]

The term in place may be misleading here: std::inplace_merge may use auxiliary storage to implement the merge, and it can throw an exception if it can't allocate enough memory.[1] Microsoft / Visual Studio attempts to allocate memory, but can perform the merge even if the allocation fails.[2]

References

Possible typo[edit]

Shouldn't it be s+1 there (in the lower of the two lines beneath)? in parallel do

       merge(A[i...r], B[k...s], C[p...t])
       merge(A[r+1...j], B[s...ℓ], C[t+1...q])

— Preceding unsigned comment added by 37.47.172.117 (talk)

Difference between the merge algorithm and merge sort?[edit]

What is the difference between the merge algorithm and merge sort? The article says that the merge algorithm plays a critical role in the merge sort algorithm, but according to how the article describes the merge algorithm, it seems to me that it is the same thing as merge sort. So, what is the difference between the two algorithms, if there is a difference? I think this should be clarified early on in the article. —Kri (talk) 17:52, 23 August 2023 (UTC)[reply]

@Kri: The merge algorithm takes two sorted chains (lists, sequences, arrays, ...) of data and creates a new chain (sometimes by copying the source items, sometimes by interleaving them, depending on the data structure in use) containing all items of both source chains in the proper order. Merge sort takes an unsorted chain of data, decomposes it into multiple chains and successively merges them to construct a final sorted chain of all the source data. --CiaPan (talk) 20:14, 23 August 2023 (UTC)[reply]
Yes. The merge algorithm is a subroutine in merge sort, not the whole merge sort. It is also used as a subroutine in some other algorithms unrelated to sorting. —David Eppstein (talk) 01:14, 24 August 2023 (UTC)[reply]
@David Eppstein: Thank you for clarification. CiaPan (talk) 10:33, 29 August 2023 (UTC)[reply]
Thank you for the answers; that makes sense! I have updated the text under the image in the article to reflect that. —Kri (talk) 19:42, 29 August 2023 (UTC)[reply]
@Kri: What David said about a "subroutine" above applies to the words "successively merges them" in my answer. --CiaPan (talk) 10:33, 29 August 2023 (UTC)[reply]
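To make the subroutine relationship concrete, here is a minimal Python sketch in which merge sort calls the merge algorithm as its inner step (names illustrative):

```python
def merge(a, b):
    """The merge algorithm: combine two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]

def merge_sort(xs):
    """Merge sort: split the input, sort each half, then merge the halves."""
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))
```

Here merge never sorts anything itself; it only relies on its two inputs already being sorted, which merge_sort guarantees by recursion.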