Channel: Match observations in two dataframes in R - Stack Overflow

Match observations in two dataframes in R


I have two dataframes. I want to use the elements of one dataframe to search a column of the other, narrow the second dataframe down to the rows that match, and then keep narrowing it element by element. The sample data below should explain this better.

df1
   col1
1  apples
2  oranges
3  apples
4  banana
5  grapes
6  mangoes
7  oranges
8  banana

df1 has only one column, while df2 has two: setID and col1.

df2
    setID  col1
1   1      apples
2   1      oranges
3   1      oranges
4   1      mangoes
5   1      grapes
6   1      banana
7   1      banana
8   1      apples
10  2      apples
11  2      oranges
12  2      apples
13  2      banana
14  2      grapes
15  2      mangoes
16  2      banana
17  2      oranges
18  3      apples
19  3      banana
20  3      oranges
21  3      apples
22  3      grapes
23  3      mangoes
24  3      oranges
25  3      banana
26  4      apples
27  4      oranges
28  4      apples
29  4      grapes
30  4      grapes
31  4      oranges
32  4      banana
33  4      banana
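For anyone who wants to reproduce this, the two dataframes can be built roughly as follows. This is only a sketch of the sample data: the name `sets` is my own and not part of the original data, and the row names will come out as 1 to 32 rather than following the numbering in the listing (which skips 9).

```r
# df1: the query sequence, one item per row
df1 <- data.frame(
  col1 = c("apples", "oranges", "apples", "banana",
           "grapes", "mangoes", "oranges", "banana"),
  stringsAsFactors = FALSE
)

# df2: four sets of eight items each, in the order they appear in the listing
df2 <- data.frame(
  setID = rep(1:4, each = 8),
  col1 = c(
    "apples", "oranges", "oranges", "mangoes", "grapes", "banana", "banana", "apples",   # set 1
    "apples", "oranges", "apples", "banana", "grapes", "mangoes", "banana", "oranges",   # set 2
    "apples", "banana", "oranges", "apples", "grapes", "mangoes", "oranges", "banana",   # set 3
    "apples", "oranges", "apples", "grapes", "grapes", "oranges", "banana", "banana"     # set 4
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper object: one character vector per setID,
# keeping the order of the items inside each set
sets <- split(df2$col1, df2$setID)
```

Splitting by setID keeps each set as an ordered vector, which is convenient because the order inside a set matters here.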

As you can see, the setIDs repeat; each distinct setID marks one set, and the order of the elements within a set is important. Please note that df1$col1 does not have to be the same length as a set from df2, nor does it have to be an exact match. It only has to be a close enough match. In this case the closest match to df1$col1 is df2$setID = 2, with only the last two elements out of order.

The reason it doesn't have to be an exact match is that I want to use a "search as you type" approach. I do not want to match df1$col1 as a whole against a setID in df2. I want to narrow down the possible sets by going through df1 element by element. Assume that you receive the elements of df1 one by one and not as a complete dataframe. For example:

Find a match for df1$col1[1] in df2 and save every set that contains the match to a tempdf. It doesn't matter if df1$col1[1] is found more than once in the same set. If it is found at least once, that set is added to tempdf.

What needs to be retrieved at the end is the setID of the set that matches df1 most closely. In this case tempdf will be the same as df2, because all the sets include "apples". Next, df1$col1[2] is matched against tempdf, given that the first element already matched. In other words, I guess, df1$col1[1:2] is matched against tempdf. This results in:

tempdf
    setID  col1
1   1      apples
2   1      oranges
3   1      oranges
4   1      mangoes
5   1      grapes
6   1      banana
7   1      banana
8   1      apples
10  2      apples
11  2      oranges
12  2      apples
13  2      banana
14  2      grapes
15  2      mangoes
16  2      banana
17  2      oranges
26  4      apples
27  4      oranges
28  4      apples
29  4      grapes
30  4      grapes
31  4      oranges
32  4      banana
33  4      banana

Basically, setID = 3 is omitted. Continuing with the 3rd element of df1, the new tempdf will contain only setIDs 2 and 4. The loop (my idea for solving this) would end once only one setID remains, in this case setID = 2. setID = 2 would therefore be considered the close match for df1.
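To make the intended narrowing concrete, here is a rough sketch of the loop in base R. It rests on an assumption that is not settled above: a set "matches" the first k elements of df1 if they appear as a contiguous run anywhere in that set. The names `query`, `contains_run` and `narrow_sets` are made up for illustration, and this is one possible reading of the problem rather than a definitive solution.

```r
# Sample data from the question, stored compactly
query <- c("apples", "oranges", "apples", "banana",
           "grapes", "mangoes", "oranges", "banana")   # df1$col1
df2 <- data.frame(
  setID = rep(1:4, each = 8),
  col1 = c(
    "apples", "oranges", "oranges", "mangoes", "grapes", "banana", "banana", "apples",
    "apples", "oranges", "apples", "banana", "grapes", "mangoes", "banana", "oranges",
    "apples", "banana", "oranges", "apples", "grapes", "mangoes", "oranges", "banana",
    "apples", "oranges", "apples", "grapes", "grapes", "oranges", "banana", "banana"
  ),
  stringsAsFactors = FALSE
)

# One ordered character vector per setID
sets <- split(df2$col1, df2$setID)

# TRUE if `prefix` occurs as a contiguous run somewhere in `items`
# (assumption: this is what "match" means for more than one element)
contains_run <- function(items, prefix) {
  n <- length(prefix)
  if (n > length(items)) return(FALSE)
  starts <- seq_len(length(items) - n + 1)
  any(vapply(starts,
             function(s) all(items[s:(s + n - 1)] == prefix),
             logical(1)))
}

# Feed the query one element at a time and drop sets that stop matching
narrow_sets <- function(query, sets) {
  candidates <- names(sets)
  for (k in seq_along(query)) {
    prefix <- query[seq_len(k)]
    keep <- vapply(sets[candidates], contains_run, logical(1), prefix = prefix)
    # If no remaining set matches the longer prefix, keep the previous candidates
    if (!any(keep)) break
    candidates <- candidates[keep]
    # Stop as soon as a single set is left
    if (length(candidates) == 1) break
  }
  candidates
}

result <- narrow_sets(query, sets)
result  # "2"
```

With the sample data the candidates shrink from all four sets, to sets 1, 2 and 4, to sets 2 and 4, and finally to set 2 after the fourth element. The loop stops there, so the out-of-order tail of set 2 is never compared. Keeping the previous candidates when nothing matches is a design choice of this sketch for the "close enough" requirement.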

Of course, feel free to advise on a better approach than this one.

