Open Food Facts Database

We mainly use the Open Food Facts database, a collaborative, open-access database. Users contribute by photographing a food product and its list of nutrients, which makes useful information available to everyone. At the time of writing, the database contains more than 714,000 products, and this number is growing rapidly.

This collaborative approach makes it possible to gather a huge amount of information quickly. However, it also has drawbacks. User entries can be incorrect, inaccurate or simply incomplete, so the database is far from complete. Many products lack full information, either because the user did not photograph the list of ingredients or because the information is missing from the packaging (not public, different laws, etc.). The Open Food Facts terms of use clearly state that the database may contain errors. As a consequence, the data can only be used for informational purposes, not for medical purposes.

Content of the database

We decided to focus on only part of the information available in the database. We will use:

Detection of errors

Within this subset of information, we found several kinds of errors.

Some values were negative, which is inconsistent with nutritional values. For energy, we decided to treat products with more than 4,000 kJ as errors. We derived this threshold from the distribution of energy values, using the mean and 0.17 times the standard deviation. In some products the salt-to-sodium ratio of 2.5 is not respected, and in others the saturated fat exceeds the total fat, which is impossible. We also calculated a theoretical energy from fat, carbohydrates, protein and fibre to detect and correct these inconsistencies. These detections are shown in the Data visualization section of our data story.
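The checks described above can be sketched in a few lines of Python. This is only an illustration, not our actual pipeline: the field names (energy_kj, saturated_fat, etc.) and the 20% tolerance are hypothetical, and the energy conversion factors are the usual EU labelling values (37 kJ/g for fat, 17 kJ/g for carbohydrates and protein, 8 kJ/g for fibre), which are an assumption here. All values are taken per 100 g.

```python
import math

ENERGY_MAX_KJ = 4000        # threshold stated in the text (per 100 g)
SALT_PER_SODIUM = 2.5       # salt-to-sodium ratio stated in the text
# Assumed conversion factors in kJ per gram (EU labelling values, not from the source).
KJ_PER_GRAM = {"fat": 37, "carbohydrates": 17, "proteins": 17, "fiber": 8}


def theoretical_energy(product):
    """Energy in kJ estimated from the macronutrients that are present."""
    return sum(factor * (product.get(name) or 0) for name, factor in KJ_PER_GRAM.items())


def find_errors(product, tolerance=0.2):
    """Return a list of labels for the inconsistencies found in one product."""
    # Keep only numeric fields; missing values (None) are ignored.
    values = {k: v for k, v in product.items() if isinstance(v, (int, float))}
    errors = [f"negative {name}" for name, v in values.items() if v < 0]

    if values.get("energy_kj", 0) > ENERGY_MAX_KJ:
        errors.append("energy above threshold")

    if "salt" in values and "sodium" in values:
        expected_salt = SALT_PER_SODIUM * values["sodium"]
        if not math.isclose(values["salt"], expected_salt, rel_tol=tolerance, abs_tol=0.01):
            errors.append("salt/sodium ratio")

    if "saturated_fat" in values and "fat" in values:
        if values["saturated_fat"] > values["fat"]:
            errors.append("saturated fat above total fat")

    # Compare declared and theoretical energy only when the main macronutrients are known.
    if all(k in values for k in ("energy_kj", "fat", "carbohydrates", "proteins")):
        estimate = theoretical_energy(values)
        if not math.isclose(values["energy_kj"], estimate, rel_tol=tolerance, abs_tol=1.0):
            errors.append("energy differs from theoretical value")

    return errors


suspicious = {"energy_kj": 4500, "fat": 10, "saturated_fat": 12, "carbohydrates": 50,
              "proteins": 5, "fiber": 2, "salt": 1.0, "sodium": 0.1}
print(find_errors(suspicious))
```

Returning a list of labels rather than a single boolean keeps each kind of error separate, so that each one can be counted and corrected on its own.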

Filling

Correcting these errors already allowed us to fill in part of the database. However, many values were still missing, so we built a filling algorithm based on the product tags.

For each product to be completed, we go through its tags from the least frequent (and therefore presumably the most specific) to the most frequent, and look at the other products in the database that share the same tag. From these similar products, we compute the median of each nutritional value and use it to fill in any value the product is missing. We went through several versions of this algorithm. The current one is independent of the filling itself: the medians are computed for all products before any missing value is filled in, so filled-in values never influence them. This also reduced the complexity of the algorithm.
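A minimal sketch of this tag-based filling, under some assumptions: products are plain dictionaries with a hypothetical "tags" list, missing values are stored as None, and ties between equally frequent tags are broken by their order in the list. It is meant to illustrate the idea rather than reproduce our exact implementation.

```python
from collections import Counter
from statistics import median


def fill_missing(products, fields):
    """Return copies of the products with missing fields filled by tag medians."""
    # How often each tag occurs: rare tags are assumed to be the most specific.
    tag_counts = Counter(tag for p in products for tag in p["tags"])

    # Collect the known values per (tag, field), using only the original data.
    known = {}
    for p in products:
        for tag in p["tags"]:
            for field in fields:
                if p.get(field) is not None:
                    known.setdefault((tag, field), []).append(p[field])

    # All medians are computed before any value is filled in.
    medians = {key: median(vals) for key, vals in known.items()}

    filled = []
    for p in products:
        q = dict(p)
        for field in fields:
            if q.get(field) is None:
                # Try the tags from least to most frequent and stop at the first match.
                for tag in sorted(p["tags"], key=lambda t: tag_counts[t]):
                    if (tag, field) in medians:
                        q[field] = medians[(tag, field)]
                        break
        filled.append(q)
    return filled


products = [
    {"name": "a", "tags": ["snacks", "chocolate"], "sugars": 50.0},
    {"name": "b", "tags": ["snacks", "chocolate"], "sugars": 54.0},
    {"name": "c", "tags": ["snacks"], "sugars": 10.0},
    {"name": "d", "tags": ["snacks", "chocolate"], "sugars": None},
    {"name": "e", "tags": ["snacks", "crisps"], "sugars": None},
]
result = fill_missing(products, ["sugars"])
print(result[3]["sugars"], result[4]["sugars"])
```

Here product "d" takes the median of the two other "chocolate" products, while "e" falls back to the broader "snacks" tag because no other product tagged "crisps" has a known value.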